drew2a / ivory-tower

🗼 collection of software development tips and principles

Journal #1

Open drew2a opened 1 year ago

drew2a commented 1 year ago

Short updates. Most of them work-related.

drew2a commented 1 year ago

Issues history in the Tribler repo via https://github.com/MorrisJobke/github-issue-counter

tribler

drew2a commented 1 year ago

An animated history of Tribler from the Beginning of Time to the Present Day. Enjoy watching (the image is clickable):

Animated history of Tribler

drew2a commented 1 year ago

An animated history of Superapp:

Animated history of Superapp

drew2a commented 1 year ago

Awesome Distributed Systems:

drew2a commented 1 year ago

TIL: how to build libtorrent 1.2.18 on Apple Silicon:

git clone https://github.com/arvidn/libtorrent
cd libtorrent
git checkout v1.2.18

brew install python@3.11 openssl boost boost-build boost-python3

python3.11 setup.py build
python3.11 setup.py install
drew2a commented 1 year ago

Trying to untangle the tangle of TorrentChecker for a second day in a row.

In fact, it's even fascinating. I can see how different people have added code over time. Sometimes it was with mistakes. Then other people went in and corrected the errors without understanding the bigger picture. Then there were refactorings... It is fascinating.

https://github.com/Tribler/tribler/pull/7286

drew2a commented 1 year ago

One week with TorrentChecker.

It is mysterious. It is written in a way that shouldn't work. But it works. I've fixed as much as I could and am finishing my PR now.

Thanks to @kozlovsky for his help in unraveling how this works.

I leave here one example of how it is written. All irrelevant parts of functions are replaced by ... to increase readability.


Here is the function check_local_torrents, which makes the call self.check_torrent_health(bytes(random_torrent.infohash)) (it looks like a simple sync call):

    @db_session
    def check_local_torrents(self):
        ...
        for random_torrent in selected_torrents:
            self.check_torrent_health(bytes(random_torrent.infohash))
            ...
        ...

The listing of self.check_torrent_health (which turns out to be async):

    @task
    async def check_torrent_health(self, infohash, timeout=20, scrape_now=False):
        ...

Above we can see a sync call to an async function, which normally wouldn't execute the function body at all: calling a coroutine function only creates a coroutine object. But it's not that simple. This call does execute successfully.

The secret is the @task decorator. It converts an async function into a sync function that starts an async task in the background:

from asyncio import ensure_future, iscoroutinefunction
from functools import wraps


def task(func):
    """
    Register a TaskManager function as an anonymous task and return the Task
    object so that it can be awaited if needed. Any exceptions will be logged.
    Note that if awaited, exceptions will still need to be handled.
    """
    if not iscoroutinefunction(func):
        raise TypeError('Task decorator should be used with coroutine functions only!')

    @wraps(func)
    def wrapper(self, *args, **kwargs):
        return self.register_anonymous_task(func.__name__,
                                            ensure_future(func(self, *args, **kwargs)),
                                            ignore=(Exception,))
    return wrapper

This trick makes the code much harder to read and blurs the sync/async separation in Python.

The source: https://github.com/Tribler/tribler/blob/87916f705eb7e52da828a14496b02db8d61ed5e9/src/tribler/core/components/torrent_checker/torrent_checker/torrent_checker.py#L222-L233
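
To see the effect in isolation, here is a minimal sketch with a simplified stand-in for the decorator. It is not the Tribler or IPv8 implementation: there is no TaskManager, the `Checker` class and its `tasks` set are hypothetical, and `inspect.iscoroutinefunction` replaces the asyncio variant. It only shows that a plain-looking call returns immediately while the body runs later in the background.

```python
import asyncio
import inspect
from functools import wraps


def task(func):
    """Simplified stand-in: schedules the coroutine instead of returning it."""
    if not inspect.iscoroutinefunction(func):
        raise TypeError('Task decorator should be used with coroutine functions only!')

    @wraps(func)
    def wrapper(self, *args, **kwargs):
        # The wrapper is a sync function: it schedules the coroutine and returns at once.
        future = asyncio.ensure_future(func(self, *args, **kwargs))
        self.tasks.add(future)  # keep a strong reference while the task runs
        future.add_done_callback(self.tasks.discard)
        return future
    return wrapper


class Checker:
    """Hypothetical class that mimics the call pattern from the journal entry."""

    def __init__(self):
        self.tasks = set()
        self.checked = []

    @task
    async def check_torrent_health(self, infohash):
        await asyncio.sleep(0)
        self.checked.append(infohash)

    def check_local_torrents(self, infohashes):
        # Looks like a simple sync call, but each call starts a background task.
        for infohash in infohashes:
            self.check_torrent_health(infohash)


async def demo():
    checker = Checker()
    checker.check_local_torrents([b'a', b'b'])
    seen_right_after_call = list(checker.checked)  # the bodies have not run yet
    await asyncio.gather(*checker.tasks)           # wait for the background tasks
    return seen_right_after_call, checker.checked
```

Running `asyncio.run(demo())` shows that nothing is checked right after the sync call and that both infohashes are processed once the event loop gets control.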

drew2a commented 1 year ago

I've found a nice and super simple tool for creating GIFs:

LICEcap can capture an area of your desktop and save it directly to .GIF (for viewing in web browsers, etc) or .LCF (see below).

https://www.cockos.com/licecap/

Example: seeder_leechers

drew2a commented 1 year ago

@kozlovsky handed me a shocking secret:

If you have ever used asyncio.create_task you may have created a bug for yourself that is challenging (read almost impossible) to reproduce. If it occurs, your code will likely fail in unpredictable ways.

The root cause of this Heisenbug is that if you don't hold a reference to the task object returned by create_task then the task may disappear without warning when Python runs garbage collection. In other words, the code in your task will stop running with no obvious indication why.

https://textual.textualize.io/blog/2023/02/11/the-heisenbug-lurking-in-your-async-code/
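
A common way to avoid this is to keep a strong reference to every task until it finishes. The sketch below follows the pattern recommended in the Python asyncio documentation; the names `spawn`, `work` and `background_tasks` are illustrative and not taken from the article or from Tribler.

```python
import asyncio

# Strong references to in-flight tasks. The event loop itself keeps only weak ones.
background_tasks = set()


def spawn(coro):
    """Start `coro` as a task and keep it referenced until it is done."""
    task = asyncio.create_task(coro)
    background_tasks.add(task)
    # Drop the reference when the task finishes so the set does not grow forever.
    task.add_done_callback(background_tasks.discard)
    return task


async def work(results, value):
    await asyncio.sleep(0.01)
    results.append(value)


async def main():
    results = []
    for i in range(3):
        spawn(work(results, i))  # fire and forget, but safely referenced
    # Only for the demo: wait until everything that was spawned has finished.
    await asyncio.gather(*list(background_tasks))
    return results
```

The set is the only thing that changes compared to a bare `asyncio.create_task` call, and it is enough to keep the garbage collector away from pending tasks.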

drew2a commented 1 year ago

Using Environment Protection Rules to Secure Secrets When Building External Forks with pull_request_target: https://dev.to/petrsvihlik/using-environment-protection-rules-to-secure-secrets-when-building-external-forks-with-pullrequesttarget-hci

drew2a commented 1 year ago

AsyncGroup, developed in collaboration with @kozlovsky, is getting better and better 🚀.

It can be used as a lightweight replacement for TaskManager, and it could itself be replaced by native Task Groups in the future (available since Python 3.11).

https://github.com/Tribler/tribler/pull/7306

drew2a commented 1 year ago

I was accidentally invited to Google's programming challenge, but I just completed the first task and I really enjoyed it! It gave me a feeling of nostalgia, like the good old AOC days.

foobar

drew2a commented 1 year ago

TIL:

Web3 Sybil avoidance using network latency. https://www.sciencedirect.com/science/article/pii/S1389128623001469


Vector Clock: https://en.wikipedia.org/wiki/Vector_clock

Bloom Clock: https://arxiv.org/pdf/1905.13064.pdf

The bloom clock is a space-efficient, probabilistic data structure designed to determine the partial order of events in highly distributed systems. The bloom clock, like the vector clock, can autonomously detect causality violations by comparing its logical timestamps. Unlike the vector clock, the space complexity of the bloom clock does not depend on the number of nodes in a system. Instead it depends on a set of chosen parameters that determine its confidence interval, i.e. false positive rate. To reduce the space complexity from which the vector clock suffers, the bloom clock uses a “moving window” in which the partial order of events can be inferred with high confidence. If two clocks are not comparable, the bloom clock can always deduce it, i.e. false negatives are not possible. If two clocks are comparable, the bloom clock can calculate the confidence of that statement, i.e. it can compute the false positive rate between comparable pairs of clocks. By choosing an acceptable threshold for the false positive rate, the bloom clock can properly compare the order of its timestamps, with that of other nodes in a highly accurate and space efficient way.
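
The comparison rule from the abstract can be illustrated with a toy counting Bloom filter. This is a sketch under assumptions, not the paper's reference implementation: the hashing scheme, the default sizes and the `merge` rule (element-wise maximum, by analogy with vector clocks) are my own choices. The false positive rate calculation and the moving window are left out.

```python
import hashlib


class BloomClock:
    """Toy bloom clock: a counting Bloom filter used as a logical timestamp."""

    def __init__(self, size=16, hashes=3, counters=None):
        self.size = size
        self.hashes = hashes
        self.counters = list(counters) if counters is not None else [0] * size

    def _positions(self, event):
        # Derive `hashes` positions from salted SHA-256 digests (an arbitrary choice).
        for salt in range(self.hashes):
            digest = hashlib.sha256(f'{salt}:{event}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def tick(self, event):
        """Record a local event by incrementing its counters."""
        for position in self._positions(event):
            self.counters[position] += 1

    def merge(self, other):
        """Assumed receive rule: take the element-wise maximum of both clocks."""
        self.counters = [max(a, b) for a, b in zip(self.counters, other.counters)]

    def before_or_equal(self, other):
        """True if no counter exceeds the other's. This can be a false positive."""
        return all(a <= b for a, b in zip(self.counters, other.counters))

    def concurrent_with(self, other):
        """Neither clock dominates: the timestamps are certainly not ordered."""
        return not self.before_or_equal(other) and not other.before_or_equal(self)
```

The memory footprint is fixed by `size`, no matter how many nodes take part. The price is that `before_or_equal` can return `True` for unrelated events, while `concurrent_with` never reports a false "not comparable".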

drew2a commented 10 months ago

The Google standard for code review: https://google.github.io/eng-practices/review/reviewer/standard.html

Thus, we get the following rule as the standard we expect in code reviews:

In general, reviewers should favor approving a CL once it is in a state where it definitely improves the overall code health of the system being worked on, even if the CL isn’t perfect.

That is the senior principle among all of the code review guidelines.

A key point here is that there is no such thing as “perfect” code—there is only better code. Reviewers should not require the author to polish every tiny piece of a CL before granting approval. Rather, the reviewer should balance out the need to make forward progress compared to the importance of the changes they are suggesting. Instead of seeking perfection, what a reviewer should seek is continuous improvement. A CL that, as a whole, improves the maintainability, readability, and understandability of the system shouldn’t be delayed for days or weeks because it isn’t “perfect.”

drew2a commented 9 months ago

Today, driven by curiosity, I delved into figuring out how long Tribler has been operational in each specific release. To simplify the task I assumed that the release age is equal to the release branch age.

Surprisingly it turned out to be quite a complex task, as there isn't a straightforward way to calculate the age of a branch (age = last commit - fork commit) available online.

Eventually, I managed to write a script, with assistance from ChatGPT (my concept, its implementation), that can provide these numbers. However, verifying the accuracy of the results is challenging.

Branch: origin/release/7.13
    Latest commit date: 2023-11-14 13:01:47+01:00
    Fork date: 2023-03-16 19:05:54+01:00
    Fork commit: 81eb2495bd173c755ea0175ad30a6d4a37c7bc58
    Age: 242 days
Branch: origin/release/7.12
    Latest commit date: 2022-09-20 13:20:41+03:00
    Fork date: 2022-04-01 11:55:43+02:00
    Fork commit: 50e84df930127c4e63aa0eedbb106252ebab325e
    Age: 172 days
Branch: origin/release/7.11
    Latest commit date: 2021-12-27 20:34:47+01:00
    Fork date: 2021-11-03 16:15:02+01:00
    Fork commit: f94a1e451ba5c32662be4d4ec0c0e30274aa3d77
    Age: 54 days
Branch: origin/release/7.10
    Latest commit date: 2021-08-06 16:59:42+02:00
    Fork date: 2021-05-31 18:54:38+02:00
    Fork commit: 441a7d8fa222254e1e801697c8c1d51cd41dca82
    Age: 66 days
Branch: origin/release/7.9
    Latest commit date: 2021-04-01 14:58:27+02:00
    Fork date: 2021-03-18 19:49:27+01:00
    Fork commit: fc2a6411fa199fa2ec9d81d565d5d9f4f5b5e445
    Age: 13 days
Branch: origin/release/7.8
    Latest commit date: 2021-02-12 13:11:56+01:00
    Fork date: 2021-01-27 11:22:51+01:00
    Fork commit: 86315c39ab2905b602efe96398d89ce594dbfd98
    Age: 16 days
Branch: origin/release-7.6
    Latest commit date: 2020-12-09 11:33:08+01:00
    Fork date: 2020-12-05 18:29:22+01:00
    Fork commit: 0c841fdf36e5497231f6f79d5451e74163a48ac3
    Age: 3 days
Branch: origin/release-7.3.0
    Latest commit date: 2019-08-27 12:42:49+02:00
    Fork date: 2019-07-18 12:14:07+02:00
    Fork commit: 31295da5889400222bf9e7ccebe9002f7b0509fe
    Age: 40 days
image
import datetime
import subprocess

def run_git_command(command, repo_path):
    return subprocess.check_output(command, cwd=repo_path, shell=True).decode('utf-8').strip()

# Repository path
repo_path = 'path/to/tribler/repo'

# Fetch all branches
run_git_command('git fetch --all', repo_path)

# Get all release branches
branches = run_git_command('git branch -r', repo_path).split('\n')
release_branches = [branch for branch in branches if 'origin/release' in branch]

info = {}
for branch in release_branches:
    branch = branch.strip()

    # Find the oldest commit in the main branch that's not in the release branch
    oldest_commit = run_git_command(f'git log --pretty=format:"%h" {branch}..origin/main | tail -1', repo_path)

    # Find the fork commit
    fork_commit = run_git_command(f'git merge-base {oldest_commit} {branch}', repo_path)

    # Get the dates for the fork commit and the latest commit
    fork_commit_date = run_git_command(f'git show -s --format=%ci {fork_commit}', repo_path)
    latest_commit_date = run_git_command(f'git log -1 --format=%ci {branch}', repo_path)

    # Convert dates to datetime objects
    fork_commit_date = datetime.datetime.strptime(fork_commit_date, '%Y-%m-%d %H:%M:%S %z')
    latest_commit_date = datetime.datetime.strptime(latest_commit_date, '%Y-%m-%d %H:%M:%S %z')

    # Calculate the age of the branch in days
    age = latest_commit_date - fork_commit_date
    age_days = age.days
    s = f"Branch: {branch}\n" \
        f"\tLatest commit date: {latest_commit_date}\n" \
        f"\tFork date: {fork_commit_date}\n" \
        f"\tFork commit: {fork_commit}\n" \
        f"\tAge: {age_days} days"

    info[latest_commit_date] = s

for d in sorted(info.keys(), reverse=True):
    print(info[d])
drew2a commented 9 months ago

The annual (seemingly traditional) analysis of the Tribler repo:

Tribler

Prerequisites:

pip install git-of-theseus
git clone https://github.com/Tribler/tribler
git-of-theseus-analyze tribler

Also, I assume that the correct .mailmap is present in the repo.

Results

git-of-theseus-stack-plot cohorts.json

1

git-of-theseus-stack-plot authors.json --normalize

stack_plot

git-of-theseus-stack-plot authors.json 

stack_plot

git-of-theseus-stack-plot exts.json

stack_plot

git-of-theseus-survival-plot survival.json

4

IPv8

Prerequisites:

pip install git-of-theseus
git clone https://github.com/Tribler/py-ipv8
git-of-theseus-analyze py-ipv8

Results

git-of-theseus-stack-plot cohorts.json

1

git-of-theseus-stack-plot authors.json --normalize

2

git-of-theseus-stack-plot authors.json

3

git-of-theseus-stack-plot exts.json

4

git-of-theseus-survival-plot survival.json

5

Acknowledgements

Powered by https://github.com/erikbern/git-of-theseus

synctext commented 9 months ago

Stunning and impressive! @drew2a, could you update the above figures by also adding the cardinal py-IPv8 dependency, together with the Tribler code count and code authorship records (Tribler :heavy_plus_sign: IPv8)? Something magical was born in 2017 :grin:

drew2a commented 9 months ago

Merged Tribler and IPv8

To create plots with merged data, I utilized this branch https://github.com/erikbern/git-of-theseus/pull/70


git-of-theseus-stack-plot tribler/authors.json ipv8/authors.json

stack_plot


git-of-theseus-stack-plot tribler/authors.json ipv8/authors.json --normalize

stack_plot

git-of-theseus-stack-plot tribler/cohorts.json ipv8/cohorts.json

stack_plot


git-of-theseus-stack-plot tribler/survival.json ipv8/survival.json --normalize

survival_plot

drew2a commented 8 months ago

Another question that piqued my curiosity was how the count of Tribler's open bugs changes over time.

The scripts: https://gist.github.com/drew2a/3eec7389359a57737b06c1991bf2c2a3

open_bugs open_bugs_and_releases rainbow_rocket

drew2a commented 8 months ago

Visualization for the open issues (last 60 days):

How to use:

  1. Fetch the issues: https://gist.github.com/drew2a/3eec7389359a57737b06c1991bf2c2a3#file-fetch_issues_and_releases-py
  2. Plot the visualisation: https://gist.github.com/drew2a/3eec7389359a57737b06c1991bf2c2a3#file-plot_issues_last_two_month-py

An example for the Tribler repo: https://github.com/Tribler/tribler issues

drew2a commented 7 months ago

Driven by curiosity about how the number of continuous contributors changes over time for Tribler, I started with a visualization of Tribler contributors over the last year, using a 7-day window and 1-day granularity:

1_year

The "window" refers to the maximum allowed gap between consecutive commits to be considered as part of the same activity period. In this case, a 7-day window means that if the gap between two commits is less than or equal to 7 days, they are considered part of a continuous contribution period. "Granularity" refers to the minimum length of time that a contribution period must be to be considered. Here, a 1-day granularity means that any period shorter than 1 day is extended to 1 day.
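
The two parameters can be sketched in a few lines. This is my reading of the definitions above, not the code from the gist; the function name `contribution_periods` and its arguments are hypothetical.

```python
from datetime import date, timedelta


def contribution_periods(commit_days, window_days, granularity_days):
    """Group commit dates into (start, end) periods of continuous contribution."""
    window = timedelta(days=window_days)
    granularity = timedelta(days=granularity_days)
    periods = []
    for day in sorted(commit_days):
        if periods and day - periods[-1][1] <= window:
            periods[-1][1] = day        # the gap fits the window: extend the period
        else:
            periods.append([day, day])  # the gap is too large: start a new period
    # Stretch periods that are shorter than the granularity to the minimum length.
    return [(start, max(end, start + granularity)) for start, end in periods]


periods = contribution_periods(
    [date(2023, 1, 1), date(2023, 1, 5), date(2023, 1, 20)],
    window_days=7,
    granularity_days=1,
)
```

With a 7-day window, the commits on January 1 and 5 form one period, and the commit on January 20 starts a second one that the 1-day granularity extends to January 21.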

Then I got a visualization of Tribler contributors over the last 5 years, using a 30-day window and 14 days of granularity:

5_years

The same plot but filtered by contributors who contributed at least two days in total:

5_years_2_days

Here, the "at least two days in total" filter means that only contributors who have made commits on two or more separate days throughout the entire period are included.
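
A possible implementation of this filter is sketched below. It assumes the commits are available as a mapping from author to commit timestamps. The names `active_days` and `filter_contributors` are hypothetical and do not come from the gist.

```python
from datetime import datetime


def active_days(commit_times):
    """Count the distinct calendar days on which commits were made."""
    return len({moment.date() for moment in commit_times})


def filter_contributors(commits_by_author, min_days):
    """Keep only the authors who committed on at least `min_days` separate days."""
    return {
        author: commit_times
        for author, commit_times in commits_by_author.items()
        if active_days(commit_times) >= min_days
    }


commits = {
    'alice': [datetime(2023, 1, 1, 9), datetime(2023, 1, 1, 18), datetime(2023, 1, 2, 10)],
    'bob': [datetime(2023, 1, 1, 9), datetime(2023, 1, 1, 18)],
}
regulars = filter_contributors(commits, min_days=2)
```

Here `bob` made two commits, but both on the same day, so only `alice` passes the two-day filter.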

Last 10 years, all contributors that contributed at least 2 days, plotted using a 60-day window and 14-day granularity:

10_years

Contributors from all Tribler history that contributed at least 2 days, plotted using a 90-day window and 30-day granularity:

all_history

Contributors from all Tribler history that contributed at least 90 days, plotted using a 90-day window and 30-day granularity:

all_history_90

In the last plot, the filter is applied to include only those contributors who have a total of at least 90 days of contributions throughout the entire history of Tribler. This filter, combined with a 90-day window and 30-day granularity, provides a long-term perspective on contributor engagement. The 90-day window means that consecutive commits within this period are considered as continuous contributions, while the 30-day granularity extends shorter contribution periods to 30 days, ensuring that each period reflects a significant amount of activity.

These visualizations provide valuable insights into the dynamics of the Tribler project's contributor base, highlighting both short-term and long-term contribution patterns. By adjusting the window and granularity parameters, as well as the contribution duration filter, we can observe different aspects of contributor engagement and project activity over time.

The script: https://gist.github.com/drew2a/b05141a13c8d0c85c041714bba44b2d3#file-plot_number_of_contributors-py

Using the obtained data, it's straightforward to plot the number of contributors over time:

over_time

This graph shows the fluctuation in the number of active contributors at any given time. The number of contributors is calculated based on their continuous activity periods, considering the window and granularity settings used in the analysis.
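
One straightforward way to get such a curve is to count, for every day, how many contributors have a period covering that day. The sketch below assumes the periods are already computed as (start, end) date pairs per author. The function name and the input layout are assumptions, not the gist's actual structure.

```python
from datetime import date, timedelta


def active_contributors(periods_by_author, start, end):
    """Return {day: number of authors with a contribution period covering that day}."""
    counts = {}
    day = start
    while day <= end:
        counts[day] = sum(
            any(begin <= day <= finish for begin, finish in periods)
            for periods in periods_by_author.values()
        )
        day += timedelta(days=1)
    return counts


periods_by_author = {
    'alice': [(date(2023, 1, 1), date(2023, 1, 3))],
    'bob': [(date(2023, 1, 2), date(2023, 1, 2))],
}
counts = active_contributors(periods_by_author, date(2023, 1, 1), date(2023, 1, 4))
```

For this input, the daily counts from January 1 to January 4 are 1, 2, 1 and 0.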

To smooth out the variations and make the plot less jumpy, we can increase the window to a longer duration, such as half a year:

all_time_window

By extending the window, we're considering a longer period of inactivity as continuous contribution, which smoothens the curve. It shows us a more averaged view of the contributor engagement over time, reducing the impact of short-term fluctuations. This approach can be particularly useful for identifying long-term trends in contributor engagement.

The script: https://gist.github.com/drew2a/b05141a13c8d0c85c041714bba44b2d3#file-plot_number_of_contributors-py

drew2a commented 1 month ago

I'm spending my last days working on Tribler, and then I'm moving to another project outside of academia. I have been job hunting in the Netherlands for about two months, sending around 50 CVs and cover letters. Three of them led to interviews; the rest were either ghosted or rejected. It was quite a challenging time, but thankfully, I had a lot of experience changing jobs, so I was prepared for this period. Starting in September, I will be working with Airborn. It's a private company with closed-source projects, so unfortunately, I won't be able to share much (if anything) about my new job.

I appreciate the freedom that @synctext gave me during my time with Tribler and its open-source DNA. It is a truly unique project with unique organizational principles that go above and beyond the beaten tracks of both product companies and academia. Despite the freedom, working on this project is not easy as it requires a developer to learn a new environment and adapt to it without the luxury of having information from the field's pioneers.

From the very beginning, I focused on the engineering part of the project rather than the scientific part, as I thought engineers were a more unique resource for Tribler, which it had lacked in the past. Despite that, I contributed to some scientific work as well, working with @devos50 on a distributed knowledge graph:

Another scientific project I worked on involved tackling a long-standing problem with content grouping:

Working on content grouping was particularly interesting to me for two reasons. First, I did it solo with @synctext supervising. Second, at the very beginning, I didn't believe it was possible to solve the problem, but after the initial experiments, I saw a path forward and followed it until the task was completed. Kudos to @synctext for his intuition.


I'm going to post a more detailed wrap-up dedicated to the project in the issue:

In this issue, I'm going to publish the accumulated visualized data regarding Tribler's history that I have been posting here for the last two years. This historical research started as a necessity for me to understand Tribler and its codebase, and then it became driven by my curiosity, which I see as the purest possible scientific motivation.

Time will tell how the knowledge we've gained will help the next generation. For now, I'm satisfied with my short scientific journey, even though it wasn't canonical.

drew2a commented 1 month ago

Age of release branches in months

Branch: origin/release/7.14
    Latest commit date: 2024-04-24 12:02:08+02:00
    Fork date: 2023-03-16 19:05:54+01:00
    Fork commit: 81eb2495bd173c755ea0175ad30a6d4a37c7bc58
    Age: 404 days
Branch: origin/release/7.13
    Latest commit date: 2024-03-27 13:06:16+01:00
    Fork date: 2023-03-16 19:05:54+01:00
    Fork commit: 81eb2495bd173c755ea0175ad30a6d4a37c7bc58
    Age: 376 days
Branch: origin/release/7.12
    Latest commit date: 2022-09-20 13:20:41+03:00
    Fork date: 2022-04-01 11:55:43+02:00
    Fork commit: 50e84df930127c4e63aa0eedbb106252ebab325e
    Age: 172 days
Branch: origin/release/7.11
    Latest commit date: 2021-12-27 20:34:47+01:00
    Fork date: 2021-11-03 16:15:02+01:00
    Fork commit: f94a1e451ba5c32662be4d4ec0c0e30274aa3d77
    Age: 54 days
Branch: origin/release/7.10
    Latest commit date: 2021-08-06 16:59:42+02:00
    Fork date: 2021-05-31 18:54:38+02:00
    Fork commit: 441a7d8fa222254e1e801697c8c1d51cd41dca82
    Age: 66 days
Branch: origin/release/7.9
    Latest commit date: 2021-04-01 14:58:27+02:00
    Fork date: 2021-03-18 19:49:27+01:00
    Fork commit: fc2a6411fa199fa2ec9d81d565d5d9f4f5b5e445
    Age: 13 days
Branch: origin/release/7.8
    Latest commit date: 2021-02-12 13:11:56+01:00
    Fork date: 2021-01-27 11:22:51+01:00
    Fork commit: 86315c39ab2905b602efe96398d89ce594dbfd98
    Age: 16 days
Branch: origin/release-7.6
    Latest commit date: 2020-12-09 11:33:08+01:00
    Fork date: 2020-12-05 18:29:22+01:00
    Fork commit: 0c841fdf36e5497231f6f79d5451e74163a48ac3
    Age: 3 days
Branch: origin/release-7.3.0
    Latest commit date: 2019-08-27 12:42:49+02:00
    Fork date: 2019-07-18 12:14:07+02:00
    Fork commit: 31295da5889400222bf9e7ccebe9002f7b0509fe
    Age: 40 days

branches

Tribler+ipv8 repos history analysis by git-of-theseus

authors

authors_normalized

cohorts

survival_plot

Open Bugs Over Time

issues

Number of contributors

All Contributors

window: 90d, granularity: 15d, contribution_duration: 1 all_contributors

Continuous Contributors

window: 90d, granularity: 1d, contribution_duration: 30 ccontributors2