Proposal: Introduce [Prototype] Tag for Experimental PRs

Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery

https://www.tribler.org

GNU General Public License v3.0

4.77k stars 443 forks source link

Proposal: Introduce [Prototype] Tag for Experimental PRs #7799

Closed kozlovsky closed 4 months ago

kozlovsky commented 8 months ago

Background

Tribler uniquely straddles two worlds: it is both a robust, user-oriented product and a dynamic platform for advanced scientific research. This duality presents a challenging yet fascinating dichotomy in our development approach.

As a product, Tribler targets a broad user base, necessitating a high degree of stability across diverse operating systems. Achieving this requires adherence to stringent standards in development: our code needs to be not only functional but also extensible, well-documented, and rigorously tested. These practices ensure that Tribler remains reliable and adaptable, ready to meet the evolving needs of its users.

Conversely, as a vessel for scientific exploration, Tribler serves as a testbed for innovative and often experimental research. In this context, researchers may need to rapidly prototype to test hypotheses or explore new ideas. Such experimental coding, while invaluable for its insights, may not always align with the rigorous standards set for our user-facing product. It often involves a "quick and dirty" approach, with provisional code that is not intended for long-term integration into the codebase.

This dual nature of Tribler leads to a diverse code landscape where the requirements for stability and innovation coexist, sometimes uneasily. Recognizing and embracing these contrasting demands is crucial for our team to continue working productively and innovatively.

Challenge

From time to time, we encounter pull requests that get stuck in the review phase for a long time. In my opinion, such situations are often caused by developers' different assessments of the type of code added to a pull request. The pull request author considers the code he is adding to be a draft and a quick prototype, while the reviewer views the code as part of the Tribler foundation and expects strict criteria to be met.

Proposal: [Prototype] Tag

In this regard, I have a specific proposal: introduce a "Prototype" tag (or, alternatively, the prefix "[Prototype]") with which the author can tag his pull request if it contains experimental code, allowing for more flexible review standards:

Full test coverage is optional
The internal API of the implemented classes may not be perfect.
The structure of new tables in the database may not be universal enough.

Safeguards

According to the saying, "There is nothing more permanent than temporary." It is better to avoid situations where raw code written for a quick experiment becomes the basis for other code that can no longer be easily removed and refactored. So, to prevent prototype code from inadvertently becoming permanent and impacting users:

The PR author must demonstrate that the code won't negatively affect users and that any new database structures are easy to modify or remove.
Prototype code should be explicitly marked as such within the codebase.
Prototype code should not form the basis of user-facing features until it undergoes a thorough review and testing process.

Rationale

This strategy seeks to harmonize the need for swift experimental progress with the imperative of delivering a reliable user experience. It recognizes our dual identity as innovators in scientific research and developers of a user-centric product.

Feedback and thoughts on this proposal are welcome.

drew2a commented 8 months ago

Thank you for the proposition; any attempt to improve our processes should be praised.

Tribler is indeed a significant experiment and an academic product that partially serves users' needs and partially serves scientists' needs.

However, I don't see a necessity for dual code standards. Every PR opened in the repo intended to be merged into the main branch will eventually go to the user's machine. There should not be dual standards for the code, as all code will run on the user's machine and could lead to dangerous behavior (crashes, data leaks, etc.).

However, you are right that scientists sometimes need to experiment with Tribler. And they already have a tool for that (Jenkins, Gumby). There are two existing approaches:

To put the experimental code in the experiments folder https://github.com/Tribler/tribler/tree/main/scripts/experiments. Here, the code can be "dirty" and written "fast". This is acceptable as it does not affect our users.
To run an experiment from your own fork. With this approach, you don't even need any PR or approval.

If I'm mistaken and we indeed need this duality (to have stable code and unstable code), then it raises the necessity for quite a significant effort to maintain this new process. It's unlikely that we will be able to maintain this process effectively.

qstokkink commented 8 months ago

I almost fully agree with OP by @kozlovsky. However, I do disagree with "Full test coverage is optional". In my opinion, achieving full test coverage - even for experimental code - should be considered the lower bar for code contribution. In fact, I think we can pretty much stick to textbook Continuous Integration (CI), description from GitHub:

(CLICK ME TO EXPAND) https://docs.github.com/en/actions/automating-builds-and-tests/about-continuous-integration

>Continuous integration (CI) is a software practice that requires frequently committing code to a shared repository. Committing code more often detects errors sooner and reduces the amount of code a developer needs to debug when finding the source of an error. Frequent code updates also make it easier to merge changes from different members of a software development team. This is great for developers, who can spend more time writing code and less time debugging errors or resolving merge conflicts. > When you commit code to your repository, you can continuously build and test the code to make sure that the commit doesn't introduce errors. Your tests can include code linters (which check style formatting), security checks, code coverage, functional tests, and other custom checks.

This is a tried-and-true way to contribute code that is under active development. Reviews also become very easy. If the contributed code passes the PR checks, only some basic properties need to be reviewed: do the PR tests achieve coverage by asserting behavior, does the PR not disable any of the CI PR checks, does the PR not interfere with other code (strong coupling), does the PR actually implement (part of) the linked issue, etc. All very basic and quick manual review.

Of course - this where I agree with @drew2a - not all code is under active development. We have some code that has survived for ~15 years in our open source project. This is, essentially, ancient wisdom that has defined Tribler for generations and should be treated with some respect. That is not to say it cannot be scrutinized, but we can perform some heavier reviews on this type of code: it should definitely not be experimented with.

Regarding the need for experimental code: I'd say this is vital to the existence of Tribler. In my opinion, Tribler has no competitive advantage over other torrent clients when it comes to performance. I believe that the only reason that Tribler survives, is by virtue of being able to offer experimental features. We are able to deploy fundamental peer-to-peer algorithms (to better society) decentrally that nobody else can test at scale. This leads to collaborative opportunities with parties that are interested in such technology (allowing us to actually pay developers) and it empowers our users with features that no other torrent client can ever hope to offer. For example, one of those successful features is anonymous downloading.

The caveat of experimental code is that most experiments fail. For this reason, I'd like to see a fail-fast methodology implemented in the Tribler architecture. For "prototype" PRs, this means that code should be easy to add AND also easy to take away. I'd say the minimum requirement for this type of "prototype" code is a configuration option to enable and disable it.

xoriole commented 8 months ago

I also believe No code is permanent, and sooner or later will be replaced with something better. And, all code introduces technical debt (small or big) which has to be paid by future developers.

Regarding Tribler itself, the team wishes for it to be used by millions of users on one hand, and on the other hand, the majority of the work done on features is developed without any real validation. This leads to features being developed at a heavy cost later to be just removed and replaced by some other feature supposedly better than the previous one which is again not validated on the market.

Regarding the development process itself, I consider all features that are added to Tribler to be experimental code that should be treated the same way.

Here are my other opinions:

A common standard for the whole codebase, enforced by CI checks as much as possible.
Developers are free to develop the PRs in the best way/approach they think is right. The scope of the PR should be made clear upfront.
Reviewers review the code within the scope limited in the PR to detect actual issues or bugs on the code, unintended behaviors, etc avoiding merge/approval blockers based on improvements or enhancements that can be made on the PR. This should speed up the approval and merge of the PR.
For the improvements/enhancements that can be made on the PR, anybody (including the reviewers) are welcome to publish their PRs.

qstokkink commented 7 months ago

I observe that there is a sentiment to not have different standards for "normal" and "experimental", "prototype", or "new stuff" (whatever you want to call it). Like I mentioned before, I do think passing CI should be the common lower bar. However, I do offer the following calculation based on #7786:

It has taken me 3 days of full-time discussion to resolve 2 reviewer comments. The total number of reviewer comments is 64. The extrapolated number of days to address all of the reviewer feedback is therefore 96 days, or, almost four and a half months of 5-day working weeks spent on nothing but discussions, not development.

Obviously, I dropped the PR.

If we do want "experimental", "prototype", or "new stuff" something does have to change.

drew2a commented 7 months ago

It has taken me 3 days of full-time discussion to resolve 2 reviewer comments. The total number of reviewer comments is 64. The extrapolated number of days to address all of the reviewer feedback is therefore 96 days, or, almost four and a half months of 5-day working weeks spent on nothing but discussions, not development.

I think I should add another POV to the mentioned topic as I was one of the reviewers.

You can't extrapolate it like that. Not all comments are equal. Almost all of my comments, for instance, contain proposed code, and they don't change the logic of the requested discussion; they simply improve the code and can be applied with just a single button press. Some of them indeed question the way the feature is implemented, but I think it's worth responding to these questions during the review process instead of spending a lot more time later on understanding why they aren't working, rewriting them with new reviews, or removing them in the feature. It's worthwhile to spend an extra day discussing the DB structure instead of dealing with migrations in the future.

(⬆️ I'm assuming here that Tribler is more like a product for users rather than just experimental software from the academic world. This is highly debatable, and it's ultimately Johan's decision to set this "slider" and determine who we are.)

The time for answering comments could be dramatically shortened by discussing in person at the office, or even through an online call. We have done this many times with @kozlovsky, for example, and it has always worked. There's no need to add "prototype" code (which you agree is technical debt by definition). What's necessary is improving the "review addressing" process, such as not discussing minor things and discussing complicated things in person, etc.

It is the job of a project manager or team leader to monitor the time spent on PRs and to propose improvements to the review process, such as preventing developers from lengthy discussions in the comment thread and encouraging them to discuss issues in person, preferably in front of a whiteboard. Since we don't have that luxury, we need to manage this process ourselves, which requires agility from all involved parties.

Another example of a sub-optimal PR review process is ipv8. If you analyze PRs you see that there are almost zero comments in there and zero discussion about the way the PRs are implemented. Is it because the code is written ideally and the features are designed perfectly... I doubt it.

My point is that we should aim for a balance and not be at either extreme of the spectrum, being neither too strict nor too lenient.

Nobody writes perfect code. We learn through actions, and one of the most effective ways to learn "how to write good code" is through reviews in pull requests.

I fell into this trap myself about 10 years ago, when I had already been in the industry writing code full-time (C++, C#, assembler) for about 9 years. I thought my code was perfect due to my years of experience, my education, and the dozens of software development books I had read. However, I was in a bubble, and my perception was challenged when I began participating in the review process and received my first honest reviews. It took me some time to appreciate the criticism and understand that it is a valuable learning opportunity.

qstokkink commented 7 months ago

@drew2a Interesting read! Thanks for documenting the actual Tribler review process. This is new information to me and now your previous comment also makes more sense to me. I'll get back to this.

First, a short historical intermission. The "sub optimal" review process of IPv8 actually sounds perfect to me. In fact, I don't have the goal as a reviewer to teach my contributors "how to write good code", as you said. The opposite even: usually the IPv8 contributors have worked on some particular code for a long time (~10 years is not unheard of) and they school me as a reviewer on how to work with their code.

Getting back to the "new information". The review process of IPv8 (and Tribler in the past too) focused on deployment-first ("shift right testing", if you know the development books) and iterative development. This meant that nothing even close to "written ideally and the features are designed perfectly", as you said, ever got merged. It was always a process of iterative development. That is still the way I personally work on features and how I review PRs.

Perhaps I'm simply the "odd one out" in the team. If Tribler only merges ideally written features with perfect design, I simply cannot deliver this in a single PR. I need some space for iterative development. Long story short, I agree that working on a fork, as you suggested earlier, is then the only option. Because I make imperfect code, I would also need an issue tracker, so that would probably mean that the fork should also be a repository itself. Once features are fully mature, they can then be merged into Tribler.

Because my way of reviewing is fundamentally different from what Tribler uses, perhaps it also makes more sense if I pull out of Tribler/reviewers. Enforcing different standards at the same time will only lead to conflict. Of course, I'll keep making bug reports to help the Tribler devs out then.

synctext commented 7 months ago

Fascinating discussion :thought_balloon: Provocative thinking of having dual code standards for research and production. This triggered my personal alarm bells. We have more code, more unit tests, more user reported errors, more PR discussions, and we have been adding new scientific topics to our workload. We are publishing on "Decentralised AI" and other scientific cutting-edge topics such as decentralised learn-to-rank. We need unit tests for that and move beyond experimental code, before we can move that into production. Otherwise it will be very expensive to get stable. All this time we are adding to our lab workload, but never dropped any workload or hired new developers. I need to be less ambitious :crying_cat_face: Lets follow this balanced approach (not my idea, copied from comments above):

A common standard for the whole codebase, enforced by CI checks as much as possible. So, unit testing, application testing, mutation testing and linter. Not too strict: Black is the uncompromising Python code formatter.
Developers are free to develop the PRs in the best way/approach they think is right. The scope of the PR should be made clear upfront.
Reviewers review the code within the scope limited in the PR to detect actual issues or bugs on the code, unintended behaviors, etc avoiding merge/approval blockers based on improvements or enhancements that can be made on the PR. This should speed up the approval and merge of the PR. Each Pull request should be completed in 2-3 workdays. Minimise blocking and stale PRs.
For the improvements/enhancements that can be made on the PR, anybody (including the reviewers) are welcome to publish their PRs.

qstokkink commented 4 months ago

I have implemented the fork/repo idea. Experimental feature development is separate and the fork seems to be stable enough to develop on now. With that, I guess this issue is resolved.