Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0
4.85k stars 451 forks source link

The Components` implementation: Improvements #6335

Closed drew2a closed 1 year ago

drew2a commented 3 years ago

We have scheduled a meeting to discuss the implementation of the components. You can add your notes to this ticket.

drew2a commented 3 years ago

Interfaces

For each of the components there are two types of objects:

My main notes for interfaces are:

  1. Interfaces are unnecessary here (we can remove them)
  2. Interfaces are not what they seem (technically they looks more like Facades (https://en.wikipedia.org/wiki/Facade_pattern))
  3. A part of the interface`s implementation has to be moved to the tests (see comments in the example below)
class PopularityComponent(Component):
    enable_in_gui_test_mode = True

    community: PopularityCommunity

    @classmethod
    def should_be_enabled(cls, config: TriblerConfig):
        return config.ipv8.enabled and config.popularity_community.enabled

    # this method should not be a part of component. It should be a part of the tests.
    @classmethod
    def make_implementation(cls, config: TriblerConfig, enable: bool):
        if enable:
            from tribler_core.components.implementation.popularity import PopularityComponentImp
            return PopularityComponentImp(cls)
        return PopularityComponentMock(cls)

# This class should be moved to the corresponding tests
@testcomponent
class PopularityComponentMock(PopularityComponent):
    community = Mock()

Folders

We have the following folder structure: https://github.com/Tribler/tribler/tree/main/src/tribler-core/tribler_core

Components and modules should be merged into one single folder (as new "components" is old "modules").

Comments

There are no comments for the new complex logic (base.py). We should add comments to make work with components easier, not harder.

An example:

https://github.com/Tribler/tribler/blob/a34e02723cb7215fb29f6d3a89c9064416ed49c6/src/tribler-core/tribler_core/components/base.py#L181-L188

synctext commented 3 years ago

If everybody agrees we need refactor, then sure. Otherwise I propose we relentlessly focus on users growth through Tribler improvements and accept the buildup of technical depth the remainder of the year.

ichorid commented 3 years ago

buildup of technical depth the remainder of the year.

@synctext The buildup of technical debt you mean? I feel we have enough technical depth in our design now :wink:

kozlovsky commented 3 years ago

@synctext I agree with your focus on features and improvements that can attract new users to Tribler. In the upcoming years, we can add many new exciting features to Tribler.

Components have one clear goal from the developer's perspective - allowing developers to add new features faster. For example, a new Tagging system definitely looks like a component to me. We need to discuss tomorrow does the current implementation satisfy this goal or it needs some changes.

Components architecture is intended to be a backbone of Tribler, and we need to be sure that the backbone is good enough to not collapse under the weight of new features. For that reason, I think the topics that @drew2a outlined are very good ones.

Having a working component architecture that all developers agree with can untie developer's hands and allow them to work on new features with improved speed. I believe that at this moment we already have a code that can be modified easily without spending a long additional time on it, so after the upcoming discussion, it should be easy to take into account our needs if they are not covered by the current approach.

synctext commented 3 years ago

sorry, I've been doing this for 16 years in Tribler, working on a generic framework to accelerate development time has been tried many times and consistently failed. :fearful:

Lesson from Dispersy, lesson from Libswift: never make a framework until you find yourself repeating the same thing 3 times. With tagging we're altering the core storage of metadata. It might evolve into a single unique piece of code which stores content with a trust-level. That will surely break all your assumptions. I'm hesitant to make anything "backbone" changes to Tribler. IPv8 communities are not broken, so our backbone does not need fixing.

Reflections on team composition

At the Tribler team we have 4 scientific developers and 5 researchers. Everybody is trained to think independently as a system architect. So we have 10 system architects basically, me included. We have a systematic bias towards making too many architectural changes; striving for godly perfection. That is a problem to be aware of. The opposite may also happen. Similar projects like IPFS have no system architects and struggle {e.g. failed} with finding a good system architecture that will scale. Their team composition seems to be too narrow focused on individuals with their own task, losing track of cardinal global issues. For instance, as somebody smart said: IPFS/Filecoin is using a centralised business model for a decentralised system. :exploding_head:

1989

kozlovsky commented 3 years ago

I'm excited about the prospect of an offline all-dev-team meeting for discussing components. I'm sure that with the collective discussion, we can make the components architecture much better.

The current version of the components architecture has not emerged from an empty initial state. It was an incremental refactoring of the previous approach that you can see here.

First, I want to describe the component architecture that we have right now. The current implementation of components is trying to achieve the following goals:

  1. Make clear statically typed boundaries between different parts of Tribler.
  2. Allow flexible configuration of Tribler when some components can be turned off with predictable consequences.
  3. Solve the problem of circular imports between components and avoid the loading of expensive modules for components that are disabled in the current Tribler configuration.
  4. Allow components to communicate directly without a mandatory use of an intermediate Session object.
  5. Allow asynchronous initialization of components.
  6. Allow assembling arbitrary configurations of components.

Let's see how the current components implementation achieves that.

Static typing of components

With the current approach, a developer can implement a **component interface**, which statically describes all fields provided by a component. It is possible to implement methods in a component interface. Still, usually, it is just some other objects provided by a component, such as a community, a store, or a manager implemented by some Tribler module. ```python class FooComponent(Component) community: FooCommunity manager: DownloadManager store: MetadataStore ... ``` A component is treated as a container for some values (such as communities) which are statically typed. A component interface specifies field types, while the component implementation provides the actual values of these fields. Some other component's implementation can specify its dependency on `FooComponent` in its implementation: ```python class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) # the type of foo_component variable is FooComponent community = foo_component.community # strongly typed! the type is FooCommunity manager = foo_component.manager # strongly typed! the type is DownloadManager store = foo_component.store # strongly typed! the type is MetadataStore ``` If your PyCharm is configured correctly and you try to play with current components in PyCharm IDE, you will see that PyCharm can perfectly understand that the result of the `self.use(FooComponent)` call has the `FooComponent` type. It is beneficial, as it allows PyCharm and various type checkers to highlight many errors early on. You can navigate easily through the code by clicking on a component field like `foo_component.manager`. That should bring you to the right place, where you can click on the field type to see the `DownloadManager` class definition. Also, you can click on `FooComponentImp` and see all details of the `DownloadManager` instance initialization. So, in responding to: > 2. Interfaces are not what they seem (technically, they looks more like Facades (https://en.wikipedia.org/wiki/Facade_pattern)) I disagree with that. Interfaces in the current implementation of components are real interfaces, as they: 1. statically define types of component fields. 2. does not provide actual values for these fields, leaving that for the component's implementation. I need to say that the component's interfaces in the current implementation are not *pure* interfaces, as, in addition to the Interface pattern, they also implement the Factory pattern. In the current component architecture, each component's interface and its implementation are coupled: each interface is linked to a single "real" implementation and a single "dummy" implementation. This is done for simplification: no Tribler interface needs to have more than one "real" implementation.

Flexible configuration of Tribler

Our current component architecture allows us to use a subset of components. Depending on configuration options, some components can be turned off. For this, a class of component's interface needs to provide `should_be_enabled` classmethod: ```python class FooComponent(Component) ... @classmethod def should_be_enabled(cls, config: TriblerConfig): return config.foo.enabled and config.xyz.enabled # elaborate conditions are possible ``` Again, the reason to place this classmethod in the component's interface class was to simplify architecture as much as possible and don't introduce unnecessary new entities. In the current Tribler architecture, it is enough to have this method hardcoded inside the component interface. We can extract it later when (and if) a reason for this will appear. It is possible that other components are trying to use component which is not currently enabled: ```python class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) # FooComponent can be disabled ``` What should be the value of the `foo_component` variable in that case? One option was to set it to `None`. But it turns out to be pretty inconvenient. Currently, a component is basically a container for some fields which are initialized indirectly in a Dependency Injection style. We don't use the component itself. We use fields that this component provides to us, like in the following example: ```python class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) community = foo_component.community manager = foo_component.manager store = foo_component.store ``` If `foo_component` can be `None`, then direct access to `foo_component.community` and other fields becomes impossible - we need to check if `foo_component is None` or not before that. In practice, it turns out to be inconvenient and error-prone. In order to avoid this, each component has a second dummy implementation which is returned when the component is disabled. This dummy implementation has the same set of fields as the real component implementation, so we can access the `foo_component.community` attribute even if `foo_component` is actually disabled. The actual value of `foo_component.community` for dummy implementation of `FooComponent` can be `None` or some dummy version of the community depending on what is more convenient for other components, which depends on `FooComponent`. Currently, the class for dummy `FooComponent` implementation is named `FooComponentMock`, but the name is actually pretty misleading. The `Mock` ending gives the impression that this class is for tests. It actually **can** be used for tests, especially if we are testing some other component and want to disable everything that is not relevant to the current test. But the main purpose of this class is to be used in real Tribler run when the component was disabled in configuration settings, but other components can attempt to use it. For that reason, I think that it is better to rename dummy components implementations from `FooComponentMock` to `DummyFooComponent` or something like that. An attribute `foo_component.enabled` allows us to understand whether we have an actual component implementation or its dummy version. This allows us to write code in the following style: ```python class XyzComponentImp(XyzComponent): ... async def run(self): ... foo_community = await self.use(FooComponent).community # will not crash if FooComponent is disabled bar_component = await self.use(BarComponent) # can be real or dummy implementation if bar_component.enabled: # this is the real implementation for sure # do something with fields of bar_component ```

How components solve the problem of circular imports

This part is straightforward. Each component's implementation refers to interfaces of other components, but not to their implementation: ```python from tribler_core.components.interfaces.foo import FooComponent ... class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) # the value of foo_component variable is FooComponentImp instance ``` This way, components can link each other easily. Cross-dependencies are possible and will not cause any problems during the module import. The same is with optional component dependencies. Let's consider the situation when `BarComponent` has a dependency on `FooComponent`, but that dependency is optional, so `BarComponent` can work fine if `FooComponent` is disabled in config settings. Then `BarComponent` can be written this way: ```python from tribler_core.components.interfaces.foo import FooComponent ... class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent, required=False) # optional component dependency if foo_component.enabled: # do something useful with foo_component content ``` If `FooComponent` is disabled, then `BarComponent` will load `FooComponent` interface, but not the implementation. If the implementation of `FooComponent` contains imports of some expensive modules, this time will not be spent, and the Tribler startup time will be reduced.

Direct communication of components

In the previous code of Tribler, modules communicate with each other using a god-like `Session` object as an intermediary object. It was typical for `Session` to have all communities as its fields, and then other components should access `session.ipv8`, `session.dlmgr` etc., to access necessary objects. The result of it was a highly coupled architecture when all possible "components" should be specified in Session. It kind of works but might be error-prone when not all "components" are actually enabled. In the new component architecture, the components don't use the `Session` object at all to communicate with each other. Instead, they directly use other components in their code: ```python class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) # no usage of Session object here ``` This made component dependencies clean and explicit, and it became possible to understand which components are using other components easily.

Asynchronous initialization of components

There are some tricky questions regarding components initialization: * In which order components should be initialized? * What to do, if some component refers to another component that is not fully initialized yet? * What to do if some component requires a long time for its initialization? The simplest solution would be to specify some strict linear order for component initialization and then initialize components in that order one by one. This approach is possible but has two big drawbacks: * Someone needs to analyze the code of all components carefully to derive the correct initialization sequence. This may be error-prone. * Some components can take a really big and unpredictable time for initialization, and it may significantly delay the initialization of other components placed further in the initialization sequence. In the current architecture, another approach was implemented, which allows all components to be initialized asynchronously and in parallel. If some component A refers to another component B, the initialization process of component A is paused until component B becomes fully initialized. Then component A receives the reference to component B, and the initialization process of component A resumes. You can see it in the following example: ```python class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) # the execution is paused until FooComponent is ready ``` A similar logic is applied to the component shutdown - a component finalization procedure starts when all other components that depend on this component were already stopped. With the current implementation, it is possible to have a graph of dependencies between components to detect cyclic dependencies during component initialization. It was already implemented but then removed to simplify code a bit, as it was unnecessary to have this detection mechanism. It can be added again afterward.

Assembling arbitrary configurations of components

In the current component architecture, the role of `Session` was heavily reduced. Previously the session object was the central place where references to all important "components" were stored, and many "components" access the session object to find references to other "components". Now components can talk to each other directly without the direct use of the `Session` object. Nevertheless, the Session object remains an important part of the architecture. Its main purpose now is to keep the mapping of component interfaces to component implementations. As I said previously, now each component interface has two implementations - one is "real" implementation which is used in the normal Tribler run, and another is "dummy" implementation which is used when the component is disabled. But in tests, it may be important to replace some specific components with mock implementations. I'm speaking not about classes like `FooComponentMocks`, that really should be renamed to something like `DummyFooComponent`, and that is used in the real application when the corresponding component is disabled, but about some test-specific mocks which can be implemented differently in different tests. `Session` object in a new component architecture can assemble an arbitrary collection of component implementations in the following way: ```python components = [foo_mock, bar_component] session = Session(tribler_config, components) with session: await session.start() ``` Here, `foo_mock` is some mock implementation that inherits from the `FooComponent` interface. It should have an `interface` attribute that refers to that interface. Then the code of the `bar_component` will be executed: ```python class BarComponentImp(BarComponent): ... async def run(self): ... foo_component = await self.use(FooComponent) # will return the foo_mock instance ``` This way, the current component architecture allows you to test arbitrary assemblies of components with arbitrary implementation for each component.
kozlovsky commented 3 years ago

Now I can return to @drew2a notes about the current realization of components:

For each of the components, there are two types of objects:

  • Implementation
  • Interface

My main notes for interfaces are:

  1. Interfaces are unnecessary here (we can remove them)

Speaking about component objects, it is useful to separate the interface and the implementation of a component for the following reasons:

Speaking about component modules, you are correct that we can put an interface and the main implementation of the same component into the same module. But having different modules for interface and implementation is helpful for two reasons:


  1. Interfaces are not what they seem (technically, they looks more like Facades (https://en.wikipedia.org/wiki/Facade_pattern))

As I already wrote above, I disagree with this. In the current implementation, interfaces describe component fields in a strongly typed way. Inside your component, you can write (await self.use(LibtorrentComponent)).download_manager, and PyCharm will correctly identify that the result's type is a DownloadManager instance.

Let me compare the current way how components work with the previous iteration of the component architecture.

On the previous iteration, components indeed do not have interfaces. Each component had a single value which was provided by that component:

class TunnelsComponent(Component):
    role = TUNNELS_COMMUNITY
    ...
    async def run(self, mediator):
        await super().run(mediator)
        config = mediator.config

        ipv8 = await self.use(mediator, IPV8_SERVICE)
        bandwidth_community = await self.use(mediator, BANDWIDTH_ACCOUNTING_COMMUNITY)
        peer = await self.use(mediator, MY_PEER)
        dht_community = await self.use(mediator, DHT_DISCOVERY_COMMUNITY)
        download_manager = await self.use(mediator, DOWNLOAD_MANAGER)
        bootstrapper = await self.use(mediator, IPV8_BOOTSTRAPPER)
        rest_manager = await self.use(mediator, REST_MANAGER)

In my opinion, this approach had the following drawbacks:

  1. The value returned by a component was untyped. Because of this, it was harder to analyze code properly - PyCharm cannot guess what is the type of ipv8 or peer here.
  2. A component could provide a single value only. So, in the code example above, it was necessary to have a group of three components: IPV8_SERVICE, MY_PEER, and DHT_DISCOVERY_COMMUNITY to pass three related values to TunnelsComponent.

With component interfaces, we now have typed values, and a single component can hold more than one value. Now the same code looks like this:

class TunnelsComponentImp(TunnelsComponent):
    async def run(self):
        await self.use(ReporterComponent, required=False)

        config = self.session.config
        ipv8_component = await self.use(Ipv8Component)
        ipv8 = ipv8_component.ipv8
        peer = ipv8_component.peer
        dht_discovery_community = ipv8_component.dht_discovery_community

        bandwidth_component = await self.use(BandwidthAccountingComponent, required=False)
        bandwidth_community = bandwidth_component.community if bandwidth_component.enabled else None

        download_component = await self.use(LibtorrentComponent, required=False)
        download_manager = download_component.download_manager if download_component.enabled else None

        rest_component = await self.use(RESTComponent, required=False)
        rest_manager = rest_component.rest_manager if rest_component.enabled else None

It does not look more concise because the code now takes into account that some optional components can be disabled in the Tribler configuration. But you can see the following differences:


  1. A part of the interface's implementation has to be moved to the tests (see comments in the example below)

As I wrote above, in my opinion, this is confusion caused by an unfortunate naming. FoobarComponentMock class is not intended to be used in tests exclusively. Its main purpose is to be used as a dummy replacement for the component when it is disabled in Tribler configuration. I think it is better to rename such classes to something like DummyFoobarComponent.


We have the following folder structure: https://github.com/Tribler/tribler/tree/main/src/tribler-core/tribler_core

  • components
  • config
  • modules
  • restapi
  • tests
  • upgrade
  • utilities

Components and modules should be merged into one single folder (as new "components" is old "modules").

Eventually, we should probably merge components and modules into a single folder. But right now, we don't have a one-to-one mapping between components and files in the modules folder.

In previous iteration of components architecture, components reside in the modules directory. The directory structure looked like this (a subset of files and subdirectories):

modules
    bandwidth_accounting
        bandwidth_endpoint.py
        community.py
        component.py
        ...
    category_filter
        category.py
        ... (no component.py here)
    exception_handler
        component.py
        exception_handler.py
        ...
    ipv8
        component.py
    libtorrent
        download.py
        download_manager.py
        ... (no component.py here)
    metadata_store
        community
            component.py
            gigachannel_community.py
            ...
        manager
            component.py
            gigachannel_manager.py
        restapi
            channels_endpoint.py
            metadata_endpoint.py
            ... (no component.py here)
        component.py
        config.py
        ...
    (...other subdirectories, some of them with components, others without)
    bootstrap.py
    component.py (base component class)
    dht_health_manager.py
    ...

Was this directory layout more convenient as compared to having all components in a dedicated directory? To me, it wasn't, for the following reasons:

  1. It was hard to answer the question: how many components do we have? As you can see, the files with components were placed on different levels of a directory structure, like metadata_store/component.py, metadata_store/community/component.py, metadata_store/manager/component.py, etc. With this layout, it is easy to miss a component hidden in some subfolder of another component.
  2. When a developer opens multiple component implementations in IDE at the same time, the names of the tabs, depending on IDE settings, can look either very uninformative (ten tabs with the same name "component.py") or pretty long (when IDE adds a directory name to the file name)

Having component implementations in a separate folder solves these two problems:

  1. All components are listed in the same directory, so it is easy to answer the question "which components do we currently have";
  2. It becomes possible to give components concise names that IDE can display nicely on tabs:
    components
    implementations
        bandwith_accounting.py
        gigachannel.py
        gigachannel_manager.py
        ipv8.py
        libtorrent.py
        (... and so on)

    But in return, you need to have one more separate directory.

Each layout has some benefits and drawbacks. So, choosing the directory layout is a trade-off. We need to discuss it and choose a less annoying layout that has the least number of drawbacks. Then we can switch to the most desired layout quickly, with a simple IDE-supported refactoring.


There are no comments for the new complex logic (base.py).

Agree with that. I need to add comments and probably some internal documentation. Probably descriptions from the current thread can be used as a start for it.


In conclusion, I think that the current implementation of components has its warts. I usually don't like to have a big number of files and directories, as well as boilerplate code. I hope we can improve the current architecture after the discussion. But we need to have a common understanding of component architecture goals, possible trade-offs, and viable alternatives. Hopefully, with the forthcoming meeting, we can achieve this.

ichorid commented 3 years ago

we have 10 system architects basically

And that's why no one cared about Tribler architecture all these years, right?

Their team composition seems to be too narrow focused on individuals with their own task, losing track of cardinal global issues.

I would argue that instead, this is exactly what was happening to our team all these years. PhDs and students don't care about the overall code quality and architecture. Their task is to get the job done ASAP and defend their thesis. Hence the (lack) of architecture. Also, our general scope was too broad.

never make a framework until you find yourself repeating the same thing 3 times.

The components refactoring is about not repeating the same thing 18 times over, which is:

In proper languages, such a C++, this kind of stuff is handled at the language level (e.g. constructors+destructors). Unfortunately, we have to invent our own :bike: in this case.

synctext commented 3 years ago

The components refactoring is about not repeating the same thing 18 times over [...] wait for pre-requisite component

Lets first slowly incrementally become feature complete, grow users and then do global changes. Are we fighting against our architecture to such an extend that it is slowing us down?

We learned for the mega pull request this summer that invasive architectural changes break a lot of our tooling (Gumby, app tester) and freeze all feature development for several months. Gumby is not even repaired and we're discussion the next big one? I'm opposed because its not broken. Module bootstrap frameworks may have operating system level complexity (systemd) and you easily fall into the trap of micro-services. However, if Tribler seniors @devos50 and @qstokkink agree that this is a good idea, then have fun!

devos50 commented 3 years ago

(Haven’t done a full review of the components code)

My overall opinion is simple: if it ain’t broke, don’t fix it.

I highly appreciate the effort that has been put in creating these components, and I think it’s the most exciting part of #6206. Although there might be a few parts I might disagree with on an implementation level, it is good to see that we make components explicit and more flexible. At the same time, my (controversial) opinion is that the old way of working with components did the job, while acknowledging that it was far from perfect (I argued about this when discussing #6206). In summary, I believe the additional benefits of yet another design iteration on components would be low and I'm not in favour of one. Feel free to prove me wrong on this.

Having a working component architecture that all developers agree with can untie developer's hands and allow them to work on new features with improved speed.

With respect to components, development time is only saved when creating new components/understanding existing ones, not necessarily when working on a particular component. Since most of the development time concentrates on improving existing components/fixing bugs, the saved development time seems to be marginal. Explicit components do speed up the process of understanding the Tribler architecture and inter-component relations though.

synctext commented 3 years ago

it is good to see that we make components explicit and more flexible.

My fear is another mega pull request for improving component architecture. Lets set a clear rule today (can obviously be overruled by agreement of all Tribler developers): no more mega pull request this year. None.

To add confusing nuance. Lets see: if ? @drew2a ? can create pull requests of small size and its merely 2-3 days of effort; then I'm very much for it. So: minimal viable beauty improvement. It that realistic?

drew2a commented 3 years ago

Based on the dev meeting (participants: @kozlovsky, @ichorid, @xoriole, @devos50, @drew2a) we decided that

ichorid commented 3 years ago

invasive architectural changes ... freeze all feature development for several months.

This is simply not true. The only reason why feature development was frozen is because of developers taking a summer vacation.

Are we fighting against our architecture to such an extend that it is slowing us down?

No, because everyone is smart enough not to touch the smelly parts. The result is people tend to take sub-optimal architectural choices, increase technical debt, and not reusing the code produced by others. Egbert is gone now, thank :godmode: we already refactored the download wrapper last year. Otherwise, no one would be able to understand what to do if something breaks.

drew2a commented 3 years ago

I've finished my part. All modules now merged with components:

image

@kozlovsky the remaining part is the documentation for component-session logic.

drew2a commented 1 year ago

Transition to the components has been done. For more than a year we use them and now it is clear that they are good enough. The issue can be closed.