Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0
4.83k stars 447 forks source link

Tribler code review #3646

Closed ichorid closed 5 years ago

ichorid commented 6 years ago

This is the big list of things I found out during the review.

TODO:

Subsystem Overall state Outstanding problems Solution
Startup Complete mess Spread through several big objects and files. Rearrange objects and methods into a single module.
Torrent handling 19-th century cookbook: lots of cool recipes, funky language, no structure. There are several types of distinct torrent objects with repeating functionality; all torrent-managing code is in 2 giant objects; no clear sync/async separation; pieces of code are scattered all around the codebase; Step 1: move everything into a single module; Step 2: unify torrent objects; Step 3: refactor to no end;
RemoteTorrentHandler Evolutionary atavism Does not work Delete and forget.
SQL handling Ancient magic Everyone do their own magic. Gradually shift to ORM, while writing new and refactoring old code.
Core utility code (Tribler.Modules.*) A collection of random pages torn from cookbooks of different centuries. Lack of structure. Gradually move everything in its place.
Dispersy The Necronomicon Cthulhu Fhtagn! Reimplement Search, Channels, AllChannel with IPv8, scalable local gossip algorithm and DHT.
TFTP module Minion of the Old Gods Obsolete Delete and forget.
SOCKS proxy Neat and tidy Is in the Tribler, instead of IPv8 TunnelCommunity Move to IPv8 TunnelCommunity.
REST API Neat and tidy Everything is fine, almost no cruft there. Praise and care.
Torrent Checker Neat enough Code is fine, though the algorithm is not effecient. Enhance the algorithm.
Category filter, Credit mining OK Could use some cleaning, but basically OK. Keep clean
Tribler Config Telephone book Ok, but a lot of boilerplate code and copy-paste make the potential programmer prone to error. Use Python .dict trick to reduce code size 3 times. And make individual subsystems add options on their own to common object.
TriblerChain adapter A love note to IPv8 No problems. At all. Praise and care.
TriblerTunnel Another love note to IPv8, much more passionate and longer. Some methods are too long and not grouped. Otherwise, everything is OK. Create member objects for subtasks and split some methods into more readable form.
Market community Writings of a heretical sect of madmen who split from the Church of the Old Gods to establish a cult of their own. Total overengineering. I don't know.
ichorid commented 6 years ago

Monitor-bezel-crushing image of Tribler module imports.

packages_tribler out

Zoomed in. Dispersy: disp_pack

Market: market_pack

IPv8: ipv8

(p.s. used pyreverse from pylint to generate these. Had to remove all decorators, since it can't handle different decorators with the same name, or when a decorator has the same name as the function.)

ichorid commented 6 years ago

It seems RemoteTorrentHandler contains completely dead code. Like, it's not used anywhere, or never ran. And when you break it, nothing happens, Tribler works OK. It seems to only write torrents DB, and never read it.

Guys, could someone comment on this issue?

qstokkink commented 6 years ago

It should be used here in the SearchCommunity: https://github.com/Tribler/tribler/blob/b096b9bf48a3ac815c104408f84259b639ac7014/Tribler/community/search/community.py#L98

You should be able to trigger this through searching for a torrent.

ichorid commented 6 years ago

The only place where something _rtorrent_handler is used is: https://github.com/Tribler/tribler/blob/b096b9bf48a3ac815c104408f84259b639ac7014/Tribler/community/search/community.py#L526

But it is never run! https://github.com/Tribler/tribler/blob/b096b9bf48a3ac815c104408f84259b639ac7014/Tribler/community/search/community.py#L516 is always empty. You could literally comment out .download_torrent, and Tribler GUI works as usual. Channels, search, anonymous downloads, mining - everything works.

So, what the rtorrent_handler subsystem is intended to do?

ichorid commented 6 years ago

Basically, message.candidate is always empty here: https://github.com/Tribler/tribler/blob/b096b9bf48a3ac815c104408f84259b639ac7014/Tribler/community/search/community.py#L509

qstokkink commented 6 years ago

@egbertbouman can you say anything about this?

synctext commented 6 years ago

It should collect 50k of torrents slowly in the background. The TFTP binary transfer of torrents never worked smoothly. You found another skeleton in our collection... We have an open issue for magnet links only Tribler operation. Discussion started 3 years ago #1150

devos50 commented 6 years ago

I can probably answer your question about the remote torrent handler module.

Another purpose of the module, besides what @synctext says, is to speed up the process of fetching torrent metadata. In particular, when clicking on a torrent in Tribler (i.e. a search result or a torrent inside a channel), it would try to fetch the torrent from other candidates. The torrent file itself would be sent using the TFTP protocol and the data was sent over the Dispersy endpoint (since this endpoint was already NAT punctured by Dispersy). I even think there is a separate prefix for this.

It used to be this way in the old GUI. When I built the new GUI, I added support for fetching torrent metadata using HTTP/UDP trackers and the DHT. Since (re)adding support for the TFTP transfer of torrent metadata was not very high on my priority list, this feature was never added back to the GUI. The code still exists in the original code base, together with a few tests (which are not really unit tests...). I highly doubt whether we want to re-add this feature again. For now, I would be fine if you remove the TFTP module and code that manages fetching torrents from others. Let's discuss this next week 👍

egbertbouman commented 6 years ago

@devos50 The RemoteTorrentHandler used to do a lot more than just TFTP torrent collecting, it's also capable of collecting from magnet/DHT/candidates. All this functionality is still in there, but I guess we stopped using this while creating the new GUI.

As far as TFTP goes, I don't think this has ever worked properly. If you look at the code you'll see the that it will only try to collect torrents that we already have :cry: This means that collect-requests are still coming in, but we never actually discover a new torrent.

ichorid commented 6 years ago

I talked to everyone involved, and it seems the consensus is: "we used this stuff in the old days, but now its irrelevant, so it should go". So, its removal is on the list.

synctext commented 6 years ago

Well, before it is deleted... We need to know the consequences. No full .torrent collecting anymore? Have measurements proven that magnet-based BEP9 collecting is now fast enough? Is there a design for anonymous downloads and getting the .torrent?

ichorid commented 6 years ago

@synctext, I've just ran Tribler with this stuff completely disabled, and it worked fine. Search, channels, downloads - everything ran fine. Torrent collecting is redundant now.

egbertbouman commented 6 years ago

@synctext I looked into the TFTP torrent collecting, and at the moment it's not working at all. That means that we are only collecting torrents from the DHT/trackers after some user interaction (e.g., downloading a torrent). For testing purposes, I did a 1-line fix, and Tribler magically started collecting TFTP torrents. So, if we are OK with the amount of torrents we are collecting right now, we could remove it.

Note that because we are not currently collecting torrents in the background, we are essentially gossiping all our downloads within the SearchCommunity...

synctext commented 6 years ago

Thank you for finding this out! Quality of search result has been going down in Tribler, in my direct experience. That explains it, the torrent collected has gone down. How many torrents do people have on average? How many people have the healthy 50k torrents? Every Tribler client used to broadcast these health stats. How many dead torrents in the cache? We need a single dev for the content health I believe.

ichorid commented 6 years ago

@synctext, we're gonna scrape all of this in favor of AllChannel2.0, the Heir to Dispersy. We definitely want to have some mechanism for background torrent metadata exchange, at least for anonymity purposes. However, there is no point in reclaiming this old code. For "Dispersy v1.9 refactored" release we could do away with the simple mechanism of returning more than one torrent on search/update request. That would be just enough to support anonymity and quality of service.

synctext commented 6 years ago

51% attacks are now real

We first need a spam solution, protection against an attack in AllChannels mechanisms with merely $1000 worth of cloud servers. With solid foundation we can focus on fancy features. Please focus on linking Trustchain non-cheap identities to content discovery. Perfect if we can all do that in 1 upgrade, but I prefer to first remove the SyncAll broadcast in AllChannel community, upgrade to IPv8, and discard all channels votes with Trustchain > 2 hops (and a future Max-Edge-Flow > 1 GByte).

That is a lot of features to get flawlessly deployed, correct?

ichorid commented 6 years ago

Trustchain non-cheap identities to content discovery. Perfect if we can all do that in 1 upgrade, but I prefer to first remove the SyncAll broadcast in AllChannel community, upgrade to IPv8, and discard all channels votes with Trustchain > 2 hops (and a future Max-Edge-Flow > 1 GByte).

- that is exactly what I meant by "Dispersy v1.9 refactored"! To put to use the years of Tribler experience, we have to keep the minimal core of successful code/algorithms from AllChannel/Dispersy. To remove obstacles to future developments, we have to port it to IPv8. To make it secure enough for MarketCommunity, we have to add Trustchain linking. No one talks about fancy features ))

ichorid commented 6 years ago

Each object in our codebase has its own SQL access methods... Maybe we should discuss using ORM?

It would greatly simplify and unify all SQL-related stuff.

synctext commented 6 years ago

Yes, ORM is also on my 2019 wish list.

Our running old-IPv8 ORM code: http://svn.tribler.org/abc/branches/giorgos/datasync/sqlalchemy/schema.py

Please take the time to read about our operational 2010 work with this code: https://repository.tudelft.nl/islandora/object/uuid:52b586ea-6144-4b4e-a5a1-b05255ce493a

Plus IPv8 in 2010 with ORM: http://kayapo.tribler.org/trac/wiki/IPv8

synctext commented 6 years ago

There is a LOT of valuable experience in this code, so refactoring should be done carefully.

Impressive how fast you comprehend our code base.

ichorid commented 6 years ago

@synctext, I have to, to be able to keep up with the pace of development...

ichorid commented 6 years ago

IPv8 issues: Boilerplate code:

auth = BinMemberAuthenticationPayload(self.my_peer.public_key.key_to_bin()).to_pack_list()
dist = GlobalTimeDistributionPayload(self.claim_global_time()).to_pack_list()
payload = MyMessage(self.BLABLABLA).to_pack_list()

is a classic example of a boilerplate code that is used in every message in every community. @qstokkink , is there some reason why this is not moved to IPv8 _ez_pack, and the programmer have to do it by hand every time?

qstokkink commented 6 years ago

@ichorid the GlobalTimeDistributionPayload is deprecated and I want everyone making a message to be discouraged from putting it in there, by forcing them to add this extra line.

ichorid commented 6 years ago

@qstokkink, but it is in your wiki tutorial... What should we use instead of it?

qstokkink commented 6 years ago

@ichorid good point, I'll remove that.

ichorid commented 6 years ago

I've done some quick check for duplicates in Triber with Clonedigger. Market community is the first victim:

Source files: 33
Click here to show/hide file names

Clones detected: 71

753 of 3066 lines are duplicates (24.56%)

Parameters
clustering_threshold = 10
distance_threshold = 5
size_threshold = 5
hashing_depth = 1
clusterize_using_hash = False
clusterize_using_dcup = False

(p.s. the analysis does not include unit tests.)

market_duplicates.pdf

@devos50 , I guess this kind of analysis could be useful for your efforts with Market.

ichorid commented 6 years ago

Looks like IPv8 has some duplicates too:

Source files: 8

Clones detected: 138

872 of 6993 lines are duplicates (12.47%)

Parameters
clustering_threshold = 10
distance_threshold = 5
size_threshold = 5
hashing_depth = 1
clusterize_using_hash = False
clusterize_using_dcup = False

ipv8_duplicates.pdf

ichorid commented 6 years ago

Bear in mind, though, that Clonedigger's algorithm is language-agnostic and prone to false positives. So, take these stats with a proper grain of salt.

ichorid commented 6 years ago

For the whole Tribler/next, without Dispersy and Test directories, and without submodules:

Source files: 272

Clones detected: 2822

4892 of 30018 lines are duplicates (16.30%)

Parameters
clustering_threshold = 10
distance_threshold = 5
size_threshold = 5
hashing_depth = 1
clusterize_using_hash = False
clusterize_using_dcup = False

tribler_duplicates.pdf

ichorid commented 5 years ago

Anyone willing to continue this work - welcome! Closing it for now.

ichorid commented 5 years ago

After a year of refactoring, Dispersy is dead and Market is moved to a submodule. Here are the updated diagrams: image

image