Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0
4.79k stars 444 forks

searching for torrents is anything but anonymous #1055

Closed (nlitsme closed this issue 4 years ago)

nlitsme commented 9 years ago

Looking with tcpdump shows that my search strings and results show up in plaintext.

This is not what I expect from a client claiming to protect my privacy.

synctext commented 9 years ago

Still a work in progress. You don't talk to random people. People like Tor have no spec of search... We improved the scalability of Gnutella-like approaches. +privacy.

krisives commented 9 years ago

@synctext Your response makes no sense. Maybe it's a language barrier?

This software seems to make more claims on being privacy protecting than it seems to be able to deliver on. I would advise removing those claims until you can support them.

tobia commented 9 years ago

The homepage carries a red warning sign that says:

Tribler does not protect you against spooks and government agencies.

This is admittedly cryptic, but it means: Tribler does not protect you against people who may infiltrate your network or your computer, or otherwise attack and intercept your internet traffic. In the same way, it does not protect you from your ISP, nor from 3-letter government agencies which may already be watching your traffic.

We are a torrent client and aim to protect you against lawyer-based attacks and censorship.

IANAL, but a "lawyer-based attack" is one that employs legally acceptable evidence. If a corporation hacks your network or your computer, they cannot bring that evidence in front of a judge.

Current BitTorrent clients make plainly available to any peer the list of other peers that are downloading and distributing that same file. No hacking is needed to know your IP address and the exact content you are downloading / uploading. Private trackers, where you only get in by invitation, mitigate the issue, but do not solve it altogether.

Tribler aims to solve this very real threat—which may not be a threat to you, depending on your country. For the purpose of evading the media companies' lawyers, securing the search strings (and results) between your client and the rest of the Tribler network is not a priority. Seeing as it's technologically harder to do properly, postponing it to a later date seems like an acceptable compromise.

But I agree that the wording on the front page and on the anonymity page is ambiguous, possibly misleading. I would add a few explicit paragraphs on what Tribler does not protect you from, including the interception and logging of search strings and results by your ISP and government agencies.

All things considered, it would be bad publicity for the project if some moron used Tribler to search for, say, child porn, triggered a search and seizure, and ended up behind bars. It's best to discourage this kind of use, making it very plain where the anonymity starts and where it ends.

krisives commented 9 years ago

Tribler does not protect you against people who may infiltrate your network or your computer, or otherwise attack and intercept your internet traffic. In the same way, it does not protect you from your ISP, nor from 3-letter government agencies which may already be watching your traffic.

All of those can be avoided by properly using cryptography that you don't have to invent or implement, it's already available to you.

tobia commented 9 years ago

Using cryptography effectively is just as hard as inventing the crypto algorithms themselves.

Especially in a decentralized network such as Tribler's, how can you anonymize a client searching for a string, with respect to the nodes that need to perform the search? You can't use a symmetric key, otherwise anybody reading Tribler's source can decrypt your traffic. You may perform a secure key exchange with any peer you might be giving your search strings to, but then that peer knows who you are and what you are searching for.

So you need to onion-route the searches and replies, as much as the torrenting itself. It's not trivial, and it's not important for Tribler's stated goal, which is probably why it wasn't done (yet).

But again, I agree that it should be made clearer on the homepage.

krisives commented 9 years ago

Especially in a decentralized network such as Tribler's, how can you anonymize a client searching for a string, with respect to the nodes that need to perform the search?

IBLT

krisives commented 9 years ago

Anyhow, I'm out. Once someone says "cryptography is too hard", I stop wasting my breath.

tobia commented 9 years ago

Oh really. Maybe you should try submitting your proposal for a distributed secure search engine, since you're so knowledgeable. I'll laugh at the first timing attack, length extension attack, man in the middle attack, rainbow table attack, collision attack, birthday attack, or any other subtle and devious attack you didn't even know existed.

Cryptography is hard. Any programmer underestimating it (and there are many) is a danger to his/her profession and to all its end-users. There's a reason Tribler is based on the Tor protocol and did not come up with its own onion routing scheme, and it's not laziness and code reuse.

krisives commented 9 years ago

You're welcome to come read the code for my project "RealBay" shortly. It implements searches through an IBLT with Namecoin and standard DHT. It's certainly going to be better than sending keywords in plaintext over the wire and will guarantee plausible deniability.

Or you can stay here and let your fear of how hard cryptography is keep you from implementing better solutions.

r1k0 commented 9 years ago

@krisives url please? thx

krisives commented 9 years ago

@r1k0 Here is the URL :heart:

The GUI is still being made in node-webkit but the indexing and searching tools are being made in C. It uses a bloom filter and lookup tables of bloom filters to determine where in the index to download. This will be updated to work recursively by referencing the old index in the newest index.
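To make the "lookup tables of bloom filters" idea more concrete, here is a rough Python sketch. It is only an illustration under assumptions, not realbay's actual index format: the `BloomFilter` class, the `build_lookup_table` and `pieces_to_fetch` helpers, the filter size, and the SHA-256 based hashing are all hypothetical choices. The idea shown is one small bloom filter per index piece, which the client checks locally to decide which pieces are worth downloading.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter; the bit count and hash count are arbitrary choices."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, word):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def might_contain(self, word):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(word))


def build_lookup_table(pieces):
    """pieces: a list of keyword lists, one list per index piece (hypothetical layout)."""
    table = []
    for keywords in pieces:
        bloom = BloomFilter()
        for keyword in keywords:
            bloom.add(keyword)
        table.append(bloom)
    return table


def pieces_to_fetch(table, keyword):
    """Return the numbers of the pieces that may contain the keyword."""
    return [i for i, bloom in enumerate(table) if bloom.might_contain(keyword)]


pieces = [["ubuntu", "linux", "iso"], ["debian", "netinst"], ["fedora", "workstation"]]
table = build_lookup_table(pieces)
print(pieces_to_fetch(table, "debian"))
```

A bloom filter never gives a false negative, so the piece that really holds the keyword is always in the returned list. It can give false positives, which in this sketch would only cost an extra piece download.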

To test you can get a list of torrents in CSV format and build an index:

 realbay-cli createindex torrents.csv myindex.tordex

It will generate some files (later these will be merged into one file). It expects a CSV whose first column is the "name" of the torrent; the second column is currently unused (we are fixing the CSV parser to accept arguments); and the third column is the hexadecimal hash to associate with those keywords.
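For anyone preparing such a file, here is a minimal Python sketch of reading the three-column layout described above (name, unused column, hexadecimal hash). It is an illustration, not the realbay-cli parser: `read_torrent_rows` is a hypothetical helper, and the sample rows use made-up placeholder hashes.

```python
import csv
import io


def read_torrent_rows(csv_file):
    """Yield (name, hash_hex) pairs from a three-column CSV."""
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        name = row[0]
        hash_hex = row[2].strip().lower()
        bytes.fromhex(hash_hex)  # raises ValueError if the hash is not valid hex
        yield name, hash_hex


# Placeholder data: the second column is left empty because it is unused.
sample = io.StringIO(
    "Ubuntu 14.04 Desktop ISO,,0123456789abcdef0123456789abcdef01234567\n"
    "Debian 7 netinst,,89ABCDEF0123456789ABCDEF0123456789ABCDEF\n"
)
rows = list(read_torrent_rows(sample))
for name, hash_hex in rows:
    print(name, hash_hex)
```

Validating the hash column before indexing is a design choice of this sketch. Whether the real tool rejects or skips bad rows is not stated here.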

After generating the index you can search it like this:

 ./realbay-cli findrecords myindex.tordex ubuntu

It will display all the hashes associated with that keyword.

Please let me know if you have any questions or want to get involved!

synctext commented 9 years ago

@Krisives sorry for the slow response, it's been busy.

Interesting project using a DHT and a Bitcoin variant. What is the expected performance limit, in terms of RTTs? How do you deal with spam? Can a single PC at 100 Mbps not flood your entire network with fake metadata and query responses (with a few /24 blocks)?

Tribler uses crowd sourced voting to try to address this issue. We got 250k votes on channels of torrents now.

Finally, doing keyword search via onion routing only makes spamming more trivial. Trust in the encrypted domain is an unsolved hard problem. Users demand sub-second response speed, plus fuzzy matching. This has never been done for a deployed DHT imho.

krisives commented 9 years ago

How do you deal with spam?

You only browse Namecoin addresses you trust. So if you don't like torrents someone makes, don't browse their "realbay" address (which is a new Namecoin prefix being made, see this thread)

Can a single PC at 100 Mbps not flood your entire network with fake metadata and query responses?

Again, if a "realbay" address is publishing indexes you don't like, you can "unfollow" it or simply stop searching its indexes. Likewise, to attack the Namecoin part of the system you would have to either break the cryptography in Namecoin (which is mostly the same as Bitcoin's) or use more computing power than Namecoin, which merge-mines with Bitcoin (meaning it has a lot of the hashing power provided by Bitcoin miners).

There is work right now on reorganizing the index and adding compression so that you only download a very specific number of blocks. A lot of hard-coded stuff is still lying around as we figure out more of the equations. Currently the index uses 256K sized pieces, but later that will be tunable.

Overall, having every torrent ever published searchable in about 500MB of metadata is pretty cool!

Tribler uses crowd sourced voting to try to address this issue. We got 250k votes on channels of torrents now.

I'm much less of the "popularity makes things right" type of person and much more of an "associate with what you like, disassociate with what you don't like" type of person.

doing keyword search via onion routing only makes spamming more trivial.

Not sure why this matters. As long as the data contains the actual keywords I consider it a leak and think it damages plausible deniability. That's one of the reasons I prefer using bloom filters over a tree or index containing the actual words - in addition to the space saving.

Trust in the encrypted domain is an unsolved hard problem.

This is something being worked on in Namecoin as well as in overlay networks that sit above Namecoin. It's important to note that when you search Namecoin you don't reveal your searches in any way; it's all client side. All of the searching in realbay is client side. When you look up a Namecoin address, you use the locally downloaded blockchain data. After you get the address, you look up the DHT hash from somewhere and download the index.

Users demand sub-second response speed

Well, I'm not entirely there yet, but the idea is that even with millions of torrents in an index, given a relatively unique search term, you will only download about 256K of data in total and apply your search to that data. Once the client gets matching infohash results, it looks those up in the DHT. There is work being done to make it faster to look up many infohashes in the DHT at once (with one request), as well as an overlay network to replace the DHT for TCP-only users (like Tor).

plus fuzzy matching.

If you have a realbay index you can search by multiple keywords and it will return only the results matching all the words. This is because of the nature of how bloom filters work, and you can combine their bits to join searches.

Keyword matching reduces all the words into lowercase and removes any non alpha-numeric characters. This can be improved, but it yields pretty good results. The bloom filter sizes chosen currently allow for about 16-20 keywords per torrent.
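The normalization and multi-keyword matching described above can be sketched roughly as follows. This is an assumption-laden Python illustration, not realbay's C code: `NUM_BITS`, `NUM_HASHES`, the SHA-256 based hashing, and every function name are hypothetical. Each torrent gets a bloom filter built from its normalized name. A query ORs the bits of all its keywords into one mask, and a torrent matches only if every bit of that mask is set in its filter.

```python
import hashlib
import re

NUM_BITS = 512   # assumed filter size, not realbay's actual value
NUM_HASHES = 3   # assumed number of hash functions


def normalize(text):
    """Lowercase each word and drop every character outside a-z and 0-9."""
    words = (re.sub(r"[^a-z0-9]", "", word) for word in text.lower().split())
    return [word for word in words if word]


def keyword_mask(word):
    """Return an int with the bloom filter bits for one keyword set."""
    mask = 0
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:8], "big") % NUM_BITS)
    return mask


def torrent_filter(name):
    """Build the bloom filter (as an int) for a torrent name."""
    bits = 0
    for word in normalize(name):
        bits |= keyword_mask(word)
    return bits


def search(index, query):
    """index: list of (filter_bits, infohash). Return hashes that may match all words."""
    query_mask = 0
    for word in normalize(query):
        query_mask |= keyword_mask(word)
    return [infohash for bits, infohash in index if (bits & query_mask) == query_mask]


index = [
    (torrent_filter("Ubuntu 14.04 Desktop (amd64)"), "hash-a"),
    (torrent_filter("Debian 7 netinst"), "hash-b"),
]
print(search(index, "ubuntu DESKTOP"))
```

This sketch matches whole normalized keywords only, and the filter can return false positives, so results still need to be checked against the real records.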

This has never been done for a deployed DHT imho.

Yes we don't do it through just the DHT, we use the realbay indexing format created to accomplish the fuzzy matching.

An example is that most people download torrents from larger sites that have lots and lots of torrents. It's likely an average user would only "browse" one or two realbay addresses. But that's already better than the current scenario where torrent sites are censored and taken down. Those sites are usually aggregators, they probably just take other realbay "feeds" of torrents.

Some people follow very specific sites for torrents. I could see people following more niche realbay addresses to get content they want on a more up-to-date basis.

synctext commented 9 years ago

Thank you for the extensive documentation of your work. I can now understand your technology decisions and Namecoin/DHT architecture. Interesting approach.

I'm much less of the "popularity makes things right" type of person and much more of an "associate with what you like, disassociate with what you don't like" type of person.

OK, that is a key difference with Tribler. Users are free to add channel subscriptions, your associate/disassociate feature. We try to offer the "Google experience" that by default you never see spam, computer-generated linkfarm content etc. Everything should happen correctly without user-feedback or addition of explicitly trusted peers.

Overall, having every torrent ever published searchable in about 500MB of metadata is pretty cool!

That is where we started 9 years ago. It works quite nicely, but does not scale and is easy to spam. All Tribler 3.3.4 users shared a big bunch of torrents and pushed around updates. See below our latest research after these years of improvement. This is significantly more complex, but fully implemented and deployed on our supercomputer cluster. It is based on additively homomorphic encryption, specifically the Paillier cryptosystem.

However, it is totally unusable and cannot go into production. It is too slow :-( Queries take too many seconds.

In recent years fully decentralized file sharing systems were developed aimed at improving anonymity among their users. These systems provide typical file sharing features such as searching for and downloading files. However, elaborate schemes originally aimed at improving anonymity cause partial keyword matching to be virtually impossible, or introduce a substantial bandwidth overhead. In this paper we introduce 4P, a system that provides users with anonymous search on top of a semantic overlay. The semantic overlay allows users to efficiently locate files using partial keyword matching, without having to resort to an expensive flooding operation. Included into 4P are a number of privacy enhancing features such as probabilistic query forwarding, path uncertainty, caching, and encrypted links. Moreover, we integrate a content retrieval channel into our protocol allowing users to start downloading a file from multiple sources immediately without requiring all intermediate nodes to cache a complete copy. Using a trace-based dataset, we mimic a real-world query workload and show the cost and performance of search using six overlay configurations, comparing random, semantic, Gnutella, RetroShare, and OneSwarm to 4P. The state-of-the-art flooding based alternatives required approximately 10,000 messages to be sent per query, in contrast 4P only required 313. Showing that while flooding can achieve a high recall (more than 85% in our experiments) it is prohibitively expensive. With 4P we achieve a recall of 76% at a considerable reduction in messages sent.

http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6707798 and http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6934311

krisives commented 9 years ago

It is based on additively homomorphic encryption, specifically the Paillier cryptosystem.

Very interesting, I will look more into that as I was considering using HE for this myself, but not sure if it will matter much without FHE.
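For readers unfamiliar with the term, "additively homomorphic" can be shown with a toy Paillier sketch in Python. This is a textbook-style illustration with tiny hard-coded primes. It is not secure and it is not the code from the research mentioned above. The function names and prime values are illustrative assumptions.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid because we fix g = n + 1
    return n, (n, lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n with generator g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(private_key, c):
    n, lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    return (c1 * c2) % (n * n)


public_n, private_key = keygen()
c_sum = add_encrypted(public_n, encrypt(public_n, 20), encrypt(public_n, 22))
print(decrypt(private_key, c_sum))  # prints 42
```

The scheme supports adding two encrypted values and multiplying an encrypted value by a known constant, but not multiplying two encrypted values together, which is the gap FHE would close.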

However, elaborate schemes originally aimed at improving anonymity cause partial keyword matching to be virtually impossible

This is something I'm okay with currently, but I think I can solve it later on, possibly with tabulation hashing.

For anyone curious there are PDFs online of both those papers without a paywall:

http://www.ijcnis.org/index.php/ijcnis/article/viewFile/139/89 and http://www.p2p-conference.org/p2p14/wp-content/uploads/2014/09/218.P2P2014_22.pdf

I'll have a look at those papers to see if they are relevant. From the scanning I did they seem to be basing their searching on hashing and walking the DHT, which is not something realbay does in any way. Realbay uses bloom filters instead of anything like that.

krisives commented 9 years ago

The URL to my other project was removed, so I'm re-posting it. The URL was requested by a user so I don't know why it's being removed.

https://github.com/realbay/realbay-cli/