bittorrent / bittorrent.org

384 stars 100 forks source link

Full Text Small World Searching BEP #25

Open artsokol opened 8 years ago

artsokol commented 8 years ago

Small_world_search_BEP.txt Proposal for extension that allows to run searching torrent files without central tracker

ssiloti commented 8 years ago

Is there an actual proposal somewhere or is this just a proposal for someone to write a proposal? In any case it is very unlikely such a BEP would be accepted. There is a distinct lack of good solutions to this problem.

artsokol commented 8 years ago

Today we publish document with our idea description. Waiting for your critical comments and questions.

arvidn commented 8 years ago

I would think that there are some privacy concerns with this protocol. For instance, it sounds like this would make it fairly easy to query a peer for all the torrents it's downloading/seeding. One could argue that to some degree this is already public information, but it's a lot harder to get today (the easy query is "which peers are downloading this torrent", but it's still hard to ask "which torrents is this peer downloading").

the8472 commented 8 years ago

I also don't see any defenses against malicious actors. Search networks are prime targets for inject of spam / fake / malware content under popular keywords.

artsokol commented 8 years ago

For instance, it sounds like this would make it fairly easy to query a peer for all the torrents it's downloading/seeding.

That's not correct. Other Host(peer) returns only one closest to our query other host and sws node belonged to or itself if there are no any. Possibly make sense to introduce new option "private" for torrent files that should not be involved in small world network

I also don't see any defenses against malicious actors. Search networks are prime targets for inject of spam / fake / malware content under popular keywords.

For the time being we should provide as more additional information about found files as we can (size, date of creating etc) to user can make a decision about downloading certain file. Some trust mechanism can be suggested in the nearest future by separate BEPs. For ex. small world is tightly connected network and complaining of one host causes the quick forwarding of rumor and malware info hash will be excluded from network by network.

aponom84 commented 8 years ago

@the8472, we understand that there is a problem related to "spam / fake / malware content under popular keywords". The main aim of this BEP is to bring to community the idea of the possibility of efficient full-text searching under P2P environment. We realize that there must be a mechanism for calculating the relevance. For example, relevance can be based on the number of seeding peers. Note, that a good torrent file with a good description together has a unique hash value. So it gives some protection against fakes.

the8472 commented 8 years ago

Well, the algorithm may be a starting point for a fulltext search, but to be implementation-ready other problems need to be solved too.

I think a BEP for full-text search should include some minimum resilience against malicious actors, a protocol specification, tangential problems like segmenting languages where whitespaces do not necessarily separate words (e.g. logographic languages) etc. etc.

All you propose is one building block out of several needed to build a fulltext search, that's not sufficient for a standards track BEP.

artsokol commented 8 years ago

We are in process of writing technical details and will provide it step by step. According information from BitTorrent.org first of all we should post Idea, get accept from community and get bep number. And that is why we have published just base theoretical description of our idea.

the8472 commented 8 years ago

In practice BEP1 is not really followed. You don't need a BEP number for now that the document is not even finished.

And a search feature is likely so complex that it'll need multiple BEPs for the building blocks. E.g. like BEP 9 relies on BEP 10 and optionally on BEP 5.

You'll probably need a working implementation that can be used as reference and to find real-world problems with the theoretical approach. possibly as stand-alone client or library.

I think everyone (other implementers) is quite skeptical of decentralized fulltext search because it's a very complex problem that other p2p networks have tried to tackle generally resulted in mixed outcomes.

The thing is that the search itself and its algorithmic complexity may not be the biggest problem. Gnutella&co already had limited-broadcast based solutions that sortof worked. The other problems are what really blows up the complexity and will make writing a concise BEP difficult.

Anyway, in my opinion a BEP would need to contain the following so that others could implment it in principle:

And that would just be the starting point for others to provide input, e.g. to improve privacy or add more defenses.


On a higher level, you have to consider what goals one would want from a fulltext search. some of the following goals are just nice-to-haves or weakly achieving them may be good enough:

artsokol commented 8 years ago

We are going to update our proposal regularly according parts finishing.

aponom84 commented 8 years ago

I just want to add a comment about computational complexity and performance. My colleagues and I have studied this issue for several years. We have wanted to adapt small world networks for searching, but our goal was make something more universal then well known DHT. And we successfully done it. We published several papers about efficient algorithms for approximate nearest neighbour search in high dimensional metric space. These algorithms construct navigable small world networks in one simple way. And our experiments have confirmed logarithmical search complexity of the searching procedure. You can more details in the following papers: [Malkov, Y., Ponomarenko, A., Logvinov, A., & Krylov, V. (2012). Scalable distributed algorithm for approximate nearest neighbor search problem in high dimensional general metric spaces. In Similarity Search and Applications (pp. 132-147). Springer Berlin Heidelberg.] - here you can find the details of algorithm. (This paper has awarded as a best paper in the SISAP conference) [Ponomarenko, A., Avrelin, N., Bilegsaikhan. N., Boytsov, L., 2014. Comparative Analysis of Data Structures for Approximate Nearest Neighbor Search. In Proceedings of The Third International Conference on Data Analytics] - in this paper you can find experiments with Wikipedia data set. Unfortunately we don't plot this graph in log log scale, but I sure you that in log log scale it has the from of a line. So now the nearest neighbour search in high dimensional spaces is not a "problem" and it is possible to implement logarithmic full text search even in the p2p setting.

artsokol commented 8 years ago

Added more details in common information, similarity and term vector description, searching algorithms and routing table.