Tribler / tribler

Privacy enhanced BitTorrent client with P2P content discovery
https://www.tribler.org
GNU General Public License v3.0
4.8k stars 444 forks source link

The relevance ranking algorithm when searching in Tribler needs to be redesigned #2250

Closed devos50 closed 4 weeks ago

devos50 commented 8 years ago

When searching in Tribler, search results (which can consist of torrents and channels) are ranked according to several criteria. Channels and torrents are sorted separately from each other. The associated channels of the torrents are also determined and in case a torrent is present in more than one channel, the channel with the highest number of votes is taken.

Channels are sorted based on the number of torrents in the channel. The algorithm to sort torrents is more complex. First, an array of five scores are constructed. These values are determined as follows:

  1. The number of matching keywords in the name of the torrent
  2. The number of matching keywords in the name of the files the torrent contains
  3. The position of the lowest matching keyword in the torrent name. For instance, when searching for ‘Away From Keyboard’ and ‘of’ is a matching keyword, this position will be 2.
  4. The number of matching keywords in the extensions of the file names of the torrent.
  5. A score that is based on several (normalised) attributes of the torrent. This score is determined after the set of local search results are constructed. The following formula is used: _score = 0.8 * normalized_num_seeder - 0.1 * normalized_num_negative_votes + 0.1 * normalized_numsubscriptions. Note that the number of negatives votes and number of subscriptions do not exist anymore and will always be 0.

After the torrents and channels are sorted, the results are put in a list. Every five torrent results, a channel result is inserted. This is to make sure that channels are also showing up in the search results (and more important, are visible to the user).

Also, I think it’s better to not be dependent on other search results to make streaming torrents from the database possible (like we do now when normalising). In other words, the final relevance score in the new algorithm should only be dependent on the data that this torrent object has.

I will start thinking about a new (more simple) algorithm that we could use for this and add the proposal in this issue.

Possible improvements:

Proposal for a new algorithm:

The new algorithm should take the matching with keywords better into account and be less dependent on other (confusing) variables such as number of seeders and number of torrents in the channel.

The above algorithm takes all of the improvements into account I outlined above.

devos50 commented 8 years ago

@whirm I need this for the new GUI and I almost have a proposal ready. Can you adjust the milestone and assign the issue to me?

devos50 commented 8 years ago

A first implementation of the algorithm as outlined above can be found here: https://gist.github.com/devos50/71b44339914259c82515f21846ee8173. Some minor tweaking might have to be done to make relevance ranking even better.

I will not implement this for the old GUI but rather make the REST API use the new algorithm.

synctext commented 8 years ago

nice! please try to write text and document this in a thesis.tex manner, instead of less formal issue rockets. saves work..

ghost commented 8 years ago

It might be worth considering that names aren't the only important thing a keyword search may be looking for - a 'deep search' of the content of files, too, used to be possible in earlier p2p networks (especially within special parts of files, like ID3 tags) and may yet be an important feature, long term to keep our eyes on.

Another thing worth thinking about is whether something like 'pagerank' could be relevant - whereas we have categories/channels that contain torrents, if you search for, for example 'thrones' if there's a channel of thrones it might be worth giving some kind of priority, even if a small amount, to all the torrents on that channel -- Channel: thrones S01E01.theora.ogg.torrent.magnet S01E02.theora.ogg.torrent.magnet S01E03.theora.ogg.torrent.magnet etc

synctext commented 8 years ago

@themusicgod1 Yes, we need accurate rich metadata, proper pageranking/itemranking, and good ID3 stuff.

btw. Worked on this myself 16 years ago, sadly never got solved (http://cd.ro.nu/hypermail/1046.html). Maybe an extra 16 years:-)

synctext commented 3 years ago

Re-opening this issue again. To also show this have getting attention in past years. Last update was 5 years ago. image ToDo: do a maximum 1-week only improvement sprint with basic heuristics. We will learn along the way, evaluate and come up with better design. Learn-by-doing methodology. Each matching swarm gets a score to determine the ranking. Basic score is the swarm size. (e.g. 2000 peers swarm get higher ranking than 10 peers). Heuristics:

  1. Biggest swarm first: Biggest swarm matching the keyword nicely should come first.
  2. One word matching rule: If you search for the creative commons movie "Sintel", the shortest matching swarm should be promoted. Matches such as the.script.from.the.movie.Sintel.pdf get lower score. Penalty if you have too many non-matching words.
  3. First word is the best rule: searching for "Sintel", the match "Sintel-Part-II" gets higher score than "Part-of-Sintel".
  4. Maximum matching rule. Search for multiple terms, like open source movie: "The Internet's Own Boy: The Story of Aaron Swartz". Penalty if you do not match all original search terms.
  5. No extra in-between words rule. Searching for "Internet's Own Boy" places results such as "Internet's very Own Boy" and "Internet's very special Boy person" lower than the exacter match: "Internet's Own Boy - Aaron Swartz"

Just mix in matching channels names for now. Treat them equally as matching swarms (with a translated popularity). Repeating: the goal is not to fully fix the "relevance ranking" problem, but to fix it 80% of the way and help us come up with a 97%-ish solution. Hence the time-boxed approach of 1 week of effort only please.

synctext commented 3 years ago

Tribler Search Engine - channel expansion

We can finally start doing relevance ranking. The popularity community is producing results and channels should be getting real-time browsing after reset! :confetti_ball: First priority is getting the above 5 heuristics implemented. Next heuristics, more invasive to QT:

  1. When a channel matches exactly the name of the "keyword search terms" it is shown in expanded state. See example
  2. When multiple channels exactly match the search, only the strongest is expanded.
  3. Adding thumbnails. Our users could be exposed to obscene thumbnails. Solution: only if the "family filter" is disabled by the user we display the thumbnail of the channel. Tribler_search_engine_design_v1
ichorid commented 3 years ago
  1. When a channel matches exactly the name of the "keyword search terms" it is shown in expanded state. See example

Yep. Users were even asking for something like that.

kozlovsky commented 2 years ago

During the last three weeks, I was finally able to work on the search quality improvements and have some interesting results. I can do a live demo tomorrow.

First, I implemented a new torrent_rank ranking function that completely replaces the Okapi BM25 function inside SQL queries:

torrent_rank(query, title, seeders, freshness) -> rank

rank is a float number in the range [0, 1). The better the match, the higher the rank. In my opinion, the new function does almost precisely what @synctext expects from it. Let's look at some examples:

The new function rates exact title match relatively high, but not the highest without the information about seeders and freshness:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81

When the number of seeders is known, the rank is higher, even better if the torrent creation date is known:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100) -> 0.855
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100, freshness=100*DAYS) -> 0.8769230769230768

It is better for a torrent to have more seeders and to be recent:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=100, freshness=100*DAYS) -> 0.8769230769230768
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=200, freshness=100*DAYS) -> 0.8923076923076921
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=100*DAYS) -> 0.8999999999999999
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=30*DAYS) -> 0.9262499999999999
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=300, freshness=10*DAYS) -> 0.9506249999999999

If a torrent has creation date in the future and its freshness is negative, the freshness is ignored:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', freshness=10*DAYS) -> 0.8775
torrent_rank('Big Buck Bunny', 'Big Buck Bunny', freshness=-10*DAYS) -> 0.81

Now, let's concentrate on the title ranking. The exact match is good:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81

If the title contains some excess words at the end, the rank is slightly worse:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny II') -> 0.8038167938931299

The more excess words we have in the title, the lower is the rank:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny Movie Creation History') -> 0.7917293233082708

If the title contains in-between words, the rank is lower. The closer extra words are to the start of the title, the bigger the penalty. The biggest penalty is when extra words appear at the start of the title:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II') -> 0.8038167938931299
torrent_rank('Big Buck Bunny', 'Big Buck Brown Bunny') -> 0.7506109979633401
torrent_rank('Big Buck Bunny', 'Big Bad Buck Bunny') -> 0.7424206815511163
torrent_rank('Big Buck Bunny', 'Boring Big Buck Bunny') -> 0.7312500000000001

The more extra in-between words the title has, the lower the rank:

click to expand ```python torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81 torrent_rank('Big Buck Bunny', 'Big Buck Bunny a') -> 0.8038167938931299 torrent_rank('Big Buck Bunny', 'Big Buck Bunny a b') -> 0.7977272727272727 torrent_rank('Big Buck Bunny', 'Big Buck Bunny a b c') -> 0.7917293233082708 torrent_rank('Big Buck Bunny', 'Big Buck a Bunny') -> 0.7506109979633401 torrent_rank('Big Buck Bunny', 'Big Buck a b Bunny') -> 0.6993358633776092 torrent_rank('Big Buck Bunny', 'Big Buck a b c Bunny') -> 0.6546181172291297 torrent_rank('Big Buck Bunny', 'Big a Buck Bunny') -> 0.7424206815511163 torrent_rank('Big Buck Bunny', 'Big a b Buck Bunny') -> 0.6852494577006508 torrent_rank('Big Buck Bunny', 'Big a b c Buck Bunny') -> 0.6362537764350453 torrent_rank('Big Buck Bunny', 'a Big Buck Bunny') -> 0.7312500000000001 torrent_rank('Big Buck Bunny', 'a b Big Buck Bunny') -> 0.6664556962025318 torrent_rank('Big Buck Bunny', 'a b c Big Buck Bunny') -> 0.6122093023255814 ```

The wrong order of words in the title imposes a penalty:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Bunny Buck') -> 0.7476923076923077

The penalty is very high if some words are missed from the title. The closer the missed word to the beginning of the query, the bigger the penalty. Several missed words lead to a huge penalty:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny') -> 0.81
torrent_rank('Big Buck Bunny', 'Big Buck') -> 0.47250000000000003
torrent_rank('Big Buck Bunny', 'Big Bunny') -> 0.44181818181818183
torrent_rank('Big Buck Bunny', 'Buck Bunny') -> 0.405
torrent_rank('Big Buck Bunny', 'Bunny') -> 0.2858823529411765

If the query string or title string contains duplicates, the ranking function correctly takes them into account. If duplicates are in the query string, they should also be in the title, or the title receives a major penalty. If duplicates are in the title but not in the query, it gives a slight penalty to the result (and the order is also important):

torrent_rank('Run Bunny Run', 'Run Bunny Run') -> 0.81
torrent_rank('Run Bunny', 'Run Bunny Run') -> 0.8033057851239669  # excess word in the title
torrent_rank('Run Bunny Run', 'Run Run Bunny') -> 0.7476923076923077  # wrong order
torrent_rank('Run Bunny Run', 'Run Bunny') -> 0.47250000000000003  # missed word in the title

When title rank is not ideal, having many seeders and the recent torrent creation date is still useful. The torrent with a non-exact matching title can be higher in the search result list if it has a huge number of seeders. But if the title match is bad, having a huge number of seeders does not help:

torrent_rank('Big Buck Bunny', 'Big Buck Bunny', seeders=5, freshness=365*DAYS) -> 0.8211573236889693

torrent_rank('Big Buck Bunny', 'Big Buck Bunny II', seeders=5, freshness=365*DAYS) -> 0.8148889471722596
torrent_rank('Big Buck Bunny', 'Big Buck Bunny II', seeders=1000, freshness=30*DAYS) -> 0.934177654406662

torrent_rank('Big Buck Bunny', 'Bunny', seeders=5, freshness=365*DAYS) -> 0.28982023189022443
torrent_rank('Big Buck Bunny', 'Bunny', seeders=1000, freshness=30*DAYS) -> 0.33224598930481275

Now some more examples that @synctext gives:

  1. Biggest swarm first: Biggest swarm matching the keyword nicely should come first.

    torrent_rank('Sintel', 'Sintel') -> 0.81
    torrent_rank('Sintel', 'Sintel', seeders=100) -> 0.855
    torrent_rank('Sintel', 'Sintel', seeders=1000) -> 0.8918181818181817
  2. One word matching rule: If you search for the creative commons movie "Sintel", the shortest matching swarm should be promoted. Matches such as the.script.from.the.movie.Sintel.pdf get lower score. Penalty if you have too many non-matching words.

    torrent_rank('Sintel', 'Sintel') -> 0.81
    torrent_rank('Sintel', 'the.script.from.the.movie.Sintel.pdf') -> 0.5210526315789475
  3. First word is the best rule: searching for "Sintel", the match "Sintel-Part-II" gets higher score than "Part-of-Sintel".

    torrent_rank('Sintel', 'Sintel-Part-II') -> 0.7955357142857143
    torrent_rank('Sintel', 'Part-of-Sintel') -> 0.6649253731343284
  4. Maximum matching rule. Search for multiple terms, like open source movie: "The Internet's Own Boy: The Story of Aaron Swartz". Penalty if you do not match all original search terms.

    torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Internet's Own Boy: The Story of Aaron Swartz") -> 0.81
    torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Internet's Boy: The Story of Aaron Swartz") -> 0.4984615384615385
    torrent_rank("The Internet's Own Boy: The Story of Aaron Swartz", "The Story of Aaron Swartz") -> 0.1915720319099015
  5. No extra in-between words rule. Searching for "Internet's Own Boy" places results such as "Internet's very Own Boy" and "Internet's very special Boy person" lower than the exacter match: "Internet's Own Boy - Aaron Swartz"

    torrent_rank("Internet's Own Boy", "Internet's Own Boy") -> 0.81
    torrent_rank("Internet's Own Boy", "Internet's Own Boy - Aaron Swartz") -> 0.7985915492957747
    torrent_rank("Internet's Own Boy", "Internet's Very Own Boy") -> 0.7509933774834436
    torrent_rank("Internet's Own Boy", "Internet's Very Special Boy Person") -> 0.43531669865642997

As the ranking function is integrated directly into the SQL query, it works fast. For me, the difference with the previous search algorithm looks very significant - now, the query to the local database returns precisely the result I expect it to give me. Remote queries will start receiving improved results after a version of Tribler with new ranking functions is officially released and installed by users.

I hope to have a chance to demonstrate the proposed search algorithm tomorrow at the lab (with some UI improvements).

kozlovsky commented 2 years ago

Regarding the ideal search UI, I think it should satisfy the following guidelines:

  1. As multiple memes said, nobody goes to the second page of Google search. For me, it means that it is more important to display the most relevant results at the top of the list than to display the most extended result list possible. And this, in turn, means that pagination of the result list is not necessary. It is enough to display a single result page with several hundred torrents, but these torrents should be the best match for a query, with the most relevant results at the top.
  2. The search should be fast. As Google reports, even a slight delay in displaying search results affects user experience. For me, it means that the Tribler search UI should not add artificial delays to integrate local and remove search results. It is better to provide local results as fast as possible and display remote results later.
  3. The difference between local and remote results is a technical detail of Tribler implementation. What was remote before the search becomes local right after, so it is confusing to display remote and local results on separate tabs, and they should be displayed in a single list (as we have now).
  4. On the other side, displaying the remote results should not lead to situations when a user attempts to click on a torrent and a new remotely received torrent is inserted into the same position at that moment, which leads to the opening of a wrong torrent. So, displaying remote results should not be confusing to the user.
  5. The search result list should not display multiple torrents with the same title, as it is hard for a user to find a difference between them. With the current UI, we have results for some searches where the entire first page is filled with precisely the same title without any distinction between result rows. In my opinion, it is better to collapse multiple results with the identical title into a single group and expand it later on to user demand.

I'll try to demonstrate the UI prototype that satisfies these guidelines.

kozlovsky commented 2 years ago

The screenshot demonstrates how the proposed search UI looks. It is very simple and does not contain any complicated UI elements. After a user types in a query string, it immediately sees local results (within seconds). If later remote results bring some new torrents, a search title starts displaying the number of remotely received torrents and updates this number dynamically. The user can click on the title of the search page to immediately add received torrents to the result list. In that case, new torrents are sorted according to the rank and inserted in the proper places. In the current draft implementation, these new torrents have the * symbol added to the beginning of the title so that it is possible to find new torrents in the list for debugging purposes quickly. If we decide to use this concept of the simplified search UI, then instead of the prefix, we can temporarily highlight remotely received torrents with a bright color.

2022-06-17_07-50-06

synctext commented 2 years ago

progress since 6 years! let not be afraid of a single refresh, users hate to click on things for them to be useful. Search: display local results and don't wait for remote search. Wait maximum 2000 ms for remote search to come in?

Data driven:

synctext commented 2 years ago

Reminder, in the future we would like to do live searches. We have some screenshots taking feature within Application tester: https://jenkins-ci.tribler.org/job/Deployment-Tests/job/Run_Tribler_Release/job/Run_Win10_64bit/ws/screenshots/screenshot_1661867305.jpg Currently we don't use the live network, in order not to make it unhealthy with possible faulty code! Should we create an isolated testnet for 10 peers which are queried using the remote search?

synctext commented 1 year ago

Solid progress, now we have dark SQL magic :partying_face: Search is now much better with this upcoming PR. One issue still pending with performance. There has been a huge variance in showing results in the GUI (100 ms upto 3 seconds, with exceptional outliers of 10 seconds). Still unknown; race condition?? (still guessing)

We don't have any theoretically grounded algorithms to implement, first hire new phd student on this.

ToDo document magic constructs like term_weight = 5 / (5 + i). We currently still have hard coded values in the search ranking code torrent_rank('Big Buck Bunny', `Big Buck Brown Bunny`) == pytest.approx(0.75061099)

For 2022 and 2023 we want to improve search sophistication. We perceive 6 distinct levels (currently moving from level 1 to level 2). However, understanding what is going on in a distributed system composed of strangers with abusive spammers is known to be the hard problem. Reliable reproducing is essential.

  1. current Tribler with 30 years old BM25 algorithm
  2. ranking heuristics
  3. ranking with perfect metadata (1. tags and 2. thumbnails) image
  4. fuzzy learn-to-rank
  5. Row bundling
  6. perfect personalised learn-to-rank with perfect metadata and micro bubbles of similar superfans (e.g. Tiktok state-of-the-art)
synctext commented 4 weeks ago

@mg98 old 2016 issue, relevant for phd research.

synctext commented 4 weeks ago

Shifting this issue to phd student @mg98 :smiley:

Epic issue: https://github.com/Tribler/tribler/issues/7290