Localsearch: Returns completely irrelevant results at the bottom of the list

jaruba commented 4 years ago

While testing the local search, I came across this interesting case.

To reproduce:

search query: "a series of unfortunate events"
dataset size: 10k
maximum results: 15

Screenshot of results:

Screenshot of Stremio's current results: (which seem much more accurate)

MartinKavik commented 4 years ago

Returns completely irrelevant results at the bottom of the list

Which results do you mean? For instance, the last one, "LEGO DC Comics Super Heroes: Aquaman - Rage of Atlantis", matches by prefix on the token "a" from the query "a series of unfortunate events".

The default tokenizer only splits the search query on whitespace, so "a" is a valid token, and I can't determine whether that's a bug or a feature. What's the expected behavior / results?
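To illustrate the behavior described above, here is a minimal sketch, not the crate's actual code. It assumes tokens are lowercased and that a query token matches when it is a prefix of any title token. The function names `tokenize` and `prefix_matches` are hypothetical.

```rust
/// Split text on whitespace and lowercase each token.
/// (Assumed simplification of the default tokenizer.)
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|t| t.to_lowercase()).collect()
}

/// Return the query tokens that are a prefix of at least one title token.
fn prefix_matches(query: &str, title: &str) -> Vec<String> {
    let title_tokens = tokenize(title);
    tokenize(query)
        .into_iter()
        .filter(|q| title_tokens.iter().any(|t| t.starts_with(q.as_str())))
        .collect()
}
```

Under these assumptions, the query "a series of unfortunate events" matches the Aquaman title only through "a" (prefix of "aquaman" and "atlantis") and "of".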

jaruba commented 4 years ago

Removed "a" and searched for "series of unfortunate events":

The expected behaviour is to get results that are relevant to the search query from a human perspective. Currently Stremio gives results that are relevant from a human perspective, while local search does not.

In both of the screenshotted cases, all the results after the first 2 seem meaningless. As we are working with many movie / series titles, it might indeed be a good idea not to match on "a", "the", "of", "and", "or", and other very general words used in the majority of movie / series titles, as they will bring many false positives.
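A rough sketch of that idea, assuming a fixed stop-word list that is dropped from the query before matching. The list holds only the words named above, and `significant_tokens` is a hypothetical helper, not part of the crate.

```rust
/// Hypothetical stop-word list: only the words named in the comment above.
const STOP_WORDS: [&str; 5] = ["a", "the", "of", "and", "or"];

/// Lowercase the query tokens and drop the ones that are stop words.
fn significant_tokens(query: &str) -> Vec<String> {
    query
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .collect()
}
```

A query made up only of stop words would come out empty, so a real implementation would probably need a fallback to the unfiltered tokens.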

But even removing "a" still brings weird responses, and in both cases I get sexually explicit content, which means we simply can't make this search live until the results become less erratic.

Maybe a good idea would be to ignore search results with scores that stray away too much from the primary results?

MartinKavik commented 4 years ago

Maybe a good idea would be to ignore search results with scores that stray away too much from the primary results?

I've improved the algorithm a bit, and search results with a score lower than a predefined threshold are now ignored. The default threshold is set to first_result_score * 0.48.
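For reference, the thresholding described here could look roughly like the sketch below. The `(title, score)` pair layout and the name `filter_by_threshold` are assumptions made for illustration; only the `first_result_score * ratio` rule comes from the comment above.

```rust
/// Keep only results whose score is at least `ratio` times the top score.
/// `results` holds (title, score) pairs; this layout is an assumption.
fn filter_by_threshold(mut results: Vec<(String, f64)>, ratio: f64) -> Vec<(String, f64)> {
    if results.is_empty() {
        return results;
    }
    // Sort by score, highest first, so the first entry is the top result.
    results.sort_by(|a, b| b.1.total_cmp(&a.1));
    let threshold = results[0].1 * ratio;
    results.retain(|r| r.1 >= threshold);
    results
}
```

With a ratio of 0.48 and a top score of 10.0, anything scoring below 4.8 is dropped.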

Updated version deployed to https://stremio-search.netlify.app/

Ivshti commented 4 years ago

@jaruba can you please check?

jaruba commented 4 years ago

this seems better for queries that are expected to have few results, like a series of unfortunate events and seinfeld

but i see some serious issues regarding ordering when used for queries that are expected to have many results:

results for avengers:

^ In this case the first result is a series from 1961, which definitely shouldn't be on top: the series has OK ratings but very few votes on IMDB. The second and third results are cartoons, one of which has a rating of 2.7/10, and in 4th place there's a 1998 movie that scores 3.8/10. "Avengers: Endgame", which is probably the most popular of the results, is last in 11th place, while "The Avengers (2012)", which I would have expected to see as the first result, is in 5th place.

results for superman:

^ the results for "superman" are missing the latest movie releases: Batman v Superman: Dawn of Justice (2016), Man of Steel (2013), it's also missing the well known series: The New Adventures of Superman (1993), and the latest series based on the story: Smallville (2001) (aka "Young Superman")

jaruba commented 4 years ago

are we checking the aka titles too when searching? as for the case of superman, the aka titles do include "superman" as a word, while not all the primary titles include it

MartinKavik commented 4 years ago

The source dataset is https://stremio-search.netlify.app/data/cinemeta_20_000.json. The search uses only the name field. If you want to integrate custom logic (e.g. boosting according to IMDB score or title aliases), I would need more detailed specifications and an updated dataset, because it won't be trivial to implement correctly.

jaruba commented 4 years ago

Cinemeta has an internal popularity score. A simplistic solution would be to sort items that have the same score by their popularity. Notice that in the case of avengers the first 6 results have the exact same score (31.557823863034535).
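A minimal sketch of that tie-breaking rule, assuming each hit carries a match score and a popularity value. The `Hit` struct and its field names are hypothetical.

```rust
/// Hypothetical search hit: match score plus Cinemeta popularity.
struct Hit {
    name: String,
    score: f64,
    popularity: f64,
}

/// Sort by score (highest first); break ties by popularity (highest first).
fn sort_hits(hits: &mut [Hit]) {
    hits.sort_by(|a, b| {
        b.score
            .total_cmp(&a.score)
            .then_with(|| b.popularity.total_cmp(&a.popularity))
    });
}
```

This only reorders hits whose scores are exactly equal, so a popular title with a slightly lower match score stays where it is.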

This would not fix the fact that "Avengers: Endgame", which should be much higher up in the results, would still be last in 11th place.

Maybe a better solution would be to ignore the matching score after the results have been selected, and simply sort all the results by popularity at the end.

I'd say that at this point the best results would come from including the aka titles in the search too (I realise this would decrease search speed) and sorting all final results by their popularity.

jaruba commented 4 years ago

What I'm not sure about is if we can get only the english aka titles, as the ones in other languages would increase the dataset size and might lead to many false positives in the results, although it would also lead to better search results for users that will search in their native languages.

jaruba commented 4 years ago

I'm also curious why a dataset of 20k includes items that have scores of 2.7 and 3.8 on IMDB.

@Ivshti how do we choose what is added to the dataset? should it not include only popular items?

Ivshti commented 4 years ago

It should not include unpopular items and indeed it should include only popular ones

I didn’t have much time to tinker with it so I just exported the top N by Stremio popularity. @jaruba You should tweak the dataset so that it works better. It’s just a simple export of the top catalog of cinemeta, by joining the first N/100 pages

MartinKavik commented 4 years ago

@jaruba You should tweak the dataset so that it works better. It’s just a simple export of the top catalog of cinemeta, by joining the first N/100 pages.

@jaruba Did you have time to prepare an ideal dataset from your point of view? If not, I'll try to do it according to the comments above. Thanks.

jaruba commented 4 years ago

@MartinKavik I'll be able to do it at the start of next week.

jaruba commented 4 years ago

Let's try with this export, I filtered by:

imdbRating exists
imdbRating is larger than 5
popularity exists
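
Expressed as code, the three filters above amount to something like this sketch. The `Item` struct is a hypothetical stand-in for a Cinemeta record, with missing fields modelled as `None`.

```rust
/// Hypothetical stand-in for a Cinemeta record; absent fields are `None`.
struct Item {
    imdb_rating: Option<f64>,
    popularity: Option<f64>,
}

/// True when the item passes all three export filters:
/// imdbRating exists, imdbRating > 5, popularity exists.
fn keep_for_export(item: &Item) -> bool {
    matches!(item.imdb_rating, Some(r) if r > 5.0) && item.popularity.is_some()
}
```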

The export includes 10k movies and 10k series, for each movie and series it also includes imdbRating and popularity.

It does not include title AKAs, as we don't seem to have those available in Cinemeta right now.

export.zip

jaruba commented 4 years ago

We might also consider basing the export on other (more accurate) popularity lists from IMDB or similar qualified reviewer databases if required.

MartinKavik commented 4 years ago

I've tried some algorithms for boosting/sorting according to IMDB rating and popularity, and the best one is probably:

let imdb_rating_boost = (record.imdb_rating / max_imdb_rating * imdb_rating_weight).exp();  // `exp2()` should also be fine if it's faster
...
let score = original_score * imdb_rating_boost * ...

where you can set imdb_rating_weight through the IMDB rating weight field - see the picture below:

I've also added a Score threshold (%) field: a value of 50 filters out all results whose score is lower than 50% of the top result's score, and 0 effectively disables filtering by score.
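
As a self-contained illustration of the IMDB rating boost formula quoted above, here is a sketch that covers only that boost. The other factors elided by `...` in the original snippet are deliberately left out, and the function name and parameter layout are assumptions.

```rust
/// Apply the exponential IMDB rating boost to a match score.
/// Only the IMDB boost is modelled; other boosts are omitted.
fn boosted_score(
    original_score: f64,
    imdb_rating: f64,
    max_imdb_rating: f64,
    imdb_rating_weight: f64,
) -> f64 {
    // The rating is normalized by the maximum rating, scaled by the
    // weight, then exponentiated so the boost is always positive.
    let imdb_rating_boost = (imdb_rating / max_imdb_rating * imdb_rating_weight).exp();
    original_score * imdb_rating_boost
}
```

With a weight of 0 the boost is exp(0) = 1 and the score is unchanged. Larger weights widen the gap between high-rated and low-rated titles that share the same match score.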

These boosts help in most cases (including the "Avengers: Endgame" problem you mentioned above). However, they don't help with superman - aka titles would be necessary for more relevant results.

Deployed as usual on https://stremio-search.netlify.app/

Q: How is popularity computed? And does it have a maximum value? Thanks.

jaruba commented 4 years ago

Sorry for the late reply, I've been very busy lately.

The results seem much better now imo, aka titles would be a nice to have too, but we can't get them from Cinemeta right now.

There's one more interesting case I've found:

But when adding a - we get:

I'm wondering what the underlying issue is here, and why the results from spider- don't show for spider too.

jaruba commented 4 years ago

Regarding:

Q: How is popularity computed? And does it have a maximum value? Thanks.

Only @Ivshti could answer that.

MartinKavik commented 4 years ago

Re spider - I've improved the tokenizer and it now also takes dashes (-) into account, so it should work as expected.
Re popularity - Ivo's answer on Slack: it's moviedb_rating*A * how_many_times_In_stremio_Lib*B, and it doesn't have a max value. I've updated the code to reflect that, along with some minor search algorithm improvements.
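
One way a tokenizer can take dashes into account is sketched below. It assumes dashes are simply treated as extra separators, which may differ from what the crate actually does. The name `tokenize_with_dashes` is hypothetical.

```rust
/// Split on whitespace and dashes, drop empty pieces, lowercase the rest.
/// (Assumption: dashes act as separators; the real tokenizer may differ.)
fn tokenize_with_dashes(text: &str) -> Vec<String> {
    text.split(|c: char| c.is_whitespace() || c == '-')
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}
```

Under this assumption the queries "spider" and "spider-" produce the same single token, so they would return the same results.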

jaruba commented 4 years ago

@MartinKavik I think this is good, or as close as we're going to get to it, I can't think of any further improvements right now.

elpiel commented 1 year ago

Since we're already looking to integrate this crate in core and provide autocompletion with it, I'm closing this issue and will link it for reference.

Stremio / stremio-core

Localsearch: Returns completely irrelevant results at the bottom of the list #148