Closed jaruba closed 1 year ago
Returns completely irrelevant results at the bottom of the list
Which results do you mean? For instance, the last one - "LEGO DC Comics Super Heroes: Aquaman - Rage of Atlantis" matches by prefix ("a series of unfortunate events").
The default tokenizer only splits the search query by whitespace, so `a` is a valid token, and I can't determine if that's a bug or a feature. What's the expected behavior / results?
Removed "a" and searched for "series of unfortunate events":
The expected behaviour is to get results that are relevant to the search query from a human perspective. Currently Stremio gives results that are relevant from a human perspective, while local search does not.
In both of the screenshotted cases, all the results after the first 2 seem meaningless. As we are working with many movie / series titles, it might indeed be a good idea not to match on `a`, `the`, `of`, `and`, `or`, and other very general words used in the majority of movie / series titles, as they will bring in many false positives.
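A minimal sketch of that idea, assuming a whitespace tokenizer like the one described earlier. The stop-word list is just the words named in this comment, and `tokenize_without_stop_words` is a hypothetical name, not something from the search crate:

```rust
/// Hypothetical stop-word list; only the words named in the comment above.
const STOP_WORDS: [&str; 5] = ["a", "the", "of", "and", "or"];

/// Splits a query on whitespace, lowercases each token and drops stop words.
/// If every token is a stop word, the unfiltered tokens are returned so that
/// a query made only of stop words still has something to match on.
fn tokenize_without_stop_words(query: &str) -> Vec<String> {
    let tokens: Vec<String> = query
        .split_whitespace()
        .map(|t| t.to_lowercase())
        .collect();

    let filtered: Vec<String> = tokens
        .iter()
        .filter(|t| !STOP_WORDS.contains(&t.as_str()))
        .cloned()
        .collect();

    if filtered.is_empty() {
        tokens
    } else {
        filtered
    }
}
```

With this, `a series of unfortunate events` would be matched on `series`, `unfortunate` and `events` only. Falling back to the unfiltered tokens is a design choice for titles that consist entirely of very common words.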
But even removing "a" still brings weird responses, and in both of those cases I get sexual content, which means we simply can't make this search live until the results become less fluid.
Maybe a good idea would be to ignore search results with scores that stray away too much from the primary results?
I've improved the algorithm a bit: search results with a score lower than a predefined threshold are now ignored. The default threshold is set to `first_result_score * 0.48`.
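For readers following along, a rough sketch of this kind of relative threshold, assuming the results are already sorted by descending score. The `Hit` struct and the function name are hypothetical and not taken from the crate:

```rust
/// A scored search hit; the struct and its field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    name: String,
    score: f64,
}

/// Keeps only hits whose score is at least `ratio` times the top score.
/// Assumes `hits` is already sorted by descending score, so the first
/// element carries the top score.
fn filter_by_relative_score(hits: Vec<Hit>, ratio: f64) -> Vec<Hit> {
    // An empty input has no top score; 0.0 is a harmless placeholder there.
    let top_score = hits.first().map(|h| h.score).unwrap_or(0.0);
    let threshold = top_score * ratio;
    hits.into_iter().filter(|h| h.score >= threshold).collect()
}
```

Calling `filter_by_relative_score(hits, 0.48)` would correspond to the default mentioned above; a ratio of `0.0` keeps every non-negative score.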
Updated version deployed to https://stremio-search.netlify.app/
@jaruba can you please check?
This seems better for queries that are expected to have few results, like `a series of unfortunate events` and `seinfeld`, but I see some serious issues with ordering for queries that are expected to have many results:
`avengers`:
^ In this case the first result is a series from 1961, which definitely shouldn't be on top: it has OK ratings but very few votes on IMDB. The second and third results are cartoons, one of which has a rating of 2.7/10, and in 4th place there's a 1998 movie that scores 3.8/10. "Avengers Endgame", which is probably the most popular of the results, is last in 11th place, while "The Avengers (2012)", which I would have expected to see as the first result, is in 5th place.
`superman`:
^ The results for `superman` are missing the latest movie releases: Batman v Superman: Dawn of Justice (2016) and Man of Steel (2013). They are also missing the well-known series The New Adventures of Superman (1993) and the latest series based on the story, Smallville (2001) (aka "Young Superman").
Are we checking the aka titles too when searching? In the case of `superman`, the aka titles do include "superman" as a word, while not all the primary titles include it.
The source dataset is https://stremio-search.netlify.app/data/cinemeta_20_000.json. The search uses only the `name` field. If you want to integrate custom logic (e.g. boosting according to IMDB score or title aliases), I would need more detailed specifications and an updated dataset, because it won't be trivial to implement correctly.
Cinemeta has an internal popularity score; a simple solution would be to sort items that have the same score by their popularity. Notice that in the case of `avengers`, the first 6 results have the exact same score (`31.557823863034535`).
This would still not fix the fact that "Avengers: Endgame", which should be much higher up in the results, would still be last on the 11th place.
Maybe a better solution would be to ignore the matching score after the results have been selected, and simply sort all the results by popularity at the end.
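To make the two options concrete, here is a small sketch of both orderings. The `Hit` struct, its `popularity` field and the function names are assumptions for illustration; how Cinemeta actually exposes its popularity value is not covered here:

```rust
use std::cmp::Ordering;

/// Illustrative record: `score` is the text-match score and `popularity`
/// stands in for Cinemeta's internal popularity value.
#[derive(Debug, Clone)]
struct Hit {
    name: String,
    score: f64,
    popularity: f64,
}

/// Option 1: order by descending score and use popularity only to break ties.
fn sort_by_score_then_popularity(hits: &mut [Hit]) {
    hits.sort_by(|a, b| {
        b.score
            .partial_cmp(&a.score)
            .unwrap_or(Ordering::Equal)
            .then_with(|| {
                b.popularity
                    .partial_cmp(&a.popularity)
                    .unwrap_or(Ordering::Equal)
            })
    });
}

/// Option 2: once the hits are selected, ignore the score and order
/// purely by descending popularity.
fn sort_by_popularity(hits: &mut [Hit]) {
    hits.sort_by(|a, b| {
        b.popularity
            .partial_cmp(&a.popularity)
            .unwrap_or(Ordering::Equal)
    });
}
```

Option 1 only reorders items within a group of equal scores, so a popular title with a lower match score stays at the bottom. Option 2 lets such a title rise to the top, at the cost of discarding the match score for ordering.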
I'd say that at this point, the best results would come from including the aka titles in the search too (I realise that this would slow searching down, though) and sorting all final results by their popularity.
What I'm not sure about is if we can get only the english aka titles, as the ones in other languages would increase the dataset size and might lead to many false positives in the results, although it would also lead to better search results for users that will search in their native languages.
I'm also curious why a dataset of 20k includes items that have scores of 2.7 and 3.8 on IMDB.
@Ivshti how do we choose what is added to the dataset? should it not include only popular items?
It should not include unpopular items and indeed it should include only popular ones
I didn’t have much time to tinker with it so I just exported the top N by Stremio popularity. @jaruba You should tweak the dataset so that it works better. It’s just a simple export of the top catalog of cinemeta, by joining the first N/100 pages
@jaruba Did you have time to prepare an ideal dataset from your point of view? If not, I'll try to do it according to the comments above. Thanks.
@MartinKavik I'll be able to do it at the start of next week.
Let's try with this export. I filtered by:
- `imdbRating` exists
- `imdbRating` is larger than 5
- `popularity` exists

The export includes 10k movies and 10k series; for each movie and series it also includes `imdbRating` and `popularity`.
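As a sketch of the filter described above (not the actual export script): the `MetaItem` struct is hypothetical, with field names that mirror the `imdbRating` and `popularity` keys, and the real schema of the export is an assumption.

```rust
/// Illustrative record; `None` means the field is missing in the source data.
#[derive(Debug, Clone)]
struct MetaItem {
    name: String,
    imdb_rating: Option<f64>,
    popularity: Option<f64>,
}

/// Keeps items that have a popularity value and an IMDB rating strictly
/// above `min_rating`.
fn filter_for_export(items: Vec<MetaItem>, min_rating: f64) -> Vec<MetaItem> {
    items
        .into_iter()
        .filter(|item| {
            item.popularity.is_some()
                && item.imdb_rating.map_or(false, |rating| rating > min_rating)
        })
        .collect()
}
```

`filter_for_export(items, 5.0)` would match the criteria listed above; an item rated exactly 5 is dropped because the comparison is strict.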
It does not include title AKAs, as we don't seem to have those available in Cinemeta right now.
We might also consider basing the export on other (more accurate) popularity lists from IMDB or similar qualified reviewer databases if required.
I've tried some algorithms for boosting/sorting according to IMDB Rating and Popularity. The best one is probably:
let imdb_rating_boost = (record.imdb_rating / max_imdb_rating * imdb_rating_weight).exp(); // `exp2()` should be also ok if it's faster
...
let score = original_score * imdb_rating_boost * ...
where you can set `imdb_rating_weight` through the variable `IMDB rating weight` - see the picture below:
I've also added the field `Score threshold (%)` - `50` would filter out all results whose score is lower than 50% of the top result's score. `0` effectively disables filtering by score.
These boosts help in most cases (including the `Avengers: Endgame` problem you mentioned above). However, they don't help with `superman` - aka titles would be necessary for more relevant results.
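A self-contained sketch of the rating boost from the snippet above, reduced to the IMDB rating factor only. The other factors elided with `...` in the snippet are left out, and `boosted_score` is a hypothetical name:

```rust
/// Multiplies a text-match score by exp(rating / max_rating * weight).
/// With `imdb_rating_weight = 0.0` the boost is exactly 1.0, so the
/// original score is returned unchanged.
fn boosted_score(
    original_score: f64,
    imdb_rating: f64,
    max_imdb_rating: f64,
    imdb_rating_weight: f64,
) -> f64 {
    let imdb_rating_boost = (imdb_rating / max_imdb_rating * imdb_rating_weight).exp();
    original_score * imdb_rating_boost
}
```

Because the boost is always positive and grows with the rating, two items with the same match score end up ordered by rating, and a larger weight widens the gap between them.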
Deployed as usual on https://stremio-search.netlify.app/
Q: How is `popularity` computed? And does it have a maximum value? Thanks.
Sorry for the late reply, I was very busy during this time.
The results seem much better now imo, aka titles would be a nice to have too, but we can't get them from Cinemeta right now.
There's one more interesting case I've found:
But when adding a `-` we get:
I'm wondering what the underlying issue is here, and why the results from `spider-` don't show for `spider` too.
Regarding:
Q: How is popularity computed? And does it have a maximum value? Thanks.
Only @Ivshti could answer that.
`spider` - I've improved the tokenizer, and it now also takes dashes (`-`) into account, so it should work as expected.
`popularity` - Ivo's answer on Slack: `it's moviedb_rating*A * how_many_times_In_stremio_Lib*B, doesn't have a max value`. I've updated the code to reflect that, along with some minor search algorithm improvements.
@MartinKavik I think this is good, or as close as we're going to get to it. I can't think of any further improvements right now.
Since we're already looking to integrate this crate in core and provide autocompletion with it, I'm closing this issue and will link it for reference.
While testing the local search, I came across this interesting case.
To reproduce:
Screenshot of results:
Screenshot of Stremio's current results: (which seem much more accurate)