Closed pannal closed 8 years ago
So this is actually a two-fold issue, I'll try to split my answer.
BTW I think you misunderstood the docs:
If you define moviehash and moviebytesize, then imdbid and query in same array are ignored. If you define imdbid, then moviehash, moviebytesize and query is ignored. If you define query, then moviehash, moviebytesize and imdbid is ignored.
- That is only true for the current array.
But you're adding separate arrays as criterions in your query to OSS, so that doesn't apply.
And I have user reports that the matching on OSS for movies has massively improved since I removed the hash-fixation.
Filenames are also controlled by people, they can mess with them too.
Indeed, I missed the "in the same array" part, let's do this then, the more subtitles the better. Subliminal's scoring will get rid of the bad ones. Although we might hit the result limit and that should emit a warning because we might miss some good subtitles.
Yeah, that's why I would try to trust a filename one score point below hash, and impose the same limitations I've added as a fail-safe, to hash.
By the way: Is there a reason why the score value of "hash" is the maximum you allow in subliminal? I've patched that, as I needed provider-specific boosts. Addic7ed in general has better quality subtitles for many shows, which subliminal currently can't "decide", because the quality of a subtitle itself is not reflected by its filename or its matching hash.
Are you available on some other medium? Jabber/Skype/whatever? It'd be nice to actually talk about some core concepts of subliminal outside of this issue system.
Per provider score adjustments should be added in the future, not sure how yet but definitly part of the scoring. Same with provider specific ratings (download count, good/bad rating, etc.)
No sure placing your trust in the filename is the proper response. Take a look at filenames given by providers, some could match a renamed video file in the kind Movie.mkv. What about the rest of the path? Say I sort my videos like that: Movies/2014/Frozen.mkv.
Besides, if all the information in the filename matches the subtitle, where's the added value of the filename matching? I agree it could be helpful but I would trust it as much as codec and things like that so it can give a little boost to subtitles with exact matching.
I can give you a specific user-case for that.
One user of Sub-Zero searched for movie "Abc" with language "pt-BR". Results were there but their score weren't high enough to fit his settings, because some douche didn't fill in the metadata of his subtitle on OSS properly. In the alternative release names of that subtitle my user's filename was listed and he asked me why the subtitle wasn't downloaded.
That's the exact use case for the filename. Not matching withing OSS's title itself, but in the alternative release names list (which can't be queried directly, only by using "tag").
@Diaoul sorry to jump in. don't we use OS settings "trusted" uploader? can we pass that as argument?
they use this image in the site: http://static.opensubtitles.org/gfx/icons/from_trusted.gif
is this? [SubFeatured] http://trac.opensubtitles.org/projects/opensubtitles/wiki/XMLRPC#SearchSubtitles
@pannal: I think I've implemented most of your OSS patch, can you please review ab29e34f5a7fc949c5bbb5f638d5633b6b193cee? @fernandog: this could be implemented as a per-provider subtitle score but subliminal doesn't allow that yet
Do you think its per provider score is needed? I dont. User should always use score to match everything except resolution cause subs for 720p works for 1080p for example. So 132 score. If he wants a lower score he accepts the risk of getting out of sync subs. Dosnt matter the provider.
By a per provider score I mean given 2 subtitles with the same score, use a provider specific method to use one instead of another. For OpenSubtitles for example we might prefer the one with trusted publisher or the one with the most download count or something like that. It wouldn't affect the global scoring.
ah I see. Sorry about that. Would be nice indeed
trusted providers is a good thing - ref to another issue about trusted.
In SR we handle provider priority doing search first for the 1st provider in the list, then the second. User can re-order that list
Maybe we can improve our code passing all providers in orders of preference, so we will do only one call to search. not multiple calls. Agree @medariox ?
You can use the lower level API for that. High level API gets the best subtitles within the given parameters, the provider from which it comes from is not a criteria and never will be.
What may happen in the future is that you can give extra score boost to some providers. With proper configuration you could achieve something like "get me the subtitles from addic7ed if it matches on the release group otherwise get it from opensubtitles".
@pannal: Can you give me an example where your fail-safe patch achieves better results? This is for unittest purposes.
@Diaoul I've just reviewed your reimplementation of my tag solution - looks good. Although you're not treating a tag match like a hash match, or didn't i see the full changes?
As for the fail-safe patch: I remember a mis-tagged video file with the hash of a completely different video file. I'll have to search in the SZ forum thread.
Edit: One example (not related to OSS but TheSubDB): https://forums.plex.tv/discussion/comment/1065334#Comment_1065334 Edit 2: There you go, OSS example: https://forums.plex.tv/discussion/comment/1055996/#Comment_1055996
I didn't add the hash match for tag indeed. Anything from the MovieReleaseName is already parsed by guessit and added as matches so the tag match should achieve highest score (after hash...).
I will change the way score is treated. Most likely I will change it to a value from 0 to 1 and scale it so the highest achievable score is equivalent to the hash score.
So for example man.of.steel.2013.720p.bluray.x264-felony.mkv
will have maximum achievable score {'title', 'year', 'resolution', 'format', 'video_codec', 'release_group'}
and hash
will be exactly equivalent to this.
What do you think of this way?
score values will be changed? If so, how get subs with score 132 as today (all matches except resolution? Or this will won't change?
Indeed that's an issue for global consistency... The thing that won't change is the matches. I will keep that for sure. Scoring needs refactoring, I don't know the exact changes I will make yet be there are some issues that need solving like @pannal said.
I didn't add the hash match for tag indeed. Anything from the MovieReleaseName is already parsed by guessit and added as matches so the tag match should achieve highest score (after hash...).
Well, how? Currently the tag match just is a way of receiving more results from OSS. The actual tag doesn't get rated, but the subtitle name on OSS is. Thus, if the subtitle name of a tag match is less accurate than, let's say, some other provider's subtitle match, the tag match actually may get scored lower, although it has been a tag match. That's why I added the fail-safe test on the subtitle name, and if it hits, I rate the tag match as high as a hash match. You can lower that by one score point, but I don't see why - a hash match is just as valuable as a filesize or tag match.
In the docs they say tag is an index, not a field and is related to release names so both MovieReleaseName and SubFileName. MovieReleaseName already has its matches computed, you're right about SubFileName, I should add matches of that one too.
How would it be possible for an exact match to be rated less than some other match from another provider? In my example if MovieReleaseName or SubFileName (to be implemented) is man.of.steel.2013.720p.bluray.x264-felony.mkv
then matches will be {'title', 'year', 'resolution', 'format', 'video_codec', 'release_group'}
and no other provider could do better than that as the video filename doesn't carry more information than this.
The only two cases I see is additional metadata parsed from enzyme that could add a match and a hash match. Those are legitimate cases IMO.
Because man.of.steel.2013.720p.bluray.x264-felony.mkv
doesn't get returned by the API, but the name the uploader entered. At least that's what I've experienced.
A tag search for man.of.steel.2013.720p.bluray.x264-felony.mkv
could return a subtitle for man of steel 2013 engsubs
and it wouldn't get rated as high as Man of Steel 2013 720p BluRay x264
in a normal search, even though its tag matched. I think a subtitle with a matching tag is more worth than any other guessit-rated subtitle, if certain conditions are met (in case of a movie: title and year match).
OK, I will try to get evidence of that for unittesting and add all matches of the original filename in case of a tag match. I'm currently looking for a way to add some kind of bonus score that is not necessarily a match in the filename (prefer one provider over another, etc.)
Found an example:
{'hash': '0', 'matched_by': 'tag', 'language': <Language [en]>,
'subtitle_id': 1953339005, 'movie_imdb_id': 2161930, 'encoding': 'utf-8',
'series_season': 1, 'page_link': 'http://www.opensubtitles.org/en/subtitles/4795381/sid-95c1h390kcfjmiltlrfevmf800/house-of-cards-chapter-1-en',
'movie_release_name': ' House of Cards (2013) - 01x01 - Chapter 1.WEBrip-NTb',
'content': None, 'series_episode': 1, 'movie_name': '"House of Cards" Chapter 1',
'movie_year': 2013, 'hearing_impaired': False, 'movie_kind': 'episode'}
Please continue in this issue: #572
I've made some changes to the OpenSubtitles provider in Sub-Zero to overcome some of the issues some users have reported.
Subliminal relies on the file hash too much; there are too many subtitles on OSS with wrong or duplicate hashes set, thus downloading wrong subtitles for the videos. (As hashes are "blindly" trusted by subliminal)
I've added a fail-safe to the hash-matching itself: https://github.com/pannal/Sub-Zero.bundle/blob/master/Contents/Libraries/Shared/subliminal_patch/patch_subtitle.py#L32
What I've also noticed is, that OpenSubtitles doesn't get queried with the movie name as a separate criterium, if the hash exists - this also is a problem as it narrows the search for matching subtitles too much. Fixed here: https://github.com/pannal/Sub-Zero.bundle/commit/af7434e35d84bf663730e9d2c7d82f62c1154e0f