Diaoul / subliminal

Subtitles, faster than your thoughts
http://subliminal.readthedocs.org
MIT License
2.39k stars 312 forks source link

OpenSubtitles: don't rely on hash that much #538

Closed pannal closed 8 years ago

pannal commented 8 years ago

I've made some changes to the OpenSubtitles provider in Sub-Zero to overcome some of the issues some users have reported.

Subliminal relies on the file hash too much; there are too many subtitles on OSS with wrong or duplicate hashes set, thus downloading wrong subtitles for the videos. (As hashes are "blindly" trusted by subliminal)

I've added a fail-safe to the hash-matching itself: https://github.com/pannal/Sub-Zero.bundle/blob/master/Contents/Libraries/Shared/subliminal_patch/patch_subtitle.py#L32

What I've also noticed is, that OpenSubtitles doesn't get queried with the movie name as a separate criterium, if the hash exists - this also is a problem as it narrows the search for matching subtitles too much. Fixed here: https://github.com/pannal/Sub-Zero.bundle/commit/af7434e35d84bf663730e9d2c7d82f62c1154e0f

Diaoul commented 8 years ago

So this is actually a two-fold issue, I'll try to split my answer.

  1. From a quick look at your patch I think the fail-safe won't work with thesubdb as thesubdb only provides the hash. At some point we need to trust the provider. If we start assuming provider give bad information this is where we're going to implement some weird code have inconsistent behavior. Then it's going to be difficult to implement new providers that fit properly in subliminal without patching subliminal's logic again. If trust issues have to be solved, it should be in a per-provider basis, not subliminal-wide. In the case of OSS, here seems a better place.
  2. Have you read OpenSubtitles API documentation? If the documentation is accurate, what you did in your patch doesn't actually changes anything as it is ignored server side. I agree this is suboptimal.
pannal commented 8 years ago
  1. I agree totally. maybe a specialcase for OSS would be fitting, then. That's the reason I've disabled TheSubDB support in Sub-Zero completely for now. Hashes are a good thing, but when they're controlled by people, they become a nuisance tbh.
  2. Yes I did. Doesn't change the fact, that it works, and it does so, better :) Their API docs aren't up-to-date. I've talked to them about their tag-matching (which would be a huge improvement for subliminal, as you currently can't search for an exact filename in the alternative-filenames section of a subtitle. You can, with the "tag" criterion, though): https://forum.opensubtitles.org/viewtopic.php?p=33008#p32971 - it was broken (not working), though documented.
pannal commented 8 years ago

BTW I think you misunderstood the docs: If you define moviehash and moviebytesize, then imdbid and query in same array are ignored. If you define imdbid, then moviehash, moviebytesize and query is ignored. If you define query, then moviehash, moviebytesize and imdbid is ignored. - That is only true for the current array.

But you're adding separate arrays as criterions in your query to OSS, so that doesn't apply.

And I have user reports that the matching on OSS for movies has massively improved since I removed the hash-fixation.

Diaoul commented 8 years ago

Filenames are also controlled by people, they can mess with them too.

Indeed, I missed the "in the same array" part, let's do this then, the more subtitles the better. Subliminal's scoring will get rid of the bad ones. Although we might hit the result limit and that should emit a warning because we might miss some good subtitles.

pannal commented 8 years ago

Yeah, that's why I would try to trust a filename one score point below hash, and impose the same limitations I've added as a fail-safe, to hash.

By the way: Is there a reason why the score value of "hash" is the maximum you allow in subliminal? I've patched that, as I needed provider-specific boosts. Addic7ed in general has better quality subtitles for many shows, which subliminal currently can't "decide", because the quality of a subtitle itself is not reflected by its filename or its matching hash.

pannal commented 8 years ago

Are you available on some other medium? Jabber/Skype/whatever? It'd be nice to actually talk about some core concepts of subliminal outside of this issue system.

Diaoul commented 8 years ago

Per provider score adjustments should be added in the future, not sure how yet but definitly part of the scoring. Same with provider specific ratings (download count, good/bad rating, etc.)

subliminal on freenode.

Diaoul commented 8 years ago

No sure placing your trust in the filename is the proper response. Take a look at filenames given by providers, some could match a renamed video file in the kind Movie.mkv. What about the rest of the path? Say I sort my videos like that: Movies/2014/Frozen.mkv.

Besides, if all the information in the filename matches the subtitle, where's the added value of the filename matching? I agree it could be helpful but I would trust it as much as codec and things like that so it can give a little boost to subtitles with exact matching.

pannal commented 8 years ago

I can give you a specific user-case for that.

One user of Sub-Zero searched for movie "Abc" with language "pt-BR". Results were there but their score weren't high enough to fit his settings, because some douche didn't fill in the metadata of his subtitle on OSS properly. In the alternative release names of that subtitle my user's filename was listed and he asked me why the subtitle wasn't downloaded.

That's the exact use case for the filename. Not matching withing OSS's title itself, but in the alternative release names list (which can't be queried directly, only by using "tag").

fernandog commented 8 years ago

@Diaoul sorry to jump in. don't we use OS settings "trusted" uploader? can we pass that as argument?

they use this image in the site: http://static.opensubtitles.org/gfx/icons/from_trusted.gif

is this? [SubFeatured] http://trac.opensubtitles.org/projects/opensubtitles/wiki/XMLRPC#SearchSubtitles

Diaoul commented 8 years ago

@pannal: I think I've implemented most of your OSS patch, can you please review ab29e34f5a7fc949c5bbb5f638d5633b6b193cee? @fernandog: this could be implemented as a per-provider subtitle score but subliminal doesn't allow that yet

fernandog commented 8 years ago

Do you think its per provider score is needed? I dont. User should always use score to match everything except resolution cause subs for 720p works for 1080p for example. So 132 score. If he wants a lower score he accepts the risk of getting out of sync subs. Dosnt matter the provider.

Diaoul commented 8 years ago

By a per provider score I mean given 2 subtitles with the same score, use a provider specific method to use one instead of another. For OpenSubtitles for example we might prefer the one with trusted publisher or the one with the most download count or something like that. It wouldn't affect the global scoring.

fernandog commented 8 years ago

ah I see. Sorry about that. Would be nice indeed

trusted providers is a good thing - ref to another issue about trusted.

In SR we handle provider priority doing search first for the 1st provider in the list, then the second. User can re-order that list

Maybe we can improve our code passing all providers in orders of preference, so we will do only one call to search. not multiple calls. Agree @medariox ?

image

Diaoul commented 8 years ago

You can use the lower level API for that. High level API gets the best subtitles within the given parameters, the provider from which it comes from is not a criteria and never will be.

What may happen in the future is that you can give extra score boost to some providers. With proper configuration you could achieve something like "get me the subtitles from addic7ed if it matches on the release group otherwise get it from opensubtitles".

@pannal: Can you give me an example where your fail-safe patch achieves better results? This is for unittest purposes.

pannal commented 8 years ago

@Diaoul I've just reviewed your reimplementation of my tag solution - looks good. Although you're not treating a tag match like a hash match, or didn't i see the full changes?

As for the fail-safe patch: I remember a mis-tagged video file with the hash of a completely different video file. I'll have to search in the SZ forum thread.

Edit: One example (not related to OSS but TheSubDB): https://forums.plex.tv/discussion/comment/1065334#Comment_1065334 Edit 2: There you go, OSS example: https://forums.plex.tv/discussion/comment/1055996/#Comment_1055996

Diaoul commented 8 years ago

I didn't add the hash match for tag indeed. Anything from the MovieReleaseName is already parsed by guessit and added as matches so the tag match should achieve highest score (after hash...).

I will change the way score is treated. Most likely I will change it to a value from 0 to 1 and scale it so the highest achievable score is equivalent to the hash score. So for example man.of.steel.2013.720p.bluray.x264-felony.mkv will have maximum achievable score {'title', 'year', 'resolution', 'format', 'video_codec', 'release_group'} and hash will be exactly equivalent to this. What do you think of this way?

fernandog commented 8 years ago

score values will be changed? If so, how get subs with score 132 as today (all matches except resolution? Or this will won't change?

Diaoul commented 8 years ago

Indeed that's an issue for global consistency... The thing that won't change is the matches. I will keep that for sure. Scoring needs refactoring, I don't know the exact changes I will make yet be there are some issues that need solving like @pannal said.

pannal commented 8 years ago

I didn't add the hash match for tag indeed. Anything from the MovieReleaseName is already parsed by guessit and added as matches so the tag match should achieve highest score (after hash...).

Well, how? Currently the tag match just is a way of receiving more results from OSS. The actual tag doesn't get rated, but the subtitle name on OSS is. Thus, if the subtitle name of a tag match is less accurate than, let's say, some other provider's subtitle match, the tag match actually may get scored lower, although it has been a tag match. That's why I added the fail-safe test on the subtitle name, and if it hits, I rate the tag match as high as a hash match. You can lower that by one score point, but I don't see why - a hash match is just as valuable as a filesize or tag match.

Diaoul commented 8 years ago

In the docs they say tag is an index, not a field and is related to release names so both MovieReleaseName and SubFileName. MovieReleaseName already has its matches computed, you're right about SubFileName, I should add matches of that one too. How would it be possible for an exact match to be rated less than some other match from another provider? In my example if MovieReleaseName or SubFileName (to be implemented) is man.of.steel.2013.720p.bluray.x264-felony.mkv then matches will be {'title', 'year', 'resolution', 'format', 'video_codec', 'release_group'} and no other provider could do better than that as the video filename doesn't carry more information than this.

The only two cases I see is additional metadata parsed from enzyme that could add a match and a hash match. Those are legitimate cases IMO.

pannal commented 8 years ago

Because man.of.steel.2013.720p.bluray.x264-felony.mkv doesn't get returned by the API, but the name the uploader entered. At least that's what I've experienced.

A tag search for man.of.steel.2013.720p.bluray.x264-felony.mkv could return a subtitle for man of steel 2013 engsubs and it wouldn't get rated as high as Man of Steel 2013 720p BluRay x264 in a normal search, even though its tag matched. I think a subtitle with a matching tag is more worth than any other guessit-rated subtitle, if certain conditions are met (in case of a movie: title and year match).

Diaoul commented 8 years ago

OK, I will try to get evidence of that for unittesting and add all matches of the original filename in case of a tag match. I'm currently looking for a way to add some kind of bonus score that is not necessarily a match in the filename (prefer one provider over another, etc.)

Diaoul commented 8 years ago

Found an example:

{'hash': '0', 'matched_by': 'tag', 'language': <Language [en]>, 
'subtitle_id': 1953339005, 'movie_imdb_id': 2161930, 'encoding': 'utf-8',
'series_season': 1, 'page_link': 'http://www.opensubtitles.org/en/subtitles/4795381/sid-95c1h390kcfjmiltlrfevmf800/house-of-cards-chapter-1-en',
'movie_release_name': ' House of Cards (2013) - 01x01 - Chapter 1.WEBrip-NTb',
'content': None, 'series_episode': 1, 'movie_name': '"House of Cards" Chapter 1',
'movie_year': 2013, 'hearing_impaired': False, 'movie_kind': 'episode'}
Diaoul commented 8 years ago

Please continue in this issue: #572