MestreLion / legendastv

API for Legendas.TV website, world's largest repository of Brazilian Portuguese Movie/TV Series subtitles. Utilities to search, retrieve info, download, extract and match subtitles.

Similarity is not taken into account for certain scenarios #17

Closed ratoaq2 closed 9 years ago

ratoaq2 commented 9 years ago

It's easier to give an example:

Given my configuration: similarity = 0.94

And for the given input file: Deliver us From Evil 2014 SWESUB 720p BluRay x264 Mr Stiffy

I'm getting a subtitle that's only 0.5 similar: dict: {'best': {'compare': u'Deliver Us From Evil LIMITED DVDRip XviD iMBT', 'full': u'/home/osmc/.cache/legendastv/archives/UnitedTeam-correcaobafb93ccc6c52925a9581b3e882ee403/Deliver.Us.From.Evil.LIMITED.DVDRip.XviD-iMBT By UNITED4EVER/Legendas Comuns/Com It\xe1licos/Deliver.Us.From.Evil.LIMITED.DVDRip.XviD-iMBT.srt', 'original': u'Deliver.Us.From.Evil.LIMITED.DVDRip.XviD-iMBT.srt'}, 'similarity': 0.5}

MestreLion commented 9 years ago

The similarity option in legendastv.ini is used only to identify the correct movie title, as title search on the Legendas.tv website is a free text search. So a similarity _threshold_ is needed to make sure we're not picking a completely unrelated movie: if a title is not similar enough, we consider it's not the movie we are looking for, and _discard_ it.
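A minimal sketch of that title threshold, for illustration only. It assumes a `difflib`-based similarity ratio, and `similarity()` and `pick_title()` are hypothetical names, not the actual legendastv functions:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0 (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_title(wanted, candidates, threshold=0.94):
    """Return the most similar title, or None if even the best is below threshold."""
    best = max(candidates, key=lambda t: similarity(wanted, t), default=None)
    if best is None or similarity(wanted, best) < threshold:
        return None  # not similar enough: treat it as a different movie and discard
    return best


found = ["The Truth and Nothing But the Truth", "Nothing But the Truth"]
print(pick_title("Nothing But The Truth", found))
```

With the two titles from the search results in this thread, the episode title is rejected and the movie title is kept, which is the behaviour the threshold is meant to guarantee.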

Once we have the idfilme from the chosen title, subtitles are fetched by this ID, so all subtitles in the list are _known_ to be for this title. The similarity option is no longer used (nor needed) from now on.

Sure, we also use a similarity _rating_ (not a threshold) when ordering the subtitles list, to find the subtitle release that best matches the filename. But there's no need for a threshold: subtitles with poor similarity tend to have a poor overall ranking. Similarity is not the only criterion for ranking, but it has a high weight. Actually, similarity alone accounts for half the score!

Remember that all candidates are guaranteed to be for your movie title (per ID), and wrong episodes were already discarded. And the chosen subtitle is _the best_ candidate among all the others.

Can it still have a poor similarity and be for the wrong release? Sure. But it's still better than the others. It's the best subtitle the Legendas.tv website had to offer for that movie.

The similarity index needs context: when searching for a movie, we have a reference string to compare to, usually from OSDB. We want movie XYZ, similarity is our only criterion, and no other title would suit us. In that context, it makes perfect sense to set an absolute threshold to avoid picking the wrong movie.

For subtitles, context is completely different: similarity is compared against the other candidates, and the best wins. Their absolute index is irrelevant. A subtitle registered as "Movie.XYZ.2004.all bluray releases from DIMENSION/YIFI/FNQ/PQP/VTNC. com ressync do INSUBS" will have a very poor similarity but may be exactly the one you want.
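To make the contrast concrete, here is a rough sketch of relative selection with no threshold. The `difflib` ratio, the separator normalisation, and the names `normalize()` and `best_release()` are assumptions for illustration, not the real legendastv code:

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and turn dots, dashes and other separators into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def best_release(filename, releases):
    """Return (release, similarity) for the closest release; no absolute cutoff."""
    if not releases:
        return None
    target = normalize(filename)
    scored = [(r, SequenceMatcher(None, target, normalize(r)).ratio())
              for r in releases]
    # The winner is whoever beats the other candidates, however low its own ratio
    return max(scored, key=lambda pair: pair[1])


releases = [
    "Faces.Da.Verdade.DVDRip.Dual.XviD.MP3-ZAMENGO",
    "Nothing.But.The.Truth.2008.LiMiTED.720p.BluRay.x264-ARiGOLD",
]
print(best_release("Nothing But The Truth 2008 720p BluRay x264-ARiGOLD", releases))
```

Note that a list with a single, poorly matching release still returns that release: there is nothing to compare it against, so it wins by default.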

Even if it was selected not because it was a great match but because it was the only subtitle found for that title, there's nothing we can do. Better to have a subtitle for that movie than to have none. (But the same does not apply to movie titles.)

Last but not least, subtitle release strings are a very loose format. There is no standard, as opposed to movie titles, which have an "official" name. We have no way to tell whether a poorly similar subtitle is a good fit or not.

MestreLion commented 9 years ago

By the way, 0.94 looks like an extremely high threshold. Aren't you getting too many (valid) titles discarded because of that? Is the built-in filename parser able to extract titles with that level of similarity from your files? I'm impressed! Either your filenames have a very standard format (did you manually rename them?), or my humble parser is better than I thought. Also, it indicates the titles in Legendas.TV database are more trustworthy (and matching OSDB's) than I expected. Good :)

MestreLion commented 9 years ago

Now back to your particular issue:

ratoaq2 commented 9 years ago
2015-02-28 15:53:00,321 DEBUG    Target: /home/osmc/projetos/legendastv/Nothing But The Truth 2008 720p BluRay x264-ARiGOLD
2015-02-28 15:53:00,322 DEBUG    Guessed title info: 'Nothing But The Truth 2008 720p BluRay x264-ARiGOLD' -> {'release': u'Nothing But The Truth 2008 720p BluRay x264 ARiGOLD', 'title': u'Nothing But The Truth', 'year': u'2008'}
2015-02-28 15:53:00,423 DEBUG    OSDB.LogIn(u'', u'***', u'', u'Legendas.TV v1.0') -> {'status': '200 OK', 'seconds': 0.015, 'token': '3eor9lrgk6q327i5gohqvd8s45'}
2015-02-28 15:53:00,424 ERROR    File '/home/osmc/projetos/legendastv/Nothing But The Truth 2008 720p BluRay x264-ARiGOLD' must be at least 65536 bytes
2015-02-28 15:53:00,424 DEBUG    0 OpenSubtitles titles found:

2015-02-28 15:53:00,452 NOTIFY   Logging in Legendas.TV
2015-02-28 15:53:00,457 INFO     Logging in http://legendas.tv/login as *****
2015-02-28 15:53:01,683 NOTIFY   Searching titles for 'Nothing But The Truth'
2015-02-28 15:53:01,684 DEBUG    loading /legenda/sugestao/Nothing+But+The+Truth
2015-02-28 15:53:01,957 DEBUG    Titles found for 'Nothing But The Truth':
    {'title_br': u'The 4400.S04E04.HDTV.XviD-BiA.The Truth and Nothing But the Truth', 'thumb': None, 'title': u'The Truth and Nothing But the Truth', 'season': u'4', 'imdb_id': u'1049219', 'year': u'2007', 'type': u'episode', 'id': u'12926'}
    {'title_br': u'Nothing But the Truth', 'thumb': u'http://i.legendas.tv/poster/tt1073241.jpg', 'title': u'Nothing But the Truth', 'season': None, 'imdb_id': u'1073241', 'year': u'2008', 'type': u'movie', 'id': u'14841'}
2015-02-28 15:53:01,967 NOTIFY   2 titles found
2015-02-28 15:53:01,968 DEBUG    Chosen best for 'Nothing But The Truth' in 'search': {'best': {'title_br': u'Nothing But the Truth', u'search': u'Nothing But the Truth', 'thumb': u'http://i.legendas.tv/poster/tt1073241.jpg', 'title': u'Nothing But the Truth', 'season': None, 'imdb_id': u'1073241', 'year': u'2008', 'type': u'movie', 'id': u'14841'}, 'similarity': 1.0}
2015-02-28 15:53:01,978 NOTIFY   Searching subs for 'Nothing But the Truth'
2015-02-28 15:53:01,978 DEBUG    loading /util/carrega_legendas_busca_filme/14841/1
2015-02-28 15:53:02,372 DEBUG    Subtitles found for 14841:
    {'rating': 10, 'hash': u'c7d50e660aafc1ddca1cf3b79bdfcca4', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 256, 'release': 'Faces.da.Verdade.Dual.ptbr.eng.DvdRip.Xvid.Ac3.Brazilinjapan.by.cinefila', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2010, 2, 25, 13, 19), 'highlight': False, 'user_name': 'cinefala', 'pack': False}
    {'rating': 10, 'hash': u'13d7775dd3075a6c59fc0b52ba3b0aa1', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 389, 'release': 'Nothing.But.The.Truth.2008.BRRip.H264.AAC-SecretMyth.(Kingdom-Release)', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2010, 2, 12, 13, 10), 'highlight': False, 'user_name': 'gamobra', 'pack': False}
    {'rating': 10, 'hash': u'46e2c9179ecc6e0fdd75912fd37b9814', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 388, 'release': 'Faces.Da.Verdade.DVDRip.Dual.XviD.MP3-ZAMENGO', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2010, 2, 3, 21, 50), 'highlight': False, 'user_name': 'jcbandeira', 'pack': False}
    {'rating': 10, 'hash': u'1959971213897a145437ca7423670e67', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 975, 'release': 'Nothing.But.The.Truth.2008.LiMiTED.720p.BluRay.x264-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2010, 1, 20, 19, 43), 'highlight': False, 'user_name': 'acnBR', 'pack': False}
    {'rating': 10, 'hash': u'b429ce5ebcea1c42755bd43fa6ed68ae', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 1732, 'release': 'Nothing.But.The.Truth.LIMITED.DVDRip.XviD.AC3-DEViSE', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2009, 4, 28, 23, 0), 'highlight': False, 'user_name': 'ampg4', 'pack': False}
    {'rating': 10, 'hash': u'70fcfefd6b4ae9dba758f742d1744017', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 4082, 'release': 'Nothing.But.The.Truth.2008.DvdRip-FxM', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2009, 4, 17, 12, 52), 'highlight': False, 'user_name': 'gunca', 'pack': False}
    {'rating': 10, 'hash': u'50f437e78ba63b578f3f816eee520d5d', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 1249, 'release': 'Nothing.But.The.Truth.LiMiTED.DVDRip.XviD-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2009, 4, 17, 0, 5), 'highlight': False, 'user_name': 'daniellce', 'pack': False}
    {'rating': 10, 'hash': u'4f1b1fe567978e6b55c4847732b69468', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 176, 'release': 'Nothing.But.The.Truth.2008.DVDRip.XVID.AC3-TST', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2009, 4, 16, 22, 19), 'highlight': False, 'user_name': 'ricklaferla', 'pack': False}
    {'rating': 10, 'hash': u'24739179e3148c0b4ce70f005d0906ab', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 1033, 'release': 'Nothing.But.The.Truth.LiMiTED.DVDRip.XviD-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2009, 4, 15, 14, 58), 'highlight': False, 'user_name': 'j708', 'pack': False}
    {'rating': 10, 'hash': u'fe1a037d75cf46a3c27d82e3e0fe22d6', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 5431, 'release': 'Nothing.But.The.Truth.2008.DVDSCR.XviD-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', 'date': datetime.datetime(2009, 2, 26, 9, 49), 'highlight': True, 'user_name': 'alcobor', 'pack': False}
2015-02-28 15:53:02,383 NOTIFY   10 subtitles found
2015-02-28 15:53:02,386 DEBUG    Ranked subtitles for {'title_br': u'Nothing But the Truth', u'search': u'Nothing But the Truth', u'episode': u'', 'thumb': u'http://i.legendas.tv/poster/tt1073241.jpg', 'title': u'Nothing But the Truth', u'season': None, u'filename': u'Nothing But The Truth 2008 720p BluRay x264-ARiGOLD', 'imdb_id': u'1073241', 'year': u'2008', 'release': u'Nothing But The Truth 2008 720p BluRay x264 ARiGOLD', u'dirname': u'legendastv', u'type': u'movie', 'id': u'14841'}:
    {'rating': 10, 'hash': u'fe1a037d75cf46a3c27d82e3e0fe22d6', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 5431, 'release': 'Nothing.But.The.Truth.2008.DVDSCR.XviD-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 8.458762886597938, 'date': datetime.datetime(2009, 2, 26, 9, 49), 'highlight': True, 'user_name': 'alcobor', 'pack': False}
    {'rating': 10, 'hash': u'1959971213897a145437ca7423670e67', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 975, 'release': 'Nothing.But.The.Truth.2008.LiMiTED.720p.BluRay.x264-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 8.270104895104895, 'date': datetime.datetime(2010, 1, 20, 19, 43), 'highlight': False, 'user_name': 'acnBR', 'pack': False}
    {'rating': 10, 'hash': u'13d7775dd3075a6c59fc0b52ba3b0aa1', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 389, 'release': 'Nothing.But.The.Truth.2008.BRRip.H264.AAC-SecretMyth.(Kingdom-Release)', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 7.62079831932773, 'date': datetime.datetime(2010, 2, 12, 13, 10), 'highlight': False, 'user_name': 'gamobra', 'pack': False}
    {'rating': 10, 'hash': u'70fcfefd6b4ae9dba758f742d1744017', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 4082, 'release': 'Nothing.But.The.Truth.2008.DvdRip-FxM', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 7.273226773226773, 'date': datetime.datetime(2009, 4, 17, 12, 52), 'highlight': False, 'user_name': 'gunca', 'pack': False}
    {'rating': 10, 'hash': u'50f437e78ba63b578f3f816eee520d5d', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 1249, 'release': 'Nothing.But.The.Truth.LiMiTED.DVDRip.XviD-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 7.268681318681319, 'date': datetime.datetime(2009, 4, 17, 0, 5), 'highlight': False, 'user_name': 'daniellce', 'pack': False}
    {'rating': 10, 'hash': u'24739179e3148c0b4ce70f005d0906ab', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 1033, 'release': 'Nothing.But.The.Truth.LiMiTED.DVDRip.XviD-ARiGOLD', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 7.265934065934066, 'date': datetime.datetime(2009, 4, 15, 14, 58), 'highlight': False, 'user_name': 'j708', 'pack': False}
    {'rating': 10, 'hash': u'4f1b1fe567978e6b55c4847732b69468', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 176, 'release': 'Nothing.But.The.Truth.2008.DVDRip.XVID.AC3-TST', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 7.218165854763792, 'date': datetime.datetime(2009, 4, 16, 22, 19), 'highlight': False, 'user_name': 'ricklaferla', 'pack': False}
    {'rating': 10, 'hash': u'b429ce5ebcea1c42755bd43fa6ed68ae', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 1732, 'release': 'Nothing.But.The.Truth.LIMITED.DVDRip.XviD.AC3-DEViSE', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 6.992931825456097, 'date': datetime.datetime(2009, 4, 28, 23, 0), 'highlight': False, 'user_name': 'ampg4', 'pack': False}
    {'rating': 10, 'hash': u'c7d50e660aafc1ddca1cf3b79bdfcca4', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 256, 'release': 'Faces.da.Verdade.Dual.ptbr.eng.DvdRip.Xvid.Ac3.Brazilinjapan.by.cinefila', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 6.528455284552845, 'date': datetime.datetime(2010, 2, 25, 13, 19), 'highlight': False, 'user_name': 'cinefala', 'pack': False}
    {'rating': 10, 'hash': u'46e2c9179ecc6e0fdd75912fd37b9814', u'language': 'pb', 'title': u'Nothing_But_the_Truth', 'downloads': 388, 'release': 'Faces.Da.Verdade.DVDRip.Dual.XviD.MP3-ZAMENGO', 'flag': 'http://i.legendas.tv/idioma/icon_brazil.png', u'score': 6.179487179487179, 'date': datetime.datetime(2010, 2, 3, 21, 50), 'highlight': False, 'user_name': 'jcbandeira', 'pack': False}
2015-02-28 15:53:02,400 NOTIFY   Downloading 'Nothing.But.The.Truth.2008.DVDSCR.XviD-ARiGOLD' from 'alcobor'
2015-02-28 15:53:02,400 DEBUG    Downloading archive for subtitle from /downloadarquivo/fe1a037d75cf46a3c27d82e3e0fe22d6
2015-02-28 15:53:04,107 DEBUG    Using cached file
2015-02-28 15:53:04,107 DEBUG    Archive saved as '/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030.rar'
2015-02-28 15:53:04,108 DEBUG    2 files in archive 'alcoborc87ce56448f623c506eb6f3e6bf4b030.rar': [u'Nothing But The Truth 2008 DVDSCR XviD-ARiGOLD.srt', u'Legendas.tv.txt']
2015-02-28 15:53:04,108 INFO     1 extracted files in '/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030.rar', filtered by [u'srt']
    u'/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030/Nothing But The Truth 2008 DVDSCR XviD-ARiGOLD.srt'
2015-02-28 15:53:04,109 DEBUG    Arguments: Namespace(backup=True, blacklistfile=u'/home/osmc/.config/legendastv/srtclean_blacklist.txt', encoding=None, fallback='windows-1252', in_place=True, loglevel=20, output_encoding=u'UTF-8', paths=[u'/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030/Nothing But The Truth 2008 DVDSCR XviD-ARiGOLD.srt'], rebuild_index=True, recursive=False)
[DEBUG] Arguments: Namespace(backup=True, blacklistfile=u'/home/osmc/.config/legendastv/srtclean_blacklist.txt', encoding=None, fallback='windows-1252', in_place=True, loglevel=20, output_encoding=u'UTF-8', paths=[u'/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030/Nothing But The Truth 2008 DVDSCR XviD-ARiGOLD.srt'], rebuild_index=True, recursive=False)
2015-02-28 15:53:04,109 INFO     Processing subtitle: '/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030/Nothing But The Truth 2008 DVDSCR XviD-ARiGOLD.srt'
[INFO ] Processing subtitle: '/home/osmc/.cache/legendastv/archives/alcoborc87ce56448f623c506eb6f3e6bf4b030/Nothing But The Truth 2008 DVDSCR XviD-ARiGOLD.srt'
2015-02-28 15:53:04,116 DEBUG    Auto-detected encoding: 'iso-8859-1'
[DEBUG] Auto-detected encoding: 'iso-8859-1'
2015-02-28 15:53:04,171 NOTIFY   DONE!
[NOTIFY] DONE!

I have experimented with some changes that solve the issue, but right now I have no time to describe or discuss them. I'll keep you informed.

MestreLion commented 9 years ago

No, it had enough information to pick a better candidate before the download. In your case the 2nd candidate should've been chosen. It scored really close to the 1st, but the 1st scored a few extra points because it is a highlighted subtitle.

The solution is simple: fine-tune the weights in rankSubtitles(). Try promoting similarity from 5 to 6 and demoting highlight from 2 to 1 and see if it picks the right candidate.
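A toy illustration of why that weight change could flip the top two candidates. This is not the real rankSubtitles() formula: it keeps only the two criteria mentioned here, and the similarity values (0.70 and 0.95) are made-up placeholders rather than numbers taken from the log:

```python
def score(sub, weights):
    """Weighted sum of similarity and highlight only (other criteria omitted)."""
    return (weights["similarity"] * sub["similarity"]
            + weights["highlight"] * (1 if sub["highlight"] else 0))


def rank(subs, weights):
    """Best-scoring subtitle first."""
    return sorted(subs, key=lambda s: score(s, weights), reverse=True)


# Hypothetical similarity values for the two top candidates in this thread
subs = [
    {"release": "Nothing.But.The.Truth.2008.DVDSCR.XviD-ARiGOLD",
     "similarity": 0.70, "highlight": True},
    {"release": "Nothing.But.The.Truth.2008.LiMiTED.720p.BluRay.x264-ARiGOLD",
     "similarity": 0.95, "highlight": False},
]

current = {"similarity": 5, "highlight": 2}  # weights as described above
tuned = {"similarity": 6, "highlight": 1}    # the suggested adjustment

print(rank(subs, current)[0]["release"])
print(rank(subs, tuned)[0]["release"])
```

Under these assumed numbers the highlighted DVDSCR release wins with the current weights, and the 720p BluRay release wins once similarity is promoted and highlight demoted. Whether the same happens with the real scores depends on the other criteria in the ranking.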