bitmagnet-io / bitmagnet

A self-hosted BitTorrent indexer, DHT crawler, content classifier and torrent search engine with web UI, GraphQL API and Servarr stack integration.
https://bitmagnet.io/
MIT License
2.51k stars 102 forks source link

Language detection glitches in video classifier #58

Open mgdigital opened 1 year ago

mgdigital commented 1 year ago

Describe the bug We get quite a few incorrect language detections from the video classifier (especially for TV shows) due to how it looks for 3-letter ISO language codes in the section after the episode number.

To Reproduce If you have "Series Name EP05E01 Episode Name Includes San Francisco" then this gets detected as being Sanskrit, or something with "Mac" in the episode name would be detected as Macedonian. Incidentally I'm not sure if Sanskrit should even be a possibility here as I think it's an ancient written language...?

Expected behavior 3-letter language codes should not be confused with 3 letter words that happen to be a language code. Ideally this should not be at the cost of missing genuine language codes. Perhaps something is needed to detect the episode name part though it could be tricky to do this reliably.

nilsherzig commented 10 months ago

The TMDB API provides endpoints for alternative and country dependent titles for shows and movies. It might be possible to reconstruct the language of the file with this information