Describe the bug
We get quite a few incorrect language detections from the video classifier (especially for TV shows) due to how it looks for 3-letter ISO language codes in the section after the episode number.
To Reproduce
If you have "Series Name EP05E01 Episode Name Includes San Francisco" then this gets detected as being Sanskrit, or something with "Mac" in the episode name would be detected as Macedonian. Incidentally I'm not sure if Sanskrit should even be a possibility here as I think it's an ancient written language...?
Expected behavior
3-letter language codes should not be confused with 3 letter words that happen to be a language code. Ideally this should not be at the cost of missing genuine language codes. Perhaps something is needed to detect the episode name part though it could be tricky to do this reliably.
The TMDB API provides endpoints for alternative and country dependent titles for shows and movies. It might be possible to reconstruct the language of the file with this information
Describe the bug We get quite a few incorrect language detections from the video classifier (especially for TV shows) due to how it looks for 3-letter ISO language codes in the section after the episode number.
To Reproduce If you have "Series Name EP05E01 Episode Name Includes San Francisco" then this gets detected as being Sanskrit, or something with "Mac" in the episode name would be detected as Macedonian. Incidentally I'm not sure if Sanskrit should even be a possibility here as I think it's an ancient written language...?
Expected behavior 3-letter language codes should not be confused with 3 letter words that happen to be a language code. Ideally this should not be at the cost of missing genuine language codes. Perhaps something is needed to detect the episode name part though it could be tricky to do this reliably.