McCloudS / subgen

Autogenerate subtitles using OpenAI Whisper Model via Jellyfin, Plex, Emby, Tautulli, or Bazarr
MIT License

Patch: fix "language_code" in /detect-language endpoint #91

Closed benjroy closed 2 months ago

benjroy commented 2 months ago

First off, thanks for this great tool @McCloudS ! Came across this project over the weekend, and it was super easy to wire into my setup at home. I'm using the bazarr + plex + tautulli bits for integrations.

This PR is a patch for the /detect-language POST endpoint's response.

It prevents Bazarr from throttling whisperai (subgen) provider for 10 minutes with LanguageReverseError

CHANGED:

UNCHANGED:

Responses before:
{"detected_language":"chinese","language_code":"chinese"}
{"detected_language":"english","language_code":"english"}
{"detected_language":"nynorsk","language_code":"nynorsk"}
{"detected_language":"swedish","language_code":"swedish"}
{"detected_language":"welsh","language_code":"welsh"}

Responses after:
{"detected_language":"chinese","language_code":"zh"}
{"detected_language":"english","language_code":"en"}
{"detected_language":"nynorsk","language_code":"nn"}
{"detected_language":"swedish","language_code":"sv"}
{"detected_language":"welsh","language_code":"cy"}
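The shape of the fix can be sketched like this (a minimal illustration, not subgen's actual code: the table below is only an excerpt covering the examples above, while the real patch maps every language Whisper supports to its ISO 639-1 code):

```python
# Excerpt of a Whisper-language-name -> ISO 639-1 lookup, for illustration.
NAME_TO_ISO639_1 = {
    "chinese": "zh",
    "english": "en",
    "nynorsk": "nn",
    "swedish": "sv",
    "welsh": "cy",
}

def detect_language_payload(detected_language: str) -> dict:
    """Build the /detect-language response body with a two-letter language_code."""
    code = NAME_TO_ISO639_1.get(detected_language, detected_language)
    return {"detected_language": detected_language, "language_code": code}
```

Before the patch, `language_code` simply echoed the detected language name; after it, consumers like Bazarr get the two-letter code they expect.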

Background: I have a huge queue of items in Bazarr, and I've been using subgen to generate all of the subtitles over the past few days. In Bazarr, I kept getting intermittent LanguageReverseErrors when using subgen as the whisperai provider. Every time it hit one, Bazarr would throttle the whisperai provider for 10 minutes.

Since whisperai is my only real provider in use, each throttle meant all the remaining items in the queue would blow through, and the search task would finish as fast as Bazarr could read the queue: with no providers left available, the search for each queued item still runs, returns 0 results without error, and the item is marked as processed.

Each night the task ran, it would only process a few to a few dozen items before crashing out. With a "Wanted" list of > 18000 items, the past few days hadn't made much of a dent in that list at all.

I initially thought it was something to do with the nynorsk language in Bazarr, since that was the language logged right before every LanguageReverseError stack trace in the Bazarr logs. Eventually I saw a couple of others (malay, khmer) as well, and that got me looking into it.

Here's what I traced:

Bazarr uses both the "language_code" and the "detected_language" from the response: https://github.com/morpheus65535/bazarr/blob/5429749e72bcbcd960e63704bfac522bd87cc244/libs/subliminal_patch/providers/whisperai.py#L259C1-L259C94

It first tries to match the language from the "language_code", expecting a two-letter code: https://github.com/morpheus65535/bazarr/blob/5429749e72bcbcd960e63704bfac522bd87cc244/libs/subliminal_patch/providers/whisperai.py#L160-L161

When that fails, it tries to match the language by name: https://github.com/morpheus65535/bazarr/blob/5429749e72bcbcd960e63704bfac522bd87cc244/libs/subliminal_patch/providers/whisperai.py#L162-L163

For many languages with simple names, that worked: "english", "french", "german", etc. For languages whose names didn't map exactly, the final Language.fromname(name) call would raise the LanguageReverseError that throttles the provider for 10 minutes.

For the "nynorsk" example, the name that it was trying to match was not "nynorsk" but instead something like "Norwegian Nynorsk": https://github.com/morpheus65535/bazarr/blob/5429749e72bcbcd960e63704bfac522bd87cc244/libs/babelfish/data/iso-639-3.tab#L4748
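The two-step lookup traced above can be sketched like this (names and tables here are illustrative, not Bazarr's or babelfish's actual code; see the linked whisperai.py for the real implementation):

```python
class LanguageReverseError(Exception):
    """Raised when neither lookup matches; Bazarr then throttles the provider."""

# Tiny excerpt of an ISO 639-1 table (babelfish holds the full set).
ALPHA2_TO_NAME = {"en": "English", "sv": "Swedish", "nn": "Norwegian Nynorsk"}
NAME_TO_ALPHA2 = {name.lower(): code for code, name in ALPHA2_TO_NAME.items()}

def pick_language(language_code: str, detected_language: str) -> str:
    # Step 1: try the two-letter code from the response.
    if language_code in ALPHA2_TO_NAME:
        return language_code
    # Step 2: fall back to matching the detected language by name.
    if detected_language.lower() in NAME_TO_ALPHA2:
        return NAME_TO_ALPHA2[detected_language.lower()]
    # Neither matched: Whisper's "nynorsk" is not a two-letter code, and the
    # ISO name for "nn" is "Norwegian Nynorsk", so the old response lands here.
    raise LanguageReverseError(detected_language)
```

With the old response, `pick_language("nynorsk", "nynorsk")` raises; with the patched response, `pick_language("nn", "nynorsk")` succeeds on the first step, so the provider is never throttled.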

I patched my local Bazarr yesterday with a workaround that does this reverse lookup, which fixed the throttling and let my queue process all day. This PR fixes the root of the issue I've been experiencing, without needing that workaround in Bazarr.

McCloudS commented 2 months ago

Thanks for finding this. I ran into this once or twice, but wasn't happening often enough for me to chase that gremlin.