McCloudS / subgen

Autogenerate subtitles using OpenAI Whisper Model via Jellyfin, Plex, Emby, Tautulli, or Bazarr
MIT License

Improve LRC generation #83

Open Chaphasilor opened 2 months ago

Chaphasilor commented 2 months ago

Currently there are a few issues when generating LRC files for lyrics:

I'm not sure how many of these issues are under your control or could be fixed manually, or whether you're even willing to improve the LRC generation. But I wanted to discuss them anyway :)

All of this was tested using the default settings, aside from setting up the Jellyfin connection and a transcribe folder. So maybe using another model is a better solution? Although I don't think all issues would be solved by that.

McCloudS commented 2 months ago

Hey, thanks for the writeup! LRC was added at someone else's request; I haven't used it myself.

As far as I know, there is no way to handle the instrumental/music aspect using the current model. As you mentioned, increasing the 'detect-language' is only for the actual detect-language webhook; it will not have any impact outside of using it in Bazarr at this point.

Having a 'library' of forced-languages doesn't work in my head. Say I want it to be fr or en, but Whisper detects it as German. What's my next step?

To fix the line breaks, you could change the CUSTOM_REGROUP back to cm_sp=,* /,_sg=.5_mg=.3+3_sp=.* /。/?/? and it should clean it up.

The rest of what you are seeing are hallucinations caused by the model, and there is no way to fix them here (see: https://github.com/openai/whisper/discussions/928 and https://github.com/openai/whisper/discussions/679). They would have to be fixed upstream.

If you want to take a crack at fixing some of the other stuff for LRC, I'd take a PR. You'd probably want to look at https://github.com/McCloudS/subgen/blob/5c96212570f8fa9c1c7c64fcd29083a06c420fa9/subgen.py#L486 and https://github.com/McCloudS/subgen/blob/5c96212570f8fa9c1c7c64fcd29083a06c420fa9/subgen.py#L550

Chaphasilor commented 2 months ago

> Having a 'library' of forced-languages doesn't work in my head. Say I want it to be fr or en, but Whisper detects it as German. What's my next step?

I was hoping the model would maybe offer multiple languages with varying confidence scores (e.g. de: 0.8, en: 0.4, fr: 0.2), which would allow you to use the matching language with the highest score, falling back to the originally detected language if none of the allow-listed languages is present.
But I take it that isn't the case?
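
The selection rule described above can be sketched in a few lines. This is only an illustration of the idea, not code from subgen: `pick_language`, its parameters, and the probability values are hypothetical, and it assumes the model's per-language probabilities are already available as a dictionary.

```python
from typing import Mapping, Sequence


def pick_language(
    probs: Mapping[str, float],
    allowed: Sequence[str],
    detected: str,
) -> str:
    """Return the allow-listed language with the highest probability.

    Falls back to the originally detected language when none of the
    allow-listed languages appears in the probabilities.
    (Hypothetical helper; not part of subgen or Whisper.)
    """
    candidates = {lang: p for lang, p in probs.items() if lang in allowed}
    if not candidates:
        return detected
    return max(candidates, key=lambda lang: candidates[lang])


# Made-up scores: German is detected, but only en/fr are allow-listed.
scores = {"de": 0.6, "en": 0.3, "fr": 0.1}
print(pick_language(scores, allowed=["en", "fr"], detected="de"))
print(pick_language(scores, allowed=["ja"], detected="de"))
```

With these made-up scores, the first call picks `en` and the second falls back to `de`.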

I wasn't aware of the custom regroup, I'll give it a try!

Would you be opposed to some kind of static regex to get rid of some of the more common hallucinations?
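
Such a filter might look roughly like the sketch below. The patterns are illustrative guesses at typical hallucinated phrases, not a vetted list, and `filter_lrc` is a hypothetical helper rather than anything that exists in subgen. It assumes simple LRC lines of the form `[mm:ss.xx]text`.

```python
import re

# Illustrative patterns only (assumption): tune these against real output.
HALLUCINATION_PATTERNS = [
    r"thank(s| you)\s+for\s+watching",
    r"subtitles?\s+by\b",
]
HALLUCINATION_RE = re.compile("|".join(HALLUCINATION_PATTERNS), re.IGNORECASE)

# Matches a leading LRC timestamp such as [00:12.34] and captures the lyric text.
LRC_LINE = re.compile(r"^\[\d+:\d{2}(?:\.\d{1,3})?\](.*)$")


def filter_lrc(lrc_text: str) -> str:
    """Drop timestamped lines whose lyric text matches a known hallucination."""
    kept = []
    for line in lrc_text.splitlines():
        match = LRC_LINE.match(line)
        if match and HALLUCINATION_RE.search(match.group(1)):
            continue  # skip the hallucinated line
        kept.append(line)  # keep lyrics, metadata tags, and anything unparsed
    return "\n".join(kept)


sample = "\n".join([
    "[00:01.00]Hello world",
    "[00:05.00]Thanks for watching!",
    "[00:09.00]Subtitles by someone",
])
print(filter_lrc(sample))
```

Lines that do not look like timestamped lyrics are passed through unchanged, so metadata tags such as `[ar:...]` survive the filter.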

McCloudS commented 2 months ago

I think the model will provide an array of probabilities, though I haven't messed with it. Your idea makes sense now. I'll see if there is any easy way to get that array.

Yup, open to any regex you want to try to throw in.

Chaphasilor commented 2 months ago

> As you mentioned, increasing the 'detect-language' is only for the actual detect-language webhook; it will not have any impact outside of using it in Bazarr at this point.

Do I understand correctly that the detect-language setting will only use the detected language for sending it to Bazarr, but not for the STT? Or is Bazarr actually doing the detection, and sending the result back to subgen?

McCloudS commented 2 months ago

Bazarr will request detect-language if it doesn’t know the language of a file. Whisper does the detection and sends the result back to Bazarr. Bazarr then uses that result to force the language on a subsequent call to generate a subtitle. With the way LRC files are made, by contrast, Whisper autodetects the language and uses it for the rest of the file.

There’s no easy way to get the probabilities of languages without rewriting the flow of the program.

Chaphasilor commented 2 months ago

Okay, and does the Whisper-based autodetection use the configured first 30s (DETECT_LANGUAGE_LENGTH), another duration, or the entire file?
Because it seems like some languages should be easily detectable, but have a long instrumental intro.

I got a bit confused by your comment about the 'detect-language'...

McCloudS commented 2 months ago

At this point in time DETECT_LANGUAGE_LENGTH only works with Bazarr. I'm looking at adding it to the rest of the flow. What you're seeing now is the default 30 seconds Whisper uses.

Chaphasilor commented 2 months ago

Alright, thanks for the clarification. Being able to configure it would be very useful for lyrics.
An option to set the duration as a percentage of track length would also be nice!
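
As a rough sketch of what a percentage-based option could compute (the function name, parameters, and the 30-second floor are assumptions for illustration, not existing subgen settings):

```python
from typing import Optional


def detection_seconds(
    track_seconds: float,
    percent: Optional[float] = None,
    fixed_seconds: float = 30.0,
    min_seconds: float = 30.0,
) -> float:
    """Return how many seconds of audio to use for language detection.

    Without `percent`, use a fixed window. With `percent`, use that share
    of the track, but never less than `min_seconds`. In both cases the
    window is capped at the track length.
    (Hypothetical helper; not part of subgen.)
    """
    if percent is None:
        return min(fixed_seconds, track_seconds)
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")
    window = track_seconds * percent / 100
    return min(max(window, min_seconds), track_seconds)


# A 4-minute track at 25% yields a 60-second detection window.
print(detection_seconds(240, percent=25))
```

The floor keeps short tracks from being judged on only a few seconds of audio, and the cap keeps the window from exceeding the track itself.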

McCloudS commented 2 months ago

I'm working on it, but it may not come to fruition.

How many of your files are not in the language you want? Could you not force the language 100% of the time and get your desired transcription?