Improve LRC generation

Chaphasilor commented 2 months ago

Currently there are a few issues when generating LRC files for lyrics:

Instrumental tracks are not properly detected. Here are a few examples of LRC files generated for some of my music:

Examples
Direct - So Sure ```lrc [00:14.53] This is the end of this video, I hope you enjoyed this video, If you [00:20.19] did hit that thumbs up button, it helps me to make good content for [00:26.03] you, other then that, I will see you in tomorrow's video, peace out. [00:58.57] Thanks for watching, I hope you enjoyed this video, If you did hit [00:59.97] that thumbs up button, it helps me to make good content for you, [00:59.97] other then that, I will see you in tomorrow's video, peace out. [01:13.84] Thanks for watching, I hope you enjoyed this video, If you did hit [01:28.28] that thumbs up button, it helps me to make good content for you, [01:28.28] other then that, I will see you in tomorrow's video, peace out. [01:44.45] Thanks for watching, I hope you enjoyed this video, If you did hit [01:52.73] that thumbs up button, it helps me to make good content for you, [01:52.84] other then that, I will see you in tomorrow's video, peace out. [02:28.02] Thanks for watching, I hope you enjoyed this video, If you did hit [02:29.97] that thumbs up button, it helps me to make good content for you, [02:29.97] other then that, I will see you in tomorrow's video, peace out. [02:58.56] Thanks for watching, I hope you enjoyed this video, If you did hit [02:59.97] that thumbs up button, it helps me to make good content for you, [02:59.97] other then that, I will see you in tomorrow's video, peace out. [03:16.03] Thanks for watching, I hope you enjoyed this video, If you did hit [03:22.87] that thumbs up button, it helps me to make good content for you, [03:23.02] other then that, I will see you in tomorrow's video, peace out. [03:55.96] Thanks for watching, I hope you enjoyed this video, If you did hit [03:57.43] that thumbs up button, it helps me to make good content for you, [03:57.43] other then that, I will see you in tomorrow's video, peace out. 
[04:01.87] Thanks for watching, I hope you enjoyed this video, If you did hit [04:03.50] that thumbs up button, it helps me to make good content for you, [04:03.50] other then that, I will see you in tomorrow's video, peace out. ``` Direct - Opal ```lrc [00:28.30] Hello, and welcome to a new episode of my channel, where I'm going [00:29.53] to be showing you how to make the most of your time in your life. [00:29.53] I hope you enjoy this video, and I hope you enjoy the rest of your day. [00:58.57] I don't know what to do with my life, I don't know what to do with my life [01:28.57] I don't know what to do with my life [01:58.57] I don't know what to do with my life [02:28.58] I don't know what to do with my life [02:59.15] I don't know what to do with my life [03:14.96] I don't know what to do with my life ``` ENV - Brave ```lrc [01:34.43] ចតសកសាបនបាថកន [01:35.84] កាងត។ំរិណ្ gesamហែកសាង︶បោានដិْ។ [01:42.92] ឡឹូ្ក절បបោរិ ំហងកសំឯពពнова�ើមបាងleans�ន។ [01:44.31] ឡ � testimony ឡើង Standing�ហែ៖ង� Manufacture ``` Droptek - Science ```lrc [01:00.59] ប � Sugun ប ប ប ប ប ប ទ ᢔ ប ᢔ ។ �Что, ប ម។ voud paid ក ក ក ថ �athi យ០។ ម០ ។. [01:14.29] ប យ០ ម។ ។ ។ �ietet ។ып៙។។។ 🌈. 
``` Falcon Funk - Catnip Trip (Perkulat0r Remix) ```lrc [00:00.00] పఢ్ధిటాల్ మాలో కేట్ మాలో క్ందిఎలిందలి మాలో మాలలో వారిందిలో [00:09.24] మాలో శంగామాండి వాభా ఉం పబింది మారింది �納డి వారింద కారిఁ ఎసావా [01:57.45] 5.5 cm x 5 cm [02:04.34] 6 cm x 6 cm [02:12.59] 7 cm x 7 cm [02:13.24] 8 cm x 8 cm [02:23.40] 9 cm x 9 cm [02:35.09] 10 cm x 10 cm [02:36.50] 11 cm x 11 cm [02:39.59] 12 cm x 12 cm [02:46.03] 13 cm x 13 cm [02:47.34] 14 cm x 14 cm [02:56.40] 15 cm x 15 cm [03:01.41] 16 cm x 16 cm [03:04.36] 17 cm x 17 cm [03:11.90] 18 cm x 18 cm [03:20.96] 19 cm x 19 cm [03:27.63] 20 cm x 20 cm [03:31.00] 21 cm x 21 cm [03:35.53] 22 cm x 21 cm [03:42.84] 23 cm x 23 cm [03:44.59] 24 cm x 24 cm [03:55.41] 25 cm x 25 cm [03:56.81] 26 cm x 26 cm [04:06.34] 27 cm x 27 cm [04:08.53] 29 cm x 29 cm [04:23.57] 29 cm x 29 cm ``` Falcon Funk & Bossfight ```lrc [00:19.35] Hey guys, welcome back to my channel, today [00:29.98] I'm going to be showing you how to create a [03:00.09] My, my, my, my, my ``` Intercom - Decoy World (feat. 
Park Avenue) (notice the "Thank you for watching" at the end, that is pretty common) ```lrc [00:11.83] I stayed awake last night, cause I couldn't close my eyes And see you another night [00:23.60] I drove myself crazy thinking You'd take my [00:28.80] wildest dreams and Tear them all to the ground [00:36.07] So I tried to create a decoy world for you To destroy in my mind [00:47.39] You can stay and believe You're tearing me apart [00:54.88] While I'm coming to life While I'm coming to life [01:45.82] You can stay and believe You're tearing me apart [01:47.85] I couldn't keep the secret You found my darkest [01:53.28] demons And brought them out in the light [01:59.84] So I ran to all the preachers Despite having every reason To shut down and mobilize [02:11.97] So I prayed to the gods For one last safe and grace against all odds [02:23.80] And I built an escape using all my energy Just to come back to life, back to life [02:59.12] I couldn't keep the secret You found my darkest [02:59.47] demons And brought them out in the light [03:08.71] So I ran to all the preachers Despite having every reason To shut down and mobilize [03:09.74] So I ran to all the preachers Despite having every reason To shut down and mobilize [03:49.69] Thank you for watching! ``` Hosini - Flyga ```lrc [01:03.17] ḩ� ning Ḥᶽ [01:12.65] She cures the tip in a combo lamp for 30 seconds. [01:12.65] She cures the tip in a combo lamp for 30 seconds. [01:28.54] She cures the tip in a combo lamp for 30 seconds. [01:31.34] She cures the tip in a combo lamp for 30 seconds. [01:50.14] She cures the tip in a combo lamp for 30 seconds. [01:52.93] She cures the tip in a combo lamp for 30 seconds. [02:07.73] She cures the tip in a combo lamp for 30 seconds. [02:10.53] She cures the tip in a combo lamp for 30 seconds. [02:25.78] She cures the tip in a combo lamp for 30 seconds. [02:31.22] She cures the tip in a combo lamp for 30 seconds. [02:38.58] She cures the tip in a combo lamp for 30 seconds. 
[02:52.34] She cures the tip in a combo lamp for 30 seconds. [02:55.13] She cures the tip in a combo lamp for 30 seconds. [03:00.38] She cures the tip in a combo lamp for 30 seconds. [03:12.13] She cures the tip in a combo lamp for 30 seconds. [03:20.40] She cures the tip in a combo lamp for 30 seconds. [03:30.28] She cures the tip in a combo lamp for 30 seconds. [03:33.65] She cures the tip in a combo lamp for 30 seconds. [03:48.68] She cures the tip in a combo lamp for 30 seconds. [03:51.47] She cures the tip in a combo lamp for 30 seconds. [04:09.47] She cures the tip in a combo lamp for 30 seconds. [04:23.02] She cures the tip in a combo lamp for 30 seconds. [04:24.89] She cures the tip in a combo lamp for 30 seconds. [04:27.83] She cures the tip in a combo lamp for 30 seconds. ``` Inova - Grime ```lrc [00:11.35] Music [02:46.36] Thanks for watching, I'll see you in the next one! [03:00.00] Thanks for watching, I'll see you in the next one! ``` Inova - Enraged ```lrc [00:29.64] Hubsan x4 H502E Desire [00:58.57] Thanks for watching please subscribe and hit that like button..... [01:28.57] Thanks for watching please subscribe and hit that like button..... [01:58.57] Thanks for watching please subscribe and hit that like button..... [02:26.81] Thanks for watching please subscribe and hit that like button..... [02:58.12] Thanks for watching please subscribe and hit that like button..... [03:16.56] Thanks for watching please subscribe and hit that like button..... ``` Airmov - PRESENCE ```lrc [00:28.57] This video is a derivative work of the Touhou Project. [03:32.50] You ```

It would be nice if there was a way to detect instrumental tracks and then skip generating lyrics. Maybe the ML model returns some confidence weights that could be used along with a threshold? The thresholds could also be different for audio files than for video files, to make sure subtitles aren't affected by higher thresholds.
Getting rid of the most common random phrases ("Thanks for watching", "Subscribe", "welcome to my channel", etc.) would also be a very nice addition.
Some non-instrumental tracks are also not properly detected:

Examples
Direct & Matt Van - I Don't Mind ```lrc [00:00.00] වූ්යෙන්ණය හැකින්තියිය, දැක් පිතින්යිටියිටට දැකින්තියි වීඩිනින්යි [00:11.96] සුයෙන්ට් හානමන්නමිටි හැකින්තිය හැකින්තයිිටිටිටටටට කළඩ� [00:31.64] ශන්හ් ඉඩින්මට කිරීම් ඉධා මන් කරන් කරන් අවශ්රයකරය සමුන් ඉඩින් කරන් [00:43.42] මයකර හැන්නමට හොඳින් මන්න ඉඩින් මයකරයේ හැන් මිහ්ලට කිරා ඔබනට හැන� [01:02.79] අත්ලට සමට කරමණකර ප෸ොහඩවඅි ඔබට එකතු පිසිසින් කරමණක වීත්සමේ. [01:16.20] මිශ්‍ර දින්හයක් දැනන් පිහිනහන් පින්හන්. [01:55.92] ඊලක් පිතු කෝ්බිතු හාරණයක හාරඟ්ඛකක් හෟ් පදිකියේ හාරගන කල් niin [02:14.08] මිත් කරමි, කරමි හැඩාවිකාඦ කරහරි හලලයක් ප්ළුදු සේ ඇත [02:24.78] මිශ්ර කරමි හැඩාවික් කරමි [02:28.56] කරමි කරමි, මිශ්රම අතිකර කරමි [03:45.74] අවශ් මම දින්තාකට කැහියක් එකතු මිශ්රීම් ඇත නැන් පිදුරින් කරන් රත් පිට් [03:47.13] නිටිෝපා කරන් එකතු පි මිශ්රීම් තුණකට මන්න පිට් ඔබට කරන් බිශ්රීම් � ```

I know there is a way to "force" detection of a certain language, but it would be nice to have some kind of allow-list instead (e.g. library only contains certain languages, and the detected language has to be one of those). I could also try increasing the language detection duration.
Sometimes a randomly-formatted "Music" section (or another random section) is created:

Examples
```
[00:28.57] 〔Music〕
```

```
[00:28.57] Music playing
```

```lrc
[00:28.57] 《Joy to the World》
```

```lrc
[00:14.06] Music
[00:15.46] Music
[00:17.48] Music
[00:19.98] Music
```

```lrc
[00:00.00] .
[00:06.51] .
```
Lyrics lines often contain line breaks, which aren't properly handled by LRC parsers, since each line should be one lyric line with a timestamp. The generated files are essentially a mix of synchronized and unsynchronized lyrics:

Examples
AWAY & Midoca & Dark Waves - Too Close ```lrc [00:13.48] Take the long way back to me It's the wrong way, has to be [00:26.69] You pull up in your car, then we sit out in the drive [00:30.28] But I keep the lights on, like you're still out on the highway [00:33.92] Practice in the mirror, everything you wanna say [00:37.85] Hope you come inside, tell me that you wanna stay [00:45.38] I feel alone when you get too close to me [00:52.10] It looks wrong, but we're just too close to see [00:59.38] It's cliche, but the writing's on the wall [01:06.48] Now I wonder why you even came home at all [01:21.07] When you get too close to me [01:38.00] Behind closed doors, it's a black hole [01:45.21] It's an old war with old souls [01:50.12] There was a place in my heart that only you could get to [01:54.64] Now you feel more like a stranger than before I met you [01:58.18] Let me hear the words, everything you never say [02:01.84] I hope you never let go when I push you away [02:09.21] I feel alone when you get too close to me [02:16.37] It looks wrong, but we're just too close to see [02:23.52] It's cliche, but the writing's on the wall [02:30.49] Now I wonder why you even came home at all [02:44.96] When you get too close to me [03:16.41] Have we lost who we are? [03:20.86] Trying to save what we have [03:23.78] Tell each other it's love, even though it feels bad [03:30.63] We should run for our lives [03:34.12] We should never look back [03:37.75] We're just too close to see [03:41.34] Being close makes us sad [04:04.12] Makes us sad [04:09.19] Being close makes us sad ``` Airmov & Trove - Make Me Break ````lrc [00:10.22] I lay awake, think of what I locked away Take [00:15.19] those secrets to the grave, if I can't cave [00:20.30] And it occurred I had taken all my turns I don't seem to ever learn, am I unsafe? 
[00:29.80] Alive if I say, I won't ever waste away All [00:34.50] this life is made for me, cause I know me well [00:39.17] If this is to be done, my will is all I've won Cause I [00:44.78] just can't stop holding on, holding on, holding on to you [00:58.79] Hold my hands, I won't let go, don't step, don't [01:03.84] step me down, this is how you make me break [01:26.40] All I can give now, deep from the underground [01:32.78] Take me away, take me away, take me away now [01:37.48] It's on my sleeve now, stitches tearing out [01:42.70] Take me away, take me away, take me away [01:46.42] Alive if I say, I won't ever waste away All [01:51.29] this life is made for me, cause I know me well [01:55.76] If this is to be done, my will is all I've won Cause I [02:01.57] just can't stop holding on, holding on, holding on to you [02:13.97] All I can give now, deep from the underground [02:24.58] Take me away, take me away, take me away now [02:37.46] Hold my hands, I won't let go, don't step, don't [02:39.93] step me down, this is how you make me break [03:07.46] Thanks for watching! ```

I'm guessing this originates from the improved subtitle formatting, which uses multiple lines. Since it doesn't work well for lyrics though, I'd suggest either removing the line breaks, or (if possible) making the detected lines shorter (maybe by making pause detection more "aggressive" or something, not sure if that's a thing) to properly split the lines.
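
The "remove the line breaks" option could be sketched roughly as follows. This is a minimal illustration, not subgen's actual code: it assumes each transcription segment is a dict with a `start` time in seconds and a `text` string, and the function names `format_lrc_timestamp` and `to_lrc_lines` are hypothetical.

```python
import re


def format_lrc_timestamp(seconds: float) -> str:
    """Format a time in seconds as an LRC tag of the form [mm:ss.xx]."""
    total_centis = round(seconds * 100)
    minutes, remainder = divmod(total_centis, 6000)
    secs, centis = divmod(remainder, 100)
    return f"[{minutes:02d}:{secs:02d}.{centis:02d}]"


def to_lrc_lines(segments):
    """Emit one LRC line per segment, collapsing embedded line breaks.

    Assumes each segment is a dict with 'start' (seconds) and 'text' keys.
    """
    lines = []
    for seg in segments:
        # Replace any run of whitespace (including newlines) with one space.
        text = re.sub(r"\s+", " ", seg["text"]).strip()
        if text:
            lines.append(f"{format_lrc_timestamp(seg['start'])} {text}")
    return lines
```

This keeps every physical line in the file tied to exactly one timestamp, at the cost of longer lines. Splitting into shorter lines would still need to happen during regrouping.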

Not sure how many of these issues are under your control or could be manually fixed, or if you're even willing to improve the LRC generation. But I wanted to discuss these issues anyway :)

All of this was tested using the default settings, aside from setting up the Jellyfin connection and a transcribe folder. So maybe using another model is a better solution? Although I don't think all issues would be solved by that.
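
As a rough illustration of the threshold idea for instrumental tracks: the sketch below assumes each segment carries a `no_speech_prob` value, as the segment dicts returned by openai-whisper's `transcribe` do. Whether the backend used here exposes the same field is an assumption, and the threshold values and the name `looks_instrumental` are hypothetical.

```python
from statistics import mean

# Hypothetical thresholds; real values would need tuning on an actual library.
# Video gets a higher threshold so that subtitles are skipped less readily.
NO_SPEECH_THRESHOLDS = {"audio": 0.5, "video": 0.8}


def looks_instrumental(segments, media_type="audio"):
    """Guess whether a track is instrumental from per-segment no-speech scores.

    Assumes each segment is a dict with a 'no_speech_prob' value in [0, 1].
    A track with no segments at all is treated as instrumental.
    """
    if not segments:
        return True
    threshold = NO_SPEECH_THRESHOLDS[media_type]
    return mean(seg["no_speech_prob"] for seg in segments) > threshold
```

A caller could skip writing the LRC file when this returns `True`. Hallucinated segments do not necessarily come with a high no-speech score, so a check like this would likely catch only some of the cases shown above.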

McCloudS commented 2 months ago

Hey, thanks for the writeup! LRC was added by request of someone else, I haven't used it.

As far as I know, there is no way to handle the instrumental/music aspect using the current model. As you mentioned, increasing the 'detect-language' is only for the actual detect-language webhook, it will not have any impact outside of using it in Bazarr at this point.

Having a 'library' of forced-languages doesn't work in my head. Say I want it to be fr or en, but Whisper detects it as German. What's my next step?

To fix the line breaks, you could change the CUSTOM_REGROUP back to `cm_sp=,* /，_sg=.5_mg=.3+3_sp=.* /。/?/？` and it should clean it up.

The rest of what you are seeing are hallucinations caused by the model, and there is no way to fix them here (see: https://github.com/openai/whisper/discussions/928 and https://github.com/openai/whisper/discussions/679). They would have to be fixed upstream.

If you wanted to give a hack at fixing some of the other stuff for LRC, I'd take a PR. You'd probably want to look at https://github.com/McCloudS/subgen/blob/5c96212570f8fa9c1c7c64fcd29083a06c420fa9/subgen.py#L486 and https://github.com/McCloudS/subgen/blob/5c96212570f8fa9c1c7c64fcd29083a06c420fa9/subgen.py#L550

Chaphasilor commented 2 months ago

> Having a 'library' of forced-languages doesn't work in my head. Say I want it to be fr or en, but Whisper detects it as German. What's my next step?

I was hoping the model would maybe offer multiple languages with varying confidence scores (e.g. de: 0.8, en: 0.4, fr: 0.2), which would allow you to use the matching language with the highest score, falling back to the originally detected language if none of the allow-listed languages is present.
But I take it that isn't the case?
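
To make the allow-list idea concrete, here is a minimal sketch. It assumes a mapping from language code to probability is available from the detection step, which may not be the case in the current program flow. The function name `pick_language` is hypothetical.

```python
def pick_language(probabilities, allowed):
    """Pick the most probable allow-listed language.

    `probabilities` maps language codes to scores (assumed to be available
    from language detection); `allowed` is the set of languages in the
    library. Falls back to the top detected language when no allow-listed
    language appears in the mapping.
    """
    detected = max(probabilities, key=probabilities.get)
    candidates = {
        lang: prob for lang, prob in probabilities.items() if lang in allowed
    }
    if not candidates:
        return detected
    return max(candidates, key=candidates.get)
```

With the scores from the example above (`de: 0.8, en: 0.4, fr: 0.2`) and an allow-list of `en` and `fr`, this selects `en`. With an allow-list that matches none of the candidates, it falls back to `de`.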

I wasn't aware of the custom regroup, I'll give it a try!

Would you be opposed to some kind of static regex to get rid of some of the more common hallucinations?
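
Something along these lines is what I have in mind. It is only a sketch: the phrase list is taken from the examples in this issue and is far from exhaustive, the names are hypothetical, and it assumes the check runs on the lyric text without the timestamp.

```python
import re

# Phrases seen in the hallucinated examples in this issue; not exhaustive.
PHRASES = re.compile(
    r"thank(?:s| you) for watching"
    r"|subscribe"
    r"|welcome (?:back )?to (?:my|a new episode of my) channel"
    r"|hit that (?:thumbs up|like) button",
    re.IGNORECASE,
)

# Lines that consist only of punctuation and/or a "Music" marker,
# e.g. "〔Music〕", "Music playing" or ".".
MARKER = re.compile(r"^[\W_]*(?:music(?: playing)?)?[\W_]*$", re.IGNORECASE)


def is_probable_hallucination(text):
    """Return True if the lyric text matches a known hallucination pattern."""
    return bool(PHRASES.search(text) or MARKER.match(text))


def filter_lines(texts):
    """Drop lyric texts that look like common hallucinations."""
    return [t for t in texts if not is_probable_hallucination(t)]
```

A static list like this will also drop genuine lyrics that happen to contain one of the phrases (a song that actually says "subscribe", for example), so it would probably make sense to have it configurable.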

McCloudS commented 2 months ago

I think the model will provide an array of probabilities, though I haven't messed with it. Your idea makes sense now. I'll see if there is any easy way to get that array.

Yup, open to any regex you want to try to throw in.

Chaphasilor commented 2 months ago

> As you mentioned, increasing the 'detect-language' is only for the actual detect-language webhook, it will not have any impact outside of using it in Bazarr at this point.

Do I understand correctly that the detect-language setting will only use the detected language for sending it to Bazarr, but not for the STT? Or is Bazarr actually doing the detection, and sending the result back to subgen?

McCloudS commented 2 months ago

Bazarr will request detect-language if it doesn’t know the language of a file. Whisper does the detection and sends it back to Bazarr. Then Bazarr will use that to force the language on a subsequent call to generate a subtitle. Whereas the way the LRC is being made, Whisper autodetects the language and uses that for the rest of the file.

There’s no easy way to get the probabilities of languages without rewriting the flow of the program.

Chaphasilor commented 2 months ago

Okay, and is the whisper-based autodetection using the configured first 30s (DETECT_LANGUAGE_LENGTH), or another duration, or the entire file?
Because it seems like some languages should be easily detectable, but have a long instrumental intro.

I got a bit confused by your comment about the 'detect-language'...

McCloudS commented 2 months ago

At this point in time DETECT_LANGUAGE_LENGTH only works with Bazarr. I'm looking at adding it to the rest of the flow. What you're seeing now is the default 30 seconds Whisper uses.

Chaphasilor commented 2 months ago

Alright, thanks for the clarification. Being able to configure it would be very useful for lyrics.
An option to set the duration as a percentage of track length would also be nice!
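
A percentage-based duration could be computed roughly like this. The function name `detection_window` and its parameters are hypothetical; the 30-second floor mirrors the default mentioned above.

```python
def detection_window(track_seconds, fraction=0.25, minimum=30.0, maximum=None):
    """Return how many seconds of audio to use for language detection.

    Uses a fraction of the track length, but never less than `minimum`
    seconds, never more than `maximum` (if given), and never more than
    the track itself.
    """
    window = max(minimum, track_seconds * fraction)
    if maximum is not None:
        window = min(window, maximum)
    return min(window, track_seconds)
```

For a four-minute track and the default fraction of 25%, this yields a 60-second window, which would cover a longer instrumental intro than the fixed 30 seconds.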

McCloudS commented 2 months ago

I'm working on it, but it may not come to fruition.

How many of your files are not in the language you want? Could you not force the language 100% of the time and get your desired transcription?

McCloudS / subgen

Improve LRC generation #83