abdeladim-s / subsai

🎞️ Subtitles generation tool (Web-UI + CLI + Python package) powered by OpenAI's Whisper and its variants 🎞️
https://abdeladim-s.github.io/subsai/
GNU General Public License v3.0
1.16k stars · 96 forks

Providing lyrics / Dictionary / Hints #38

Open jmealo opened 1 year ago

jmealo commented 1 year ago

Hello, this project is awesome! I'm trying to build a tool that generates karaoke timings automatically... I was wondering if there's a way to provide the lyrics to the model ahead of time so it only has to determine the timing? It already gets 70-90% of the way there, but matching the output back to the lyrics based on syllables/lines/string similarity can be problematic, because when it's wrong, it's wildly wrong.
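
For context, this is roughly the word-level timing data I'm starting from (a sketch using the upstream openai-whisper package directly just for illustration; the file name and model size are placeholders):

```python
# Sketch: pull per-word timestamps out of Whisper, then try to map them
# onto the known lyric lines afterwards. Assumes `pip install openai-whisper`.
import whisper

model = whisper.load_model("medium")          # placeholder model size

# word_timestamps=True attaches per-word start/end times to each segment
result = model.transcribe("song.mp3", word_timestamps=True)

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:7.2f} {word["end"]:7.2f} {word["word"]}')
```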

abdeladim-s commented 1 year ago

Hello @jmealo, glad you found the project useful!

If I understand your question correctly, you already have the transcription and you just want to generate the timing for it, is that correct? But I don't understand the last part: "matching it based on syllables/lines/string similarity can be problematic because when it's wrong, it's wildly wrong". Could you please provide an example?

jmealo commented 1 year ago

Thanks for the quick reply:

Current workflow: I don't provide the lyrics; I transcribe first and then try to match the output back to the known lyrics. The parts that are transcribed incorrectly don't line up as a string-similarity match or a syllable match, and I tried many variations without landing on one that worked reliably.
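
Here's the gist of the matching step, so you can see where it falls apart (a simplified sketch; `words` is the word-level timestamp list from Whisper, and the function names are just illustrative, not from subsai):

```python
# Simplified sketch of matching one known lyric line against Whisper's
# word-level output.
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_line(lyric_line: str, words: list, start: int = 0):
    """Pick the window of transcribed words that best matches one lyric line.

    `words` is a list of dicts like {"word": ..., "start": ..., "end": ...}.
    Returns (i, j, score), so the line's timing is
    words[i]["start"] .. words[j - 1]["end"].
    """
    n = max(1, len(lyric_line.split()))
    best = (start, min(start + n, len(words)), 0.0)
    for i in range(start, max(start + 1, len(words) - n + 1)):
        window = " ".join(w["word"].strip() for w in words[i:i + n])
        score = similarity(lyric_line, window)
        if score > best[2]:
            best = (i, i + n, score)
    return best
```

When Whisper mishears a line, the best-scoring window can jump to a completely different part of the song, which is the "wildly wrong" failure I mean.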

I'm just outlining my current workflow and why it would be helpful to provide a transcript to steer it in the right direction.

If the models don't support it, I may be able to supplement them with one that does. The lyrics I'm working with are explicit, but I can try to find a PG example to share.

abdeladim-s commented 1 year ago

I really don't understand yet :sweat_smile:

But I can see the problem with music lyrics: the results won't be as good because, besides the background music, the words and sentences are sometimes not in their regular form due to artistic choices. In addition, I guess Whisper was not trained on this kind of task.

Yeah, I think the Whisper model does not support that: you cannot feed it the transcription, it only accepts audio. Do you have a model that supports this?

That being said, if you share an example, maybe we can think of a solution together.
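
One thing that might be worth experimenting with in the meantime (just a sketch, I haven't tried it on songs; I mean the upstream openai-whisper package here, not something subsai exposes as such): `transcribe()` has an `initial_prompt` argument. It won't take the lyrics as a transcript to align against, it only nudges the decoder towards that vocabulary, but it might reduce the misheard words:

```python
# Sketch: pass the known lyrics as a decoding hint via initial_prompt.
# This does NOT force Whisper to output these exact words or align to them.
import whisper

lyrics = open("lyrics.txt", encoding="utf-8").read()   # placeholder path

model = whisper.load_model("medium")
result = model.transcribe(
    "song.mp3",                  # placeholder audio file
    word_timestamps=True,
    initial_prompt=lyrics,       # Whisper only keeps the tail of a long prompt
)
```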

jmealo commented 1 year ago

I'd be happy to share what I have when I have a cohesive script and some samples.

abdeladim-s commented 1 year ago

Great @jmealo, looking forward to it :+1: