Vaibhavs10 / insanely-fast-whisper

Apache License 2.0

Timestamps are too tight when repetition_penalty is present #208

Open Brodski opened 6 months ago

Brodski commented 6 months ago

Title says it all. Is there something that could be done to make the timestamps more reasonable so they don't break up mid-sentence?

Here is my code and a couple comparisons after it.

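    import torch
    from transformers import pipeline
    from transformers.utils import is_flash_attn_2_available

    # Excerpt from inside a function: model_size_insane, my_device and filename are defined elsewhere.
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_size_insane,  # large-v3
        torch_dtype=torch.float16,
        device=my_device,
        # Use FlashAttention 2 when it is installed, otherwise fall back to PyTorch SDPA.
        model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
    )
    generate_kwargs = {
        "language": "en",
        "temperature": 0.2,
        "repetition_penalty": 3.0,
        "task": "transcribe",
    }

    outputs = pipe(
        filename,
        chunk_length_s=30,
        batch_size=24,
        return_timestamps=True,
        generate_kwargs=generate_kwargs,
    )
    return outputs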
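
For reference, a rough sketch of how the pipeline result can be turned into the `start --> end: text` lines used in this issue. It assumes `outputs["chunks"]` is a list of dicts with a `timestamp` pair in seconds and a `text` string, which is what the pipeline returns with `return_timestamps=True`; the helper names `format_timestamp` and `format_chunks` are made up for illustration and are not the exact code I ran.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_chunks(chunks):
    """Turn pipeline-style chunks into 'start --> end: text' lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        # Assumption: the end of the last chunk can be None when no
        # closing timestamp was predicted, so print a placeholder.
        end_text = format_timestamp(end) if end is not None else "?"
        lines.append(f"{format_timestamp(start)} --> {end_text}: {chunk['text'].strip()}")
    return lines


# Hand-written chunks shaped like outputs["chunks"].
example_chunks = [
    {"timestamp": (1.5, 2.25), "text": " Like straight up"},
    {"timestamp": (3661.5, None), "text": " Straight up"},
]
for line in format_chunks(example_chunks):
    print(line)
```

With real output this would be called as `format_chunks(outputs["chunks"])`.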

With a little formatting, here is the output for one transcribed section. As you can see, over about 7 seconds the output creates a separate timestamp for nearly every word, even though the speaker was talking slowly:

00:11:29,440 --> 00:11:29,460: To figure out the words you kno?
00:11:29,919 --> 00:11:29,940: Cry..
00:11:32,759 --> 00:11:32,779: Im crying a little bit
00:11:33,879 --> 00:11:33,899: But im can' t
00:11:37,559 --> 00:11:37,580: ...
00:11:37,919 --> 00:11:37,940: Uh
00:11:38,399 --> 00:11:38,419: Don''t
00:11:38,519 --> 00:11:38,539: You
00:11:39,980 --> 00:11:40,000: Know
00:11:40,159 --> 00:11:40,179: Dont
00:11:40,519 --> 00:11:40,539: Dare
00:11:41,080 --> 00:11:41,100: Compliment
00:11:41,259 --> 00:11:41,279: Me
00:11:41,679 --> 00:11:41,700: Ankle
00:11:42,179 --> 00:11:42,200: Man
00:11:42,440 --> 00:11:42,460: Your
00:11:43,440 --> 00:11:43,460: Seriously
00:11:43,639 --> 00:11:43,659: One of
00:11:43,720 --> 00:11:43,740: The
00:11:44,120 --> 00:11:44,139: Nicest
00:11:44,480 --> 00:11:44,500: People
00:11:44,620 --> 00:11:44,639: That
00:11:44,799 --> 00:11:44,820: Ever
00:11:45,080 --> 00:11:45,360: Met Straight Up Don't you dare compliment me. Ankle, man! You're like seriously one of the nicest people I've ever met
00:11:46,620 --> 00:11:47,120: Like straight up
00:11:47,620 --> 00:11:47,720: Straight Up
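
To put a number on "too tight", here is a small sketch that parses lines in the format above and flags segments with an implausibly short duration. The 100 ms threshold and the helper names are arbitrary assumptions for illustration; the sample lines are copied from the listing above.

```python
import re

# Matches lines such as "00:11:29,440 --> 00:11:29,460: Cry.."
LINE_RE = re.compile(
    r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+): (.*)"
)


def to_ms(hours, minutes, seconds, millis):
    """Convert HH, MM, SS, mmm string fields to integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def short_segments(lines, min_ms=100):
    """Return (duration_ms, text) for every segment shorter than min_ms."""
    flagged = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a timestamp line
        *stamps, text = match.groups()
        duration = to_ms(*stamps[4:]) - to_ms(*stamps[:4])
        if duration < min_ms:
            flagged.append((duration, text))
    return flagged


sample = [
    "00:11:29,440 --> 00:11:29,460: To figure out the words you kno?",
    "00:11:29,919 --> 00:11:29,940: Cry..",
    "00:11:46,620 --> 00:11:47,120: Like straight up",
]
print(short_segments(sample))
```

The first two sample segments last about 20 ms each, far too short for the text they carry, while the third lasts 500 ms and is not flagged.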

But if I run the same code without repetition_penalty, the timestamps are more reasonable:

00:11:25,220 --> 00:11:28,279: It's hard to figure out the words you know?
00:11:29,379 --> 00:11:29,980: Crying
00:11:29,980 --> 00:11:32,759: I'm crying a little bit
00:11:32,759 --> 00:11:33,879: But i can't
00:11:33,879 --> 00:11:34,279: Like
00:11:34,279 --> 00:11:37,840: Uh
00:11:37,840 --> 00:11:38,440: Don' t
00:11:38,440 --> 00:11:41,279: You dare compliment me
00:11:41,279 --> 00:11:41,960: Ankleman
00:11:41,960 --> 00:11:44,440: You're seriously one of nicest people
00:11:44,440 --> 00:11:45,080: Straight up Don't you dare compliment me. Ankle, man! You're like seriously one of the nicest people I've ever met.
00:11:45,580 --> 00:11:46,580: Like straight up.
00:11:47,080 --> 00:11:47,539: Straight up.
00:11:47,600 --> 00:11:47,840: Ankle,
00:11:47,940 --> 00:11:48,419: you are
00:11:48,419 --> 00:11:49,980: like
00:11:49,980 --> 00:11:51,399: One of the nicest dudes
00:11:51,399 --> 00:11:53,840: that I have ever randomly talked to on the internet

It might be nice to have something like condition_on_previous_text=False and/or vad_filter=True. I was using those options in other repos, like faster-whisper, and its output, though much slower, was kinda better:

00:11:25,220 --> 00:11:28,260: It's hard to, like, figure out the words, you know?
00:11:29,360 --> 00:11:29,800: Cry.
00:11:29,980 --> 00:11:30,660: I'm, um...
00:11:30,660 --> 00:11:33,020: I'm crying a little bit, man.
00:11:33,080 --> 00:11:34,980: But I can't, like...
00:11:35,540 --> 00:11:35,980: I can't...
00:11:37,300 --> 00:11:39,920: I don't, you know...
00:11:39,920 --> 00:11:41,240: Don't you dare compliment me.
00:11:41,340 --> 00:11:45,080: Ankle, man, you're, like, seriously one of the nicest people I've ever met.
00:11:45,560 --> 00:11:46,580: Like, straight up.
00:11:47,080 --> 00:11:47,540: Straight up.
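
For comparison, a minimal sketch of a faster-whisper call with the two options mentioned above. `condition_on_previous_text` and `vad_filter` are keyword arguments of `WhisperModel.transcribe` in faster-whisper; the device, compute type, and the wrapper function name are assumptions for illustration and not necessarily the settings behind the output above.

```python
# Options discussed above, passed straight to WhisperModel.transcribe.
TRANSCRIBE_OPTIONS = {
    "language": "en",
    "condition_on_previous_text": False,  # do not feed earlier text back in as a prompt
    "vad_filter": True,  # skip non-speech audio before decoding
}


def transcribe_with_faster_whisper(path, model_size="large-v3"):
    """Return a list of (start, end, text) tuples for the audio file at path."""
    # Imported lazily so this module loads even without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU with float16 support is available.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, **TRANSCRIBE_OPTIONS)
    # segments is a generator; the transcription runs as it is consumed.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```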

Brodski commented 6 months ago

So not a perfect solution, but these configs seem to work pretty well. Based on issue #115 (https://github.com/Vaibhavs10/insanely-fast-whisper/issues/115, which said a value < 30s will help), I changed chunk_length_s to 16. I then experimented and found that "repetition_penalty": 1.25 worked well.

    generate_kwargs = {
        "language": 'en',
        "repetition_penalty": 1.25, # this helps
        "task": "transcribe",
    }
    outputs = pipe(
        filename,
        chunk_length_s=16,  # this helps too
        batch_size=24,
        return_timestamps=True,
        generate_kwargs=generate_kwargs,
    )
    return outputs

Using:

Y'all can close this if you want. I'm content with this fix.

Brodski commented 2 months ago

Following up on this after ~4 months: I switched back to faster-whisper because insanely-fast-whisper doesn't work for my goals.

insanely-fast-whisper gets really confused by silence, giving me weird hallucinations and repeated words. Then when I increase the repetition_penalty, it gives weird output; I've even seen emojis when it's too high.

And even then, people naturally repeat themselves often without it really being noticed by the ear, so a high repetition_penalty doesn't work well for casual conversation transcriptions imo.
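
A rough sketch of why a high penalty bites so hard. As far as I know, this mirrors the rule used by the Transformers `RepetitionPenaltyLogitsProcessor`: the score of every token id that already appeared is divided by the penalty if it is positive and multiplied by it if it is negative. The scores and token ids below are invented for illustration, and the function is a plain-Python stand-in, not the library code.

```python
def apply_repetition_penalty(scores, previous_token_ids, penalty):
    """Return a copy of scores with the penalty applied to already-seen token ids."""
    penalized = list(scores)
    for token_id in set(previous_token_ids):
        score = penalized[token_id]
        # Positive scores shrink and negative scores grow more negative,
        # so a seen token always becomes less likely when penalty > 1.
        penalized[token_id] = score / penalty if score > 0 else score * penalty
    return penalized


def best_token(scores):
    """Index of the highest score (greedy choice)."""
    return max(range(len(scores)), key=scores.__getitem__)


scores = [6.0, 4.0, -1.0]  # token 0 is the word the speaker really repeats
seen = [0, 2]              # token ids already generated

mild = apply_repetition_penalty(scores, seen, 1.25)
harsh = apply_repetition_penalty(scores, seen, 3.0)
print(mild, best_token(mild))
print(harsh, best_token(harsh))
```

In this toy case the repeated token still wins at 1.25 but loses to a different token at 3.0, which matches the experience that a genuinely repeated word gets replaced by something else once the penalty is high.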

This project doesn't work in my situation: long audio files of casual conversation, sometimes noisy, with multiple minutes of silence or music.