ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Finetuning models for audio_ctx support #1951

Open abb128 opened 8 months ago

abb128 commented 8 months ago

It's possible to fine-tune models so they can use audio_ctx more freely, without affecting their knowledge too much.
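
For anyone wiring this up through the C API rather than ./main: the -ac flag corresponds to the audio_ctx field of whisper_full_params. A minimal sketch, assuming ctx, pcm and n_samples (16 kHz mono float PCM) are set up elsewhere:

// assumes: struct whisper_context * ctx; const float * pcm; int n_samples
struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
wparams.audio_ctx = 500; // equivalent to passing -ac 500 to ./main
if (whisper_full(ctx, wparams, pcm, n_samples) != 0) {
    fprintf(stderr, "whisper_full() failed\n");
}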

Example with default settings (notice the ~3x speed difference):

$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:09.760]   and so my fellow Americans ask not what your country can do for you ask what you can do for
[00:00:09.760 --> 00:00:10.760]   You are a country.

whisper_print_timings:     load time =    47.05 ms
whisper_print_timings:     fallbacks =   0 p /   1 h
whisper_print_timings:      mel time =    17.20 ms
whisper_print_timings:   sample time =   389.59 ms /   762 runs (    0.51 ms per run)
whisper_print_timings:   encode time =   191.74 ms /     2 runs (   95.87 ms per run)
whisper_print_timings:   decode time =     5.03 ms /     2 runs (    2.51 ms per run)
whisper_print_timings:   batchd time =  1040.05 ms /   752 runs (    1.38 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  1699.19 ms
$ ./main -m tiny_en_acft_q8_0.bin -f samples/jfk.wav -ac 500
[...]
[00:00:00.000 --> 00:00:07.880]   And so, my fellow Americans ask not what your country can do for you
[00:00:07.880 --> 00:00:09.880]   ask what you can do for your...
[00:00:09.880 --> 00:00:10.880]   country.

whisper_print_timings:     load time =    60.26 ms
whisper_print_timings:     fallbacks =   1 p /   0 h
whisper_print_timings:      mel time =    15.26 ms
whisper_print_timings:   sample time =    62.74 ms /   186 runs (    0.34 ms per run)
whisper_print_timings:   encode time =   208.25 ms /     2 runs (  104.13 ms per run)
whisper_print_timings:   decode time =    12.02 ms /     5 runs (    2.40 ms per run)
whisper_print_timings:   batchd time =   189.45 ms /   169 runs (    1.12 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   556.35 ms

Example with greedy search and no timestamps (notice the finetuned model doesn't repeat itself):

$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
 And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country for you. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do for your country. Ask what you can do

whisper_print_timings:     load time =    41.61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    13.48 ms
whisper_print_timings:   sample time =    97.74 ms /     1 runs (   97.74 ms per run)
whisper_print_timings:   encode time =   114.27 ms /     1 runs (  114.27 ms per run)
whisper_print_timings:   decode time =   506.76 ms /   219 runs (    2.31 ms per run)
whisper_print_timings:   batchd time =     3.95 ms /     2 runs (    1.98 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   783.24 ms
$ ./main -m ft3-quant/tiny_en_acft_q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 500
[...]
 And so my fellow Americans ask not what your country can do for you, ask what you can do for your

whisper_print_timings:     load time =    46.31 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    16.60 ms
whisper_print_timings:   sample time =     9.33 ms /     1 runs (    9.33 ms per run)
whisper_print_timings:   encode time =    95.40 ms /     1 runs (   95.40 ms per run)
whisper_print_timings:   decode time =    47.55 ms /    22 runs (    2.16 ms per run)
whisper_print_timings:   batchd time =     3.45 ms /     2 runs (    1.73 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =   222.61 ms

Models and method are available here: https://github.com/futo-org/whisper-acft

Feedback and comments are welcome! The finetuning method probably isn't perfect; it may need fewer epochs, more data, or less aggressive random reduction of the audio context during training, but it still produces good results.

Related to #137, but I thought I'd open a new issue to discuss this specific method.

(Edit: the original results were from an older version of whisper.cpp, which showed a 10x speed difference with default beam search. I have updated the results to a56f435fd475afd7edf02bfbf9f8c77f527198c2, and the speed difference is no longer as significant, but it is still there.)

ggerganov commented 8 months ago

Wow! This looks like very important work. Would love to give this a try at some point.

Any reason to prefer -ac 500 over -ac 512? Round numbers are generally better for performance, though depending on the backend implementation there might not be much difference

Do the fine-tuned models work only for a specific value of -ac, or can it be varied all the way to 1500?

abb128 commented 8 months ago

@ggerganov The audio context can be varied from roughly 100 all the way to 1500. Very low values can sometimes be used, but they may produce sketchy results or fall into repetition loops in the same way. More short examples in the training data may help mitigate this; I used google/fleurs, whose shortest example is 3.18s, meaning the model hasn't seen anything shorter than roughly -ac 159.
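
For reference on that arithmetic: the encoder produces 1500 frames per 30s window, i.e. one frame per 20ms of audio, so 3.18s works out to about 159 frames. A small sketch of the conversion (the function name is illustrative, not from whisper.cpp):

#include <cmath>

// one encoder frame per 20 ms of audio (1500 frames / 30 s)
int min_audio_ctx_for_seconds(double seconds) {
    return (int) std::ceil(seconds / 0.020); // e.g. 3.18 s -> 159
}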

Context as low as -ac 32 does end up working with jfk.wav specifically:

$ ./main -m tiny_en_acft_q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 32
 And...

The reason I used -ac 500 was just to emphasize the difference with jfk.wav, because the default model happens not to repeat itself on jfk.wav when you use -ac 512 in particular.

(normal model)
$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f samples/jfk.wav -nt -ng -nf -bo 1 -bs 1 -ac 512
 And so my fellow Americans ask not what your country can do for you. Ask what you can do for your country.

This doesn't mean the normal model will always work fine if you just use -ac 512; there are many cases where 512 fails.

(normal model)
$ ./main -m ggml-model-whisper-tiny.en-q8_0.bin -f ~/Music/example3.wav -nt -ng -nf -bo 1 -bs 1 -ac 512
 people are never gonna know, you know what it is and there doesn't need to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he needs to be that knowledge. He's in the fascinatingly awkward position of not being able to say how
(finetuned model)
$ ./main -m tiny_en_acft_q8_0.bin -f ~/Music/example3.wav -nt -ng -nf -bo 1 -bs 1 -ac 512
 people are never going to know, you know what it is and there doesn't need to be that knowledge. He's in the fascinatingly awkward position of not being able to say how much he

soupslurpr commented 8 months ago

Was the original tiny model's audio_ctx scaled from 0 to 1500 just based on the audio length going from 0 to 30 seconds? It would be interesting to see the results of scaling from 0 to 1500 but adding 256, up to a max of 1500, as that is what I'm using and it seems to work pretty well.
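
For concreteness, a sketch of that heuristic using the same 20 ms-per-frame conversion as above (names are illustrative, not from whisper.cpp):

#include <algorithm>
#include <cmath>

// scale audio_ctx with the audio length, add 256 frames of headroom,
// and clamp to the 1500-frame maximum
int heuristic_audio_ctx(double seconds) {
    const int frames = (int) std::ceil(seconds / 0.020);
    return std::min(1500, frames + 256);
}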

abb128 commented 8 months ago

@soupslurpr I did some preliminary tests on this. It seems like the tiny.en model doesn't react well to just +256; it needs +512 to finally get something usable, whereas the finetuned model stays roughly stable.

[image: top and bottom graphs show identical data, just zoomed differently]

base.en with +256 works, though:

[image]

(2048 is the baseline; it always gets clamped to the 1500 maximum.)

Of course, this was all evaluated with the HF Transformers implementation, which probably differs from whisper.cpp in many respects, but I'd say it's a good indication for the finetuned models.

soupslurpr commented 8 months ago

Hm, interesting. So the whisper.cpp implementation might be more resilient to lower audio_ctx.

Edit: actually, it seems 512 is needed for whisper.cpp as well. I didn't notice because I was including silence at the end, which was increasing the audio_ctx used.

zhouwg commented 8 months ago

@abb128, thanks so much. This is very helpful for this PoC.

Performance of real-time transcription on a Xiaomi 14 improved very significantly.

Before fine-tuning:

[screenshot from 2024-03-16 21-18-24]

After fine-tuning:

[screenshot from 2024-03-20 16-40-19]

But this fine-tune also brings an unexpected side effect: whisper.cpp sometimes produces incorrect/repeated tokens, or the app suddenly crashes.


p_params->max_tokens      = 256;
p_params->temperature_inc = 0.0f;
// one encoder frame covers 320 input samples (20 ms at 16 kHz); +16 frames of headroom, capped at the 1500-frame maximum
p_params->audio_ctx       = std::min(1500, (int) ceil((double) num_samples / 320.0) + 16);
if (WHISPER_SAMPLING_GREEDY == n_decoding_mode) {
    p_params->strategy       = WHISPER_SAMPLING_GREEDY;
    p_params->greedy.best_of = 1; // https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L264
} else {
    p_params->strategy              = WHISPER_SAMPLING_BEAM_SEARCH;
    p_params->beam_search.beam_size = 5; // https://github.com/openai/whisper/blob/f82bc59f5ea234d4b97fb2860842ed38519f7e65/whisper/transcribe.py#L265
    p_params->greedy.best_of        = 5;
}

BTW, sorry to bother you: I really don't know the meaning of the above code. Could you help point out what/where the problem is? Thanks so much.