ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

Hugging Face models converted using this pipeline are too slow #442

Open · spygaurad opened 1 year ago

spygaurad commented 1 year ago
[screenshot: whisper.cpp run output with timings]

Question 1: This is a Whisper medium model fine-tuned on Nepali. Inference on a 39-second audio clip takes forever (13 minutes). Are there any issues with the ggml conversion? @ggerganov The same audio takes 70 seconds with the medium.en model.

Question 2: The transcription output comes out as a fixed 30-second chunk. How can I make the segmentation dynamic, like with the ggml-medium model?

ggerganov commented 1 year ago

Given these results, I believe the fine-tuned model does not output timestamp tokens for some reason. To confirm that, can you provide the output of the same run after adding the -ps command-line argument, which makes the tool print the special tokens in the output?
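
For reference, one way to check for timestamp tokens programmatically, rather than eyeballing the -ps output, is to walk the decoded tokens through the C API and count those at or above whisper_token_beg(ctx), the first timestamp id. This is only a minimal sketch assuming whisper.cpp's public C API; the model path is a placeholder and the 16 kHz mono PCM buffer is loaded elsewhere:

#include <stdio.h>
#include "whisper.h"

// Sketch: run inference and count decoded timestamp tokens.
// "ggml-model.bin" is a placeholder path; pcmf32 must hold 16 kHz
// mono float samples loaded by the caller.
int count_timestamp_tokens(const float * pcmf32, int n_samples) {
    struct whisper_context * ctx = whisper_init_from_file("ggml-model.bin");
    if (ctx == NULL) {
        return -1;
    }

    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_special = true; // same effect as the -ps flag
    params.language      = "ne"; // Nepali, per the model in this thread

    if (whisper_full(ctx, params, pcmf32, n_samples) != 0) {
        whisper_free(ctx);
        return -1;
    }

    int n_ts = 0;
    const whisper_token beg = whisper_token_beg(ctx); // first timestamp token id
    for (int i = 0; i < whisper_full_n_segments(ctx); i++) {
        for (int j = 0; j < whisper_full_n_tokens(ctx, i); j++) {
            if (whisper_full_get_token_id(ctx, i, j) >= beg) {
                n_ts++;
            }
        }
    }
    printf("timestamp tokens decoded: %d\n", n_ts);

    whisper_free(ctx);
    return n_ts;
}

If this prints zero for the fine-tuned model but not for medium.en, the timestamp hypothesis is confirmed, and it would also explain Question 2: without timestamp tokens the decoder has no way to split the 30-second window into shorter segments.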

spygaurad commented 1 year ago
[screenshot: output of the same run with -ps]

@ggerganov This is the output with -ps. One thing I noticed is that the number of runs with my model is 5692, whereas with the medium.en model it is around 92.

I also tried inference with the no-timestamps option (-nt); it still takes too long.

ggerganov commented 1 year ago

I see the transcribe token (50359) is being decoded many times for some reason. This is not supposed to happen. I just pushed a change to master that suppresses the task tokens. Not sure if it will help, but you might want to give it another try.
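
For context, suppressing a token in this setting means forcing its logit to negative infinity before sampling, so the decoder can never emit it again. The snippet below is only a minimal sketch of that general idea, not the actual change pushed to master; 50359 is the transcribe id quoted in this thread, and 50358 (translate) is assumed from the standard multilingual Whisper vocab:

#include <math.h>
#include <stddef.h>

// Sketch of task-token suppression: mask the logits of the task tokens
// so greedy/beam-search sampling can never select them again.
static void suppress_task_tokens(float * logits, int n_vocab) {
    const int task_tokens[] = {
        50358, // <|translate|>  (assumed multilingual vocab id)
        50359, // <|transcribe|> (the token seen repeatedly above)
    };
    for (size_t i = 0; i < sizeof(task_tokens)/sizeof(task_tokens[0]); i++) {
        if (task_tokens[i] < n_vocab) {
            logits[task_tokens[i]] = -INFINITY;
        }
    }
}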

spygaurad commented 1 year ago
[screenshot: output after pulling master]

I pulled master but haven't noticed any performance change.

ggerganov commented 1 year ago

We still see the 50359 token, which is unexpected. I guess the best option is for you to provide instructions for downloading the model so I can test it locally.

haozes commented 1 year ago

I have the same problem here.
After converting my fine-tuned model, it takes a long time in the decode step:

whisper_print_timings:     load time =    73.56 ms
whisper_print_timings:     fallbacks =   2 p /   1 h
whisper_print_timings:      mel time =    33.36 ms
whisper_print_timings:   sample time =   928.93 ms /  1907 runs (    0.49 ms per run)
whisper_print_timings:   encode time =   129.43 ms /     1 runs (  129.43 ms per run)
whisper_print_timings:   decode time =  2592.87 ms /  1899 runs (    1.37 ms per run)
whisper_print_timings:    total time =  3770.88 ms

haozes commented 1 year ago

Update: adding the -nf flag (--no-fallback, "do not use temperature fallback while decoding") fixed it. Now it works:

whisper_print_timings:     load time =    62.46 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    34.00 ms
whisper_print_timings:   sample time =     3.18 ms /     8 runs (    0.40 ms per run)
whisper_print_timings:   encode time =   173.80 ms /     1 runs (  173.80 ms per run)
whisper_print_timings:   decode time =    11.11 ms /     8 runs (    1.39 ms per run)
whisper_print_timings:    total time =   296.14 ms
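
These numbers line up: before -nf the decoder ran 1899 times for a single encode, and afterwards only 8 times, so the extra decode runs evidently came from temperature-fallback retries re-decoding the same window with long, low-quality outputs. For anyone driving whisper.cpp through the C API instead of the CLI, the equivalent switch appears to be the temperature_inc parameter; a minimal sketch, assuming the whisper_full_params fields of this era:

#include "whisper.h"

// Sketch: C-API counterpart of the CLI's -nf / --no-fallback flag.
// With temperature_inc at 0 the decoder makes a single pass per
// 30-second window instead of retrying at increasing temperatures
// whenever the quality checks (entropy / log-probability) fail.
struct whisper_full_params make_no_fallback_params(void) {
    struct whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.temperature_inc = 0.0f; // 0 => never fall back
    return params;
}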