argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License

First token logProb thresholding #90

Closed jkrukowski closed 6 months ago

jkrukowski commented 6 months ago

Should resolve https://github.com/argmaxinc/WhisperKit/issues/37

I did some testing, and it looks like the best value for first-token-log-prob-threshold is -0.7. I tested it using audio taken from https://github.com/openai/whisper/discussions/1873
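For intuition about the two values compared in this PR, a log probability can be converted back to a plain probability. The snippet below is only a sketch: it assumes the log probabilities are natural logarithms, and the helper name `probability(fromLogProb:)` is made up for illustration and is not part of WhisperKit.

```swift
import Foundation

// Hypothetical helper (not WhisperKit API): converts a natural-log
// probability into a plain probability in the range 0...1.
func probability(fromLogProb logProb: Double) -> Double {
    return exp(logProb)
}

// The two threshold values compared in this PR.
let looserThreshold = -1.0
let stricterThreshold = -0.7

print(probability(fromLogProb: looserThreshold))   // about 0.37
print(probability(fromLogProb: stricterThreshold)) // about 0.50
```

Under that assumption, a threshold of -0.7 rejects a first token unless the model assigns it roughly a 50% probability or more, while -1.0 accepts anything above roughly 37%.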

To replicate the results, you can run the following commands:

swift run whisperkit-cli transcribe --model-path openai_whisper-large-v3 --audio-path 3min-kitchen-audio.mp3 --temperature-fallback-count 0 --first-token-log-prob-threshold=-1.0 --verbose

and

swift run whisperkit-cli transcribe --model-path openai_whisper-large-v3 --audio-path 3min-kitchen-audio.mp3 --temperature-fallback-count 0 --first-token-log-prob-threshold=-0.7 --verbose

Setting the temperature-fallback-count param to 0 is important here. If it is not provided, the decoder will try to sample again with a higher temperature when the first token is not a valid one.
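The interaction between the threshold and the fallback count can be sketched roughly as follows. This is an illustrative model, not WhisperKit's actual implementation: the `WindowOutcome` type, the `outcome` function, and the sample log probability of -0.85 are all hypothetical.

```swift
// Hypothetical outcome for one audio window; not a WhisperKit type.
enum WindowOutcome: Equatable {
    case keepDecoding
    case retryWithHigherTemperature
    case emitEmptyWindow
}

// Decide what happens once the first token of a window has been sampled.
func outcome(
    firstTokenLogProb: Double,
    threshold: Double,
    fallbacksRemaining: Int
) -> WindowOutcome {
    // A first token at or above the threshold is trusted, so decoding continues.
    guard firstTokenLogProb < threshold else { return .keepDecoding }
    // Below the threshold: retry only while fallbacks are still allowed.
    return fallbacksRemaining > 0 ? .retryWithHigherTemperature : .emitEmptyWindow
}

// Made-up first-token log probability, chosen to fall between the two thresholds.
let sampleLogProb = -0.85

print(outcome(firstTokenLogProb: sampleLogProb, threshold: -1.0, fallbacksRemaining: 0)) // keepDecoding
print(outcome(firstTokenLogProb: sampleLogProb, threshold: -0.7, fallbacksRemaining: 0)) // emitEmptyWindow
```

With the fallback count at 0, the threshold alone decides whether a window is decoded or dropped, which makes the two commands directly comparable.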

Running the 1st one will give you the following (or similar) output:

[WhisperKit] [0.00 --> 30.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [30.00 --> 34.00] <|startoftranscript|><|jw|><|translate|><|0.00|> Add the chicken and stir-fry for 2 minutes<|4.00|>
[WhisperKit] [34.00 --> 38.00] <|4.00|> Add the chicken and stir-fry for 2 minutes<|8.00|>
[WhisperKit] [38.00 --> 42.00] <|8.00|> Add the chicken and stir-fry for 2 minutes<|12.00|>
[WhisperKit] [42.00 --> 46.00] <|12.00|> Add the chicken and stir-fry for 2 minutes<|16.00|>
[WhisperKit] [46.00 --> 50.00] <|16.00|> Add the chicken and stir-fry for 2 minutes<|20.00|>
[WhisperKit] [50.00 --> 54.00] <|20.00|> Add the chicken and stir-fry for 2 minutes<|24.00|>
[WhisperKit] [54.00 --> 58.00] <|24.00|> Add the chicken and stir-fry for 2 minutes<|28.00|>
[WhisperKit] [58.00 --> 88.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [88.00 --> 118.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [118.00 --> 148.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [148.00 --> 178.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [178.00 --> 180.00] <|startoftranscript|><|endoftext|>

Running the 2nd one will give you the following (or similar) output:

[WhisperKit] [0.00 --> 30.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [30.00 --> 60.00] <|startoftranscript|><|endoftext|>
[WhisperKit] [60.00 --> 90.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [90.00 --> 120.00] <|startoftranscript|><|endoftext|>
[WhisperKit] [120.00 --> 150.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [150.00 --> 180.00] <|startoftranscript|><|nospeech|><|endoftext|>

ZachNagengast commented 6 months ago

@jkrukowski I think the -1.0 default is fine unless explicitly set. Before reviewing, can you write a brief paragraph about how this logic is intended to work?

atiorh commented 6 months ago

Great work, @jkrukowski!