First token logProb thresholding

jkrukowski commented 6 months ago

Should resolve https://github.com/argmaxinc/WhisperKit/issues/37

Added a new parameter, firstTokenLogProbThreshold, to DecodingOptions and CLIArguments.
Added a check in the TextDecoder decodeText method: if the first token's log probability is below the threshold, the decoding loop breaks and the method returns the DecodingResult together with a DecodingFallback property.
Added tests.
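
The check described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the types SketchDecodingResult and FallbackReason, the helper logProb, and the function decodeSketch are hypothetical stand-ins, and greedy token selection is assumed for simplicity.

```swift
import Foundation

// Hypothetical stand-ins for illustration; these are not WhisperKit's actual types.
enum FallbackReason: Equatable {
    case firstTokenLogProbBelowThreshold
}

struct SketchDecodingResult {
    var tokens: [Int]
    var fallback: FallbackReason?
}

/// Log-softmax value of one index, computed in a numerically stable way.
func logProb(of index: Int, in logits: [Float]) -> Float {
    let maxLogit = logits.max() ?? 0
    let sumExp = logits.reduce(Float(0)) { $0 + exp($1 - maxLogit) }
    return logits[index] - maxLogit - log(sumExp)
}

/// Greedy decoding over precomputed logits (an assumption made to keep the sketch small).
/// If the first token's log probability is below the threshold, the loop stops early
/// and the result carries a fallback reason.
func decodeSketch(
    logitsPerStep: [[Float]],
    firstTokenLogProbThreshold: Float?
) -> SketchDecodingResult {
    var tokens: [Int] = []
    for (step, logits) in logitsPerStep.enumerated() {
        // Pick the most likely token for this step.
        guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }) else { break }

        // Only the first step is checked against the threshold.
        if step == 0,
           let threshold = firstTokenLogProbThreshold,
           logProb(of: best, in: logits) < threshold {
            return SketchDecodingResult(tokens: tokens, fallback: .firstTokenLogProbBelowThreshold)
        }
        tokens.append(best)
    }
    return SketchDecodingResult(tokens: tokens, fallback: nil)
}
```

In this sketch, a first token with a log probability of about -0.79 passes a threshold of -1.0 but triggers the early exit at -0.7, which mirrors the difference between the two runs below.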

I did some testing, and it looks like the best value for first-token-log-prob-threshold is -0.7. I tested it using the audio from https://github.com/openai/whisper/discussions/1873

To replicate the results, you can run the following commands:

swift run whisperkit-cli transcribe --model-path openai_whisper-large-v3 --audio-path 3min-kitchen-audio.mp3 --temperature-fallback-count 0 --first-token-log-prob-threshold=-1.0 --verbose

and

swift run whisperkit-cli transcribe --model-path openai_whisper-large-v3 --audio-path 3min-kitchen-audio.mp3 --temperature-fallback-count 0 --first-token-log-prob-threshold=-0.7 --verbose

Setting temperature-fallback-count to 0 is important here. If it is not provided, the decoder will try to sample again with a higher temperature whenever the first token is not a valid one.
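
To illustrate why a fallback count of 0 isolates the threshold check, here is a rough sketch of a retry loop that raises the temperature after each failed attempt. The function names, the increment parameter, and the loop structure are assumptions made for illustration, not WhisperKit's implementation.

```swift
// Hypothetical sketch: names and structure are assumptions, not WhisperKit's API.

/// Builds the list of temperatures to try: the starting temperature plus one
/// extra entry per allowed fallback.
func temperatures(start: Double, incrementOnFallback: Double, fallbackCount: Int) -> [Double] {
    (0...max(fallbackCount, 0)).map { start + Double($0) * incrementOnFallback }
}

/// Runs attempts in order and stops at the first one that does not need a fallback.
/// Returns how many attempts were made.
func attemptsUsed(schedule: [Double], needsFallback: (Double) -> Bool) -> Int {
    var attempts = 0
    for temperature in schedule {
        attempts += 1
        if !needsFallback(temperature) { break }
    }
    return attempts
}
```

With a fallback count of 0 the schedule holds a single temperature, so a window rejected by the first-token check is not retried, and the effect of the threshold shows up directly in the output.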

Running the 1st one will give you the following (or similar) output:

[WhisperKit] [0.00 --> 30.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [30.00 --> 34.00] <|startoftranscript|><|jw|><|translate|><|0.00|> Add the chicken and stir-fry for 2 minutes<|4.00|>
[WhisperKit] [34.00 --> 38.00] <|4.00|> Add the chicken and stir-fry for 2 minutes<|8.00|>
[WhisperKit] [38.00 --> 42.00] <|8.00|> Add the chicken and stir-fry for 2 minutes<|12.00|>
[WhisperKit] [42.00 --> 46.00] <|12.00|> Add the chicken and stir-fry for 2 minutes<|16.00|>
[WhisperKit] [46.00 --> 50.00] <|16.00|> Add the chicken and stir-fry for 2 minutes<|20.00|>
[WhisperKit] [50.00 --> 54.00] <|20.00|> Add the chicken and stir-fry for 2 minutes<|24.00|>
[WhisperKit] [54.00 --> 58.00] <|24.00|> Add the chicken and stir-fry for 2 minutes<|28.00|>
[WhisperKit] [58.00 --> 88.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [88.00 --> 118.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [118.00 --> 148.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [148.00 --> 178.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [178.00 --> 180.00] <|startoftranscript|><|endoftext|>

Running the 2nd one will give you the following (or similar) output:

[WhisperKit] [0.00 --> 30.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [30.00 --> 60.00] <|startoftranscript|><|endoftext|>
[WhisperKit] [60.00 --> 90.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [90.00 --> 120.00] <|startoftranscript|><|endoftext|>
[WhisperKit] [120.00 --> 150.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [150.00 --> 180.00] <|startoftranscript|><|nospeech|><|endoftext|>

ZachNagengast commented 6 months ago

@jkrukowski I think the -1.0 default is fine unless explicitly set. Before I review, can you write a brief paragraph about how this logic is intended to work?

atiorh commented 6 months ago

Great work, @jkrukowski!

argmaxinc / WhisperKit

First token logProb thresholding #90