Added a new parameter `firstTokenLogProbThreshold` to `DecodingOptions` and `CLIArguments`.
Added a condition in the `TextDecoder` `decodeText` method that checks whether the first token's log prob is less than the threshold. If so, it breaks the decoding loop and returns the `DecodingResult` together with a `DecodingFallback` property.
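To illustrate the intended logic, here is a minimal sketch of the first-token check. It is not the code from this PR: `ResultSketch`, `FallbackSketch`, `decodeSketch`, the token ids, and the log prob values are all hypothetical stand-ins, and the real `decodeText` samples tokens from the model instead of reading them from a list.

```swift
// Hypothetical stand-ins for illustration; these are not WhisperKit's actual types.
struct FallbackSketch: Equatable {
    let reason: String
}

struct ResultSketch: Equatable {
    var tokens: [Int] = []
    var fallback: FallbackSketch? = nil
}

/// Walks over precomputed (token, logProb) steps until `endToken` is seen.
/// If a threshold is set and the first token's log prob is below it,
/// the loop exits immediately and the result carries a fallback reason.
func decodeSketch(
    steps: [(token: Int, logProb: Float)],
    endToken: Int,
    firstTokenLogProbThreshold: Float?
) -> ResultSketch {
    var result = ResultSketch()
    for (index, step) in steps.enumerated() {
        // Only the first sampled token is compared against the threshold.
        if index == 0,
           let threshold = firstTokenLogProbThreshold,
           step.logProb < threshold {
            result.fallback = FallbackSketch(reason: "firstTokenLogProbThreshold")
            break
        }
        result.tokens.append(step.token)
        if step.token == endToken { break }
    }
    return result
}

// Made-up values: the first token has a log prob of -0.9.
let steps: [(token: Int, logProb: Float)] = [(11, -0.9), (12, -0.1), (99, -0.05)]
let lenient = decodeSketch(steps: steps, endToken: 99, firstTokenLogProbThreshold: -1.0)
let strict = decodeSketch(steps: steps, endToken: 99, firstTokenLogProbThreshold: -0.7)

print(lenient.tokens, lenient.fallback == nil)           // [11, 12, 99] true
print(strict.tokens, strict.fallback?.reason ?? "none")  // [] firstTokenLogProbThreshold
```

With these made-up numbers, a threshold of -1.0 lets the window through while -0.7 stops it at the first token, which is the kind of difference the two commands are meant to show.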
To replicate the results, you can run the following commands:
swift run whisperkit-cli transcribe --model-path openai_whisper-large-v3 --audio-path 3min-kitchen-audio.mp3 --temperature-fallback-count 0 --first-token-log-prob-threshold=-1.0 --verbose
and
swift run whisperkit-cli transcribe --model-path openai_whisper-large-v3 --audio-path 3min-kitchen-audio.mp3 --temperature-fallback-count 0 --first-token-log-prob-threshold=-0.7 --verbose
Setting the `temperature-fallback-count` param to 0 is important here. If it is not provided, the decoder will try to sample again with a higher temperature if the first token is not a valid one.
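As a rough sketch of why the fallback count matters when replicating, the retry behavior can be modeled as a loop over temperatures. This is an assumption about the general shape of temperature fallback, not WhisperKit's code: `temperaturesTried`, `needsFallback`, and the 0.2 increment are hypothetical.

```swift
/// Hypothetical sketch: returns the temperatures tried for one audio window.
/// `needsFallback` stands in for a decoding pass that reports whether it
/// asked for a fallback (for example because the first-token check failed).
func temperaturesTried(
    startTemperature: Float,
    increment: Float,
    fallbackCount: Int,
    needsFallback: (Float) -> Bool
) -> [Float] {
    var tried: [Float] = []
    // One initial attempt plus up to `fallbackCount` retries.
    for attempt in 0...max(0, fallbackCount) {
        let temperature = startTemperature + Float(attempt) * increment
        tried.append(temperature)
        if !needsFallback(temperature) { break }
    }
    return tried
}

// A window whose first token fails the threshold check on every attempt.
let alwaysFails: (Float) -> Bool = { _ in true }
let withoutRetries = temperaturesTried(
    startTemperature: 0, increment: 0.2, fallbackCount: 0, needsFallback: alwaysFails)
let withRetries = temperaturesTried(
    startTemperature: 0, increment: 0.2, fallbackCount: 2, needsFallback: alwaysFails)

print(withoutRetries.count) // 1
print(withRetries.count)    // 3
```

With a count of 0 the window is decoded exactly once, so the effect of the threshold shows up directly in the verbose output.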
Running the 1st one will give you the following (or similar) output:
[WhisperKit] [0.00 --> 30.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [30.00 --> 34.00] <|startoftranscript|><|jw|><|translate|><|0.00|> Add the chicken and stir-fry for 2 minutes<|4.00|>
[WhisperKit] [34.00 --> 38.00] <|4.00|> Add the chicken and stir-fry for 2 minutes<|8.00|>
[WhisperKit] [38.00 --> 42.00] <|8.00|> Add the chicken and stir-fry for 2 minutes<|12.00|>
[WhisperKit] [42.00 --> 46.00] <|12.00|> Add the chicken and stir-fry for 2 minutes<|16.00|>
[WhisperKit] [46.00 --> 50.00] <|16.00|> Add the chicken and stir-fry for 2 minutes<|20.00|>
[WhisperKit] [50.00 --> 54.00] <|20.00|> Add the chicken and stir-fry for 2 minutes<|24.00|>
[WhisperKit] [54.00 --> 58.00] <|24.00|> Add the chicken and stir-fry for 2 minutes<|28.00|>
[WhisperKit] [58.00 --> 88.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [88.00 --> 118.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [118.00 --> 148.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [148.00 --> 178.00] <|startoftranscript|><|nospeech|><|endoftext|>
[WhisperKit] [178.00 --> 180.00] <|startoftranscript|><|endoftext|>
Running the 2nd one will give you the following (or similar) output:
@jkrukowski I think the -1.0 default is fine unless explicitly set. Before reviewing, can you write a brief paragraph about how this logic is intended to work?
Should resolve https://github.com/argmaxinc/WhisperKit/issues/37
I did some testing and it looks like the best value for `first-token-log-prob-threshold` is -0.7. I tested it using audio taken from https://github.com/openai/whisper/discussions/1873