The current implementation for streaming decoding is based on this streaming tutorial: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/Streaming_ASR.ipynb. It appears to use greedy decoding, extracting logits from every chunk.
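To make that concrete, here is a minimal NumPy-only sketch of greedy CTC decoding over per-chunk logits, with the per-frame max probability used as a naive confidence. The blank index, vocabulary size, and the choice of confidence aggregation are illustrative assumptions, not what the tutorial actually does internally.

```python
import numpy as np

def greedy_ctc_decode(chunk_logits: np.ndarray, blank_id: int = 0):
    """chunk_logits: [time, vocab] raw logits for one audio chunk."""
    # Softmax over the vocabulary axis to get per-frame probabilities.
    probs = np.exp(chunk_logits - chunk_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    best_ids = probs.argmax(axis=-1)   # most likely token per frame
    best_probs = probs.max(axis=-1)    # its probability (a naive confidence)

    # Standard CTC collapse: drop repeated tokens, then drop blanks.
    tokens, confidences = [], []
    prev = None
    for tok, p in zip(best_ids, best_probs):
        if tok != prev and tok != blank_id:
            tokens.append(int(tok))
            # Confidence of the first frame of the run; other aggregations
            # (mean/min over the run) are equally plausible choices.
            confidences.append(float(p))
        prev = tok
    return tokens, confidences
```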
Transcription with confidence scores, by contrast, relies on the model.transcribe method and the configs that are passed to it.
Buffered CTC inference with config: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/ctc/speech_to_text_buffered_infer_ctc.py
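Below is a hedged sketch of how confidence output could be requested through model.transcribe. The field names (confidence_cfg, preserve_word_confidence, word_confidence) and the pretrained model name follow NeMo's confidence-estimation docs as I understand them, but they may differ across NeMo versions, so treat this as an assumption rather than a verified recipe.

```python
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

# Any CTC model works for the sketch; the name is just an example checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")

# Ask the decoder to keep frame- and word-level confidence alongside the text.
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.confidence_cfg.preserve_frame_confidence = True
    decoding_cfg.confidence_cfg.preserve_word_confidence = True
model.change_decoding_strategy(decoding_cfg)

# return_hypotheses=True exposes the confidence fields on each hypothesis.
hyps = model.transcribe(["sample.wav"], return_hypotheses=True)
print(hyps[0].text)
print(hyps[0].word_confidence)
```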
Some previous digging into WhisperX is summarized below. I also looked into confidence score estimation.
At the most basic level, WhisperX produces character-level output using wav2vec2 models, while the transcription itself is obtained from Whisper. This character-level output shows up as a parallel path called "phoneme models" in this diagram. These aren't quite phonemes but letters; typically there are something like 29 characters (a-z plus special characters like ' and |). For each time step, a 29-dimensional vector of log-probabilities is taken from this wav2vec2 model; the corresponding 29 probabilities sum to 1.

The transcription from Whisper is then aligned to these frame-level character outputs using dynamic time warping (this is given in some detail here). This is essentially what yields accurate word-level timestamps. Each character in a given word has a probability coming from the wav2vec2 "phoneme"/character model. They then do something cheeky to get the word-level confidence - they just average all the character-level probabilities within that time window in this line.
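A self-contained NumPy sketch of that averaging trick is below: given the frame-level character log-probabilities and a word's character-to-frame alignment (which WhisperX gets from DTW), the word confidence is just the mean of the per-character probabilities. The shapes, character indices, and toy alignment are made up for illustration; this is not WhisperX's actual code.

```python
import numpy as np

def word_confidence(frame_log_probs: np.ndarray, char_alignment: list) -> float:
    """
    frame_log_probs: [num_frames, num_chars] log-probabilities from the
                     character-level (wav2vec2) model; each row's
                     exponentiated values sum to 1.
    char_alignment:  list of (frame_index, char_index) pairs for the
                     characters of one word, as produced by the alignment.
    """
    char_probs = [np.exp(frame_log_probs[t, c]) for t, c in char_alignment]
    # WhisperX-style aggregation: a plain mean over the word's characters.
    return float(np.mean(char_probs))

# Toy usage: a 3-character word aligned to frames 10, 11, and 13.
rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 29))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(word_confidence(log_probs, [(10, 3), (11, 0), (13, 19)]))
```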
Now you might ask: is that right? I do not think so. There is literature explaining why confidence score estimation is a hard problem in its own right. Here are a few references: