The current implementation for streaming decoding is based on this streaming tutorial: https://colab.research.google.com/github/NVIDIA/NeMo/blob/main/tutorials/asr/Streaming_ASR.ipynb. It appears to use greedy decoding, extracting logits from every chunk.
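To make that concrete, here is a minimal NumPy-only sketch of greedy CTC decoding over per-chunk logits, with the per-frame max probability used as a naive confidence. The blank index, vocabulary size, and the choice of confidence aggregation are illustrative assumptions, not what the tutorial actually does internally.

```python
import numpy as np

def greedy_ctc_decode(chunk_logits: np.ndarray, blank_id: int = 0):
    """chunk_logits: [time, vocab] raw logits for one audio chunk."""
    # Softmax over the vocabulary axis to get per-frame probabilities.
    probs = np.exp(chunk_logits - chunk_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    best_ids = probs.argmax(axis=-1)   # most likely token per frame
    best_probs = probs.max(axis=-1)    # its probability (a naive confidence)

    # Standard CTC collapse: drop repeated tokens, then drop blanks.
    tokens, confidences = [], []
    prev = None
    for tok, p in zip(best_ids, best_probs):
        if tok != prev and tok != blank_id:
            tokens.append(int(tok))
            # Confidence of the first frame of the run; other aggregations
            # (mean/min over the run) are equally plausible choices.
            confidences.append(float(p))
        prev = tok
    return tokens, confidences
```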
Transcription with confidence scores, by contrast, relies on the model.transcribe method and the configs that are passed to it.
Buffered CTC inference with config: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/ctc/speech_to_text_buffered_infer_ctc.py
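Below is a hedged sketch of how confidence output could be requested through model.transcribe. The field names (confidence_cfg, preserve_word_confidence, word_confidence) and the pretrained model name follow NeMo's confidence-estimation docs as I understand them, but they may differ across NeMo versions, so treat this as an assumption rather than a verified recipe.

```python
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

# Any CTC model works for the sketch; the name is just an example checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")

# Ask the decoder to keep frame- and word-level confidence alongside the text.
decoding_cfg = model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.confidence_cfg.preserve_frame_confidence = True
    decoding_cfg.confidence_cfg.preserve_word_confidence = True
model.change_decoding_strategy(decoding_cfg)

# return_hypotheses=True exposes the confidence fields on each hypothesis.
hyps = model.transcribe(["sample.wav"], return_hypotheses=True)
print(hyps[0].text)
print(hyps[0].word_confidence)
```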
Some previous digging into WhisperX is summarized below. I also looked into confidence score estimation.
At the most basic level, WhisperX produces character-level output using wav2vec2 models, while the transcription itself is obtained from Whisper. This character-level output shows up as a parallel path called "phoneme models" in this diagram. These aren't quite phonemes but letters; typically there are something like 29 characters (a-z plus special characters like ' and |). For each time step, a 29-dimensional vector of log-probabilities is taken from this wav2vec2 model; the corresponding 29 probabilities sum to 1.

The transcription from Whisper is then aligned to these frame-level character outputs using dynamic time warping (this is given in some detail here). This is essentially what yields accurate word-level timestamps. Each character in a given word has a probability coming from the wav2vec2 "phoneme"/character model. They then do something cheeky to get the word-level confidence - they just average all the character-level probabilities within that time window in this line.
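A self-contained NumPy sketch of that averaging trick is below: given the frame-level character log-probabilities and a word's character-to-frame alignment (which WhisperX gets from DTW), the word confidence is just the mean of the per-character probabilities. The shapes, character indices, and toy alignment are made up for illustration; this is not WhisperX's actual code.

```python
import numpy as np

def word_confidence(frame_log_probs: np.ndarray, char_alignment: list) -> float:
    """
    frame_log_probs: [num_frames, num_chars] log-probabilities from the
                     character-level (wav2vec2) model; each row's
                     exponentiated values sum to 1.
    char_alignment:  list of (frame_index, char_index) pairs for the
                     characters of one word, as produced by the alignment.
    """
    char_probs = [np.exp(frame_log_probs[t, c]) for t, c in char_alignment]
    # WhisperX-style aggregation: a plain mean over the word's characters.
    return float(np.mean(char_probs))

# Toy usage: a 3-character word aligned to frames 10, 11, and 13.
rng = np.random.default_rng(0)
logits = rng.normal(size=(20, 29))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(word_confidence(log_probs, [(10, 3), (11, 0), (13, 19)]))
```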
Now you might ask: is that right? I do not think so. There is literature explaining why confidence score estimation is a hard problem in its own right. Here are a few references: