What does this PR do?

Fixes #3371 and extends #3627 to include the ability to return the frame numbers of all non-blank characters of a hypothesis for all wav2letter decoder classes, not just for W2lKenLMDecoder. A method called get_symbols() was also added to the parent class of all the decoders (W2lDecoder) so that the non-blank characters of a hypothesis can be returned as a list of natural-language characters rather than just token IDs. This helps in finding the word-boundary tokens later when calculating the word-level timestamp information using the following formula:

timestamp = frame_num * (audio_len / (num_frames * sample_rate))

where:

frame_num = the timestep of the symbol, as returned in the 'timesteps' field of W2lDecoder.decode() outputs.
audio_len = the number of samples in the loaded audio file corresponding to the transcript (if using batched w2v2 acoustic model inference, this is zero-padded to the length of the longest loaded audio file in the batch).
num_frames = the number of frames in the emission matrix returned by w2v2 acoustic model inference for that audio file (with batched inference, the number of frames is the same for every file, since all loaded audio files are padded to the length of the longest audio file in the batch).
sample_rate = the sample rate of the loaded audio file.

Before submitting

PR review

@alexeib
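For illustration, the timestamp formula and the word-boundary grouping it enables can be sketched in plain Python. This is a hypothetical sketch, not the fairseq API: the parallel symbols/timesteps lists, the "|" word-boundary symbol, and both function names are assumptions chosen for the example.

```python
def frame_to_seconds(frame_num, audio_len, num_frames, sample_rate):
    # The formula from the description:
    # timestamp = frame_num * (audio_len / (num_frames * sample_rate))
    return frame_num * (audio_len / (num_frames * sample_rate))


def word_timestamps(symbols, timesteps, audio_len, num_frames, sample_rate=16000):
    """Group non-blank symbols into words at '|' boundaries (an assumed
    word-boundary token) and return (word, start_seconds, end_seconds)
    tuples. `symbols` and `timesteps` are assumed to be parallel lists,
    e.g. from get_symbols() and the 'timesteps' field of decode() outputs."""
    words = []
    chars, frames = [], []
    for sym, ts in zip(symbols, timesteps):
        if sym == "|":  # word boundary: flush the accumulated word, if any
            if chars:
                words.append((
                    "".join(chars),
                    frame_to_seconds(frames[0], audio_len, num_frames, sample_rate),
                    frame_to_seconds(frames[-1], audio_len, num_frames, sample_rate),
                ))
            chars, frames = [], []
        else:
            chars.append(sym)
            frames.append(ts)
    if chars:  # flush the final word
        words.append((
            "".join(chars),
            frame_to_seconds(frames[0], audio_len, num_frames, sample_rate),
            frame_to_seconds(frames[-1], audio_len, num_frames, sample_rate),
        ))
    return words
```

With a 2-second file (audio_len = 32000 samples at 16 kHz) and num_frames = 100, each frame spans 0.02 s, so a symbol at timestep 5 maps to 0.1 s.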