facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Enabling word-level timestamps for all W2L Decoders #5403

Open abarcovschi opened 11 months ago

abarcovschi commented 11 months ago

Before submitting

What does this PR do?

Fixes #3371 and extends #3627 so that every wav2letter decoder class, not only W2lKenLMDecoder, can return the frame numbers of all non-blank characters in a hypothesis. A get_symbols() method was also added to the decoders' parent class (W2lDecoder) so that the non-blank characters of the hypothesis can be returned as a list of natural-language characters rather than only as token ids. This makes it easier to locate the word-boundary tokens later, when computing word-level timestamps with the following formula:

timestamp = frame_num * (audio_len / (num_frames * sample_rate))

where:
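The formula can be illustrated with a minimal Python sketch. It is not the code from this PR: the function names, the argument layout, and the use of "|" as the word-boundary symbol are assumptions made for illustration. It also assumes that audio_len is the number of audio samples, num_frames is the number of emission frames, sample_rate is in Hz, and that symbols and frames are parallel lists of non-blank characters and their frame numbers.

```python
from typing import List, Tuple


def frame_to_seconds(frame_num: int, audio_len: int,
                     num_frames: int, sample_rate: int) -> float:
    """Apply the formula above: frame index -> time in seconds.

    Assumes audio_len is in samples, so audio_len / sample_rate is the
    clip duration and dividing by num_frames gives seconds per frame.
    """
    return frame_num * (audio_len / (num_frames * sample_rate))


def word_timestamps(symbols: List[str], frames: List[int], audio_len: int,
                    num_frames: int, sample_rate: int,
                    boundary: str = "|") -> List[Tuple[str, float, float]]:
    """Group characters into words and return (word, start_s, end_s).

    `boundary` is a hypothetical default for the word-boundary symbol.
    The end time is the time of the word's last character frame, which
    is only an approximation of where the word actually ends.
    """
    def to_sec(frame: int) -> float:
        return frame_to_seconds(frame, audio_len, num_frames, sample_rate)

    words: List[Tuple[str, float, float]] = []
    chars: List[str] = []
    start = last = 0
    for sym, frame in zip(symbols, frames):
        if sym == boundary:
            # A boundary closes the current word, if there is one.
            if chars:
                words.append(("".join(chars), to_sec(start), to_sec(last)))
                chars = []
            continue
        if not chars:
            start = frame  # first character of a new word
        chars.append(sym)
        last = frame
    if chars:
        # Flush the final word when no trailing boundary is present.
        words.append(("".join(chars), to_sec(start), to_sec(last)))
    return words


if __name__ == "__main__":
    # One second of 16 kHz audio with 50 emission frames (made-up values).
    for word, start_s, end_s in word_timestamps(
            ["H", "I", "|", "Y", "O"], [5, 6, 10, 20, 22],
            audio_len=16000, num_frames=50, sample_rate=16000):
        print(f"{word}: {start_s:.2f}s to {end_s:.2f}s")
```

In this sketch each frame covers audio_len / (num_frames * sample_rate) seconds, so a word's start time is simply the scaled frame number of its first character.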

PR review

@alexeib