NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Failed to generate timestamp for nvidia/parakeet-tdt-1.1b #8451

Closed leohuang2013 closed 3 months ago

leohuang2013 commented 6 months ago

Describe the bug

When I tried to generate timestamps with the model nvidia/parakeet-tdt-1.1b, I got the following error: ValueError: char_offsets: [{'char': [tensor(607, dtype=torch.int32)], 'start_offset': 28, 'end_offset': 29}....

Call stack:

Traceback (most recent call last):
  File "/tmp/inference/nvidia_asr.py", line 103, in <module>
    main()
  File "/tmp/inference/nvidia_asr.py", line 94, in main
    tt = parakeet_rnnt( audio, 'tdt' )
  File "/tmp/inference/nvidia_asr.py", line 45, in parakeet_rnnt
    hypothesis = asr_model.transcribe([audio], return_hypotheses=True)[0][0]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/rnnt_models.py", line 298, in transcribe
    best_hyp, all_hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/metrics/rnnt_wer.py", line 497, in rnnt_decoder_predictions_tensor
    hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/metrics/rnnt_wer.py", line 699, in compute_rnnt_timestamps
    raise ValueError(

Steps/Code to reproduce bug

The code below reproduces the bug. (The same code does produce timestamps when used with the parakeet rnnt-1.1b model.)

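import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

# `audio` is the input passed to transcribe(); it is defined elsewhere in the script.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-1.1b")

# Enable alignment preservation and word-level timestamps in the decoding config.
decoding_cfg = asr_model.cfg.decoding
with open_dict(decoding_cfg):
    decoding_cfg.preserve_alignments = True
    decoding_cfg.compute_timestamps = True
    decoding_cfg.rnnt_timestamp_type = 'word'
asr_model.change_decoding_strategy(decoding_cfg)

# The ValueError is raised inside transcribe() for the TDT model.
hypothesis = asr_model.transcribe([audio], return_hypotheses=True)[0][0]
timestamp_dict = hypothesis.timestep
word_timestamps = timestamp_dict['word']
print(word_timestamps)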

Expected behavior

It should output word timestamps instead of raising an exception.

Additional context

GPU model: GTX 1080T

isaac-mcfadyen commented 6 months ago

Also seeing this error.

My temporary workaround is to catch the ValueError and append a second or two of blank audio to the end of the file before re-processing. This seems to work as a stop-gap until the bug is fixed.
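
For anyone who wants to try the same stop-gap, here is a minimal sketch of it. It is not code from this thread. The helper names `append_silence` and `transcribe_with_padding` are hypothetical, it assumes the input is a 16-bit signed PCM WAV file, and it uses only the Python standard library so the transcription call is passed in as a function.

```python
import wave


def append_silence(src_path, dst_path, seconds=1.0):
    """Copy a PCM WAV file to dst_path with `seconds` of silence appended.

    Assumption: 16-bit (or wider) signed PCM, where zero bytes are silence.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    # Number of padding frames, then bytes per frame = sample width * channels.
    n_pad = int(seconds * params.framerate)
    silence = b"\x00" * (n_pad * params.sampwidth * params.nchannels)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        # The wave module patches the frame count in the header on close.
        dst.writeframes(frames + silence)
    return dst_path


def transcribe_with_padding(transcribe_fn, audio_path, padded_path, seconds=1.0):
    """Call transcribe_fn(audio_path); on ValueError, retry once on a padded copy."""
    try:
        return transcribe_fn(audio_path)
    except ValueError:
        padded = append_silence(audio_path, padded_path, seconds)
        return transcribe_fn(padded)
```

With the reproduction code from the original report, `transcribe_fn` could be something like `lambda p: asr_model.transcribe([p], return_hypotheses=True)[0][0]`. The retry happens only once, so a second ValueError on the padded file still propagates.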

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

bradmurray-dt commented 5 months ago

Adding comment to prevent this issue from closing.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 3 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

anshulwadhawan commented 2 months ago

@bradmurray-dt Still facing this issue with 'parakeet-tdt-1.1b' and 'parakeet-tdt-ctc-1.1b':

  File "/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 510, in rnnt_decoder_predictions_tensor
    hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
  File "/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 753, in compute_rnnt_timestamps
    raise ValueError(
ValueError: `char_offsets`: [{'char': [tensor(386, dtype=torch.int32)], 'start_offset': 2, 'end_offset': 3},.....
have to be of the same length, but are: `len(offsets)`: 102 and `len(processed_tokens)`: 103
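
To make the message above concrete: the decoder ended up with 102 offsets but 103 processed tokens, and the timestamp step refuses to pair them up. The snippet below is only an illustrative sketch of that kind of length check, not NeMo's actual implementation. The function name `check_same_length` is hypothetical, and the inputs are dummy lists with the lengths reported in the error.

```python
def check_same_length(offsets, processed_tokens):
    """Raise ValueError when offsets and tokens cannot be paired one-to-one."""
    if len(offsets) != len(processed_tokens):
        raise ValueError(
            "`char_offsets` have to be of the same length, but are: "
            f"`len(offsets)`: {len(offsets)} and "
            f"`len(processed_tokens)`: {len(processed_tokens)}"
        )


# Dummy data with the lengths from the report: one token has no offset entry.
offsets = [{"start_offset": i, "end_offset": i + 1} for i in range(102)]
processed_tokens = list(range(103))

try:
    check_same_length(offsets, processed_tokens)
except ValueError as err:
    message = str(err)
    print(message)
```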