NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Internal error when running model.transcribe() on FastConformer-Hybrid-Transducer-CTC-BPE model. #9598

Closed MarcisTU closed 1 month ago

MarcisTU commented 4 months ago

Describe the bug

After following the installation instructions for Linux, model.transcribe() fails with a TypeError: a torch tensor that is still on the GPU (cuda:0) is passed to NumPy without first being copied to the CPU.

Transcribing:   0%|                                                                                                                          | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/root/inference_fastconformer_stt_lv.py", line 29, in <module>
    hypotheses = asr_model.transcribe(input_wav, return_hypotheses=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py", line 138, in transcribe
    return super().transcribe(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 279, in transcribe
    return super().transcribe(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 274, in transcribe
    for processed_outputs in generator:
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 385, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py", line 179, in _transcribe_output_processing
    return super()._transcribe_output_processing(outputs, trcfg)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 908, in _transcribe_output_processing
    best_hyp, all_hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 510, in rnnt_decoder_predictions_tensor
    hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 741, in compute_rnnt_timestamps
    char_offsets = self._compute_offsets(hypothesis, token_repetitions, self.blank_id)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 851, in _compute_offsets
    start_indices = np.concatenate(([start_index], end_indices[:-1]))
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 1062, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
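The last frames of the traceback show np.concatenate receiving at least one operand that is still a CUDA tensor. Below is a minimal sketch of the kind of guard that avoids this; it is not NeMo's actual fix. The helper names to_host and compute_start_indices are hypothetical, and the sketch assumes the operands are either plain Python values or tensor-like objects exposing detach(), cpu(), and tolist().

```python
def to_host(value):
    """Return a plain Python number or list for a value that may be a torch tensor.

    Hypothetical helper: duck-types on detach()/cpu() so it also accepts
    plain ints, lists, and NumPy arrays without importing torch.
    """
    if hasattr(value, "detach") and hasattr(value, "cpu"):
        value = value.detach().cpu()  # copy a device tensor to host memory
    if hasattr(value, "tolist"):
        return value.tolist()  # tensor or ndarray -> Python int or list
    return value


def compute_start_indices(start_index, end_indices):
    """Mirror np.concatenate(([start_index], end_indices[:-1])) on host values."""
    ends = list(to_host(end_indices))
    return [to_host(start_index)] + ends[:-1]


print(compute_start_indices(0, [3, 7, 12]))  # [0, 3, 7]
```

Converting both operands before they reach NumPy keeps the logic independent of which device the decoder left them on.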

Steps/Code to reproduce bug

Code to reproduce:

import copy
import torch
import librosa
from omegaconf import OmegaConf, open_dict
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

if __name__ == "__main__":
    asr_model_path = "./nemo_experiments/FastConformer-Hybrid-Transducer-CTC-BPE/2024-07-02_09-14-19/checkpoints/FastConformer-Hybrid-Transducer-CTC-BPE.nemo"
    asr_model = EncDecHybridRNNTCTCBPEModel.restore_from(restore_path=asr_model_path)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    audio_path = "./test_data/2321187.wav"
    input_wav, sr = librosa.load(audio_path, sr=16000)

    decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
    with open_dict(decoding_cfg):
        decoding_cfg.preserve_alignments = True
        decoding_cfg.compute_timestamps = True
        asr_model.change_decoding_strategy(decoding_cfg)

    # specify flag `return_hypotheses=True`
    hypotheses = asr_model.transcribe(input_wav, return_hypotheses=True)

    # if hypotheses form a tuple (from RNNT), extract just "best" hypotheses
    if isinstance(hypotheses, tuple) and len(hypotheses) == 2:
        hypotheses = hypotheses[0]

    timestamp_dict = hypotheses[0].timestep  # extract timesteps from hypothesis of first (and only) audio file
    print("Hypothesis contains following timestep information :", list(timestamp_dict.keys()))

    # For a FastConformer model, you can display the word timestamps as follows:
    # 80ms is duration of a timestep at output of the Conformer
    time_stride = 8 * asr_model.cfg.preprocessor.window_stride

    word_timestamps = timestamp_dict['word']

    for stamp in word_timestamps:
        start = stamp['start_offset'] * time_stride
        end = stamp['end_offset'] * time_stride
        word = stamp['char'] if 'char' in stamp else stamp['word']

        print(f"Time : {start:0.2f} - {end:0.2f} - {word}")
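The offset-to-seconds conversion at the end of the script can be sketched on its own, without a model. The function name offsets_to_seconds is hypothetical, and the default window_stride of 0.01 s is an assumption chosen to match the "80ms" comment above (8 x 0.01 s); the real value comes from asr_model.cfg.preprocessor.window_stride.

```python
def offsets_to_seconds(word_offsets, window_stride=0.01, subsampling_factor=8):
    """Convert encoder-frame offsets to seconds.

    word_offsets: list of dicts with 'start_offset', 'end_offset', and
    either a 'word' or a 'char' key, as in the script above.
    window_stride: assumed preprocessor stride in seconds (hypothetical default).
    subsampling_factor: encoder downsampling, 8 as in the script above.
    """
    time_stride = subsampling_factor * window_stride
    rows = []
    for stamp in word_offsets:
        word = stamp["char"] if "char" in stamp else stamp["word"]
        start = stamp["start_offset"] * time_stride
        end = stamp["end_offset"] * time_stride
        rows.append((word, start, end))
    return rows


# Made-up offsets, for illustration only
for word, start, end in offsets_to_seconds(
    [{"word": "hello", "start_offset": 5, "end_offset": 10}]
):
    print(f"Time : {start:0.2f} - {end:0.2f} - {word}")
```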

Expected behavior

The inference code runs to completion and returns the transcription result.

Environment overview (please complete the following information)

Environment details

If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:

Additional context

Also tried installing in WSL2 on Windows, but got the same bug.

nithinraok commented 3 months ago

@KunalDhawan please take a look at this issue.

KunalDhawan commented 3 months ago

Hi @MarcisTU, Thank you for the detailed description! Could you please help me reproduce the issue?

I tried to replicate the issue on my end, but I was able to run the code snippet you shared above without any errors. Let me describe my replication setup in detail:

Environment: I built a fresh conda env using NeMo main

Reproducing the code:

>>> import copy
>>> import torch
>>> import librosa
>>> from omegaconf import OmegaConf, open_dict
>>> from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

>>> asr_model_path = "/models/stt_de_fastconformer_hybrid_large_pc.nemo"
>>> asr_model = EncDecHybridRNNTCTCBPEModel.restore_from(restore_path=asr_model_path)
.....
[NeMo I 2024-08-01 17:36:37 save_restore_connector:275] Model EncDecHybridRNNTCTCBPEModel was successfully restored from /models/stt_de_fastconformer_hybrid_large_pc.nemo.

>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> device
device(type='cuda')

>>> audio_path = "/data/ASR/en/librispeech/wav/test-clean/61-70968-0000.wav"
>>> input_wav, sr = librosa.load(audio_path, sr=16000)

>>> decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
>>> with open_dict(decoding_cfg):
...     decoding_cfg.preserve_alignments = True
...     decoding_cfg.compute_timestamps = True
...     asr_model.change_decoding_strategy(decoding_cfg)

[NeMo I 2024-08-01 17:38:17 rnnt_models:224] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2024-08-01 17:38:17 hybrid_rnnt_ctc_bpe_models:457] Changed decoding strategy of the RNNT decoder to 
    model_type: rnnt
    strategy: greedy_batch
    compute_hypothesis_token_set: false
    preserve_alignments: true
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      tdt_include_duration: false
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    fused_batch_size: null
    compute_timestamps: true
    compute_langs: false
    word_seperator: ' '
    rnnt_timestamp_type: all
    greedy:
      max_symbols_per_step: 10
      preserve_alignments: false
      preserve_frame_confidence: false
      tdt_include_duration_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
      loop_labels: true
      use_cuda_graph_decoder: true
      max_symbols: 10
    beam:
      beam_size: 2
      search_type: default
      score_norm: true
      return_best_hypothesis: false
      tsd_max_sym_exp_per_step: 50
      alsd_max_target_len: 2.0
      nsc_max_timesteps_expansion: 1
      nsc_prefix_alpha: 1
      maes_num_steps: 2
      maes_prefix_alpha: 1
      maes_expansion_gamma: 2.3
      maes_expansion_beta: 2
      language_model: null
      softmax_temperature: 1.0
      preserve_alignments: false
      ngram_lm_model: null
      ngram_lm_alpha: 0.0
      hat_subtract_ilm: false
      hat_ilm_weight: 0.0
      tsd_max_sym_exp: 50
    temperature: 1.0
    durations: []
    big_blank_durations: []

>>> hypotheses = asr_model.transcribe(input_wav, return_hypotheses=True)
Transcribing: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]

>>> hypotheses
([Hypothesis(score=-26.476362228393555, y_sequence=tensor([   5, 1004,  354,    6,    5,   91,  255,  255,   38,   21,  166,    8,
           5,  255,   22,   15,   90,   28,  621,    5,   91,  354,   38,   16,
           5,    8,  121,    5,  153,   43,   43,    9,    8,   50,   22,    5,
         482,    3,    8,  236,  199,   22,  114,    4,  551,   95,    8,   38,
          10,    5,    8,  121,    5,  247,  291,    2], device='cuda:0'), text='Ygan accinfust complant against the wizzertwo identisch boye Curtin und the loft .', dec_out=None, dec_state=(tensor([[[ 1.1207e-03,  2.4281e-03,  3.4973e-08, -7.2756e-01, -3.1125e-05,
          -3.0149e-04,  1.0904e-05, -9.7909e-06, -1.3529e-04,  5.1211e-05,
           2.7526e-04, -7.5015e-01, -7.4380e-01,  4.6821e-02,  3.1424e-10,
          -1.6009e-07,  7.6155e-01, -9.2109e-07,  6.8400e-05, -7.5959e-01,
           7.5943e-01, -7.5767e-01,  2.4640e-06,  2.4134e-05, -2.8190e-03,
           ........

I was able to transcribe with a FastConformer-Hybrid-Transducer-CTC-BPE model without any issues. Could you please share more details about your setup so that I can identify where it differs from my replication?
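One way to gather the details requested here is to print the installed package versions. This is only a sketch: the function name collect_versions is hypothetical, and the package list is an assumption (nemo_toolkit is the pip distribution name for NeMo). It relies on the standard library only, so it runs even when a package is missing.

```python
import platform
from importlib import metadata


def collect_versions(packages=("nemo_toolkit", "torch", "numpy")):
    """Return a dict of interpreter and package versions for a bug report."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```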

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 7 days since being marked as stale.