Internal error when running model.transcribe() on FastConformer-Hybrid-Transducer-CTC-BPE model.

MarcisTU commented 4 months ago

Describe the bug

After following the installation instructions for Linux, transcribe() fails because a torch tensor is still on the GPU (cuda:0) when NeMo tries to convert it to a NumPy array.

Transcribing:   0%|                                                                                                                          | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/root/inference_fastconformer_stt_lv.py", line 29, in <module>
    hypotheses = asr_model.transcribe(input_wav, return_hypotheses=True)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py", line 138, in transcribe
    return super().transcribe(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 279, in transcribe
    return super().transcribe(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 274, in transcribe
    for processed_outputs in generator:
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/mixins/transcription.py", line 385, in transcribe_generator
    processed_outputs = self._transcribe_output_processing(model_outputs, transcribe_cfg)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/hybrid_rnnt_ctc_models.py", line 179, in _transcribe_output_processing
    return super()._transcribe_output_processing(outputs, trcfg)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/models/rnnt_models.py", line 908, in _transcribe_output_processing
    best_hyp, all_hyp = self.decoding.rnnt_decoder_predictions_tensor(
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 510, in rnnt_decoder_predictions_tensor
    hypotheses[hyp_idx] = self.compute_rnnt_timestamps(hypotheses[hyp_idx], timestamp_type)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 741, in compute_rnnt_timestamps
    char_offsets = self._compute_offsets(hypothesis, token_repetitions, self.blank_id)
  File "/opt/conda/lib/python3.10/site-packages/nemo/collections/asr/parts/submodules/rnnt_decoding.py", line 851, in _compute_offsets
    start_indices = np.concatenate(([start_index], end_indices[:-1]))
  File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 1062, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
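
The last frames show NumPy calling `Tensor.__array__` on a tensor that still lives on `cuda:0`, which PyTorch refuses to convert. The sketch below illustrates that failure mode and the generic remedy the error message suggests (copy to host memory first). It is not NeMo's code or its eventual fix: `to_host_array` and `compute_start_indices` are hypothetical helpers, duck-typed so they run without PyTorch installed.

```python
import numpy as np


def to_host_array(value):
    """Return `value` as a NumPy array, copying device tensors to the host first.

    Hypothetical helper for illustration; not part of NeMo or PyTorch.
    """
    # torch.Tensor exposes detach() and cpu(); plain lists and ndarrays do not.
    if hasattr(value, "detach") and hasattr(value, "cpu"):
        value = value.detach().cpu()
    return np.asarray(value)


def compute_start_indices(start_index, end_indices):
    """Mirror the np.concatenate call from the traceback on host-side data."""
    start = to_host_array(start_index).reshape(1)  # scalar -> 1-element array
    ends = to_host_array(end_indices)
    return np.concatenate((start, ends[:-1]))
```

With a real CUDA tensor, `value.detach().cpu()` is the step that avoids the `TypeError`; for inputs that are already lists or arrays the helper is a no-op wrapper around `np.asarray`.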

Steps/Code to reproduce bug

Code to reproduce:

import copy
import torch
import librosa
from omegaconf import OmegaConf, open_dict
from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

if __name__ == "__main__":
    asr_model_path = "./nemo_experiments/FastConformer-Hybrid-Transducer-CTC-BPE/2024-07-02_09-14-19/checkpoints/FastConformer-Hybrid-Transducer-CTC-BPE.nemo"
    asr_model = EncDecHybridRNNTCTCBPEModel.restore_from(restore_path=asr_model_path)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    audio_path = "./test_data/2321187.wav"
    input_wav, sr = librosa.load(audio_path, sr=16000)

    decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
    with open_dict(decoding_cfg):
        decoding_cfg.preserve_alignments = True
        decoding_cfg.compute_timestamps = True
        asr_model.change_decoding_strategy(decoding_cfg)

    # specify flag `return_hypotheses=True`
    hypotheses = asr_model.transcribe(input_wav, return_hypotheses=True)

    # if hypotheses form a tuple (from RNNT), extract just "best" hypotheses
    if isinstance(hypotheses, tuple) and len(hypotheses) == 2:
        hypotheses = hypotheses[0]

    timestamp_dict = hypotheses[0].timestep  # extract timesteps from hypothesis of first (and only) audio file
    print("Hypothesis contains following timestep information :", list(timestamp_dict.keys()))

    # For a FastConformer model, you can display the word timestamps as follows:
    # 80ms is duration of a timestep at output of the Conformer
    time_stride = 8 * asr_model.cfg.preprocessor.window_stride

    word_timestamps = timestamp_dict['word']

    for stamp in word_timestamps:
        start = stamp['start_offset'] * time_stride
        end = stamp['end_offset'] * time_stride
        word = stamp['char'] if 'char' in stamp else stamp['word']

        print(f"Time : {start:0.2f} - {end:0.2f} - {word}")
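
The offset-to-seconds arithmetic at the end of the script can be checked without a model. A minimal sketch, assuming a `window_stride` of 0.01 s (which is what the 80 ms timestep and the factor of 8 in the comments imply); the two sample entries are made up for illustration and are not output from the reporter's checkpoint.

```python
def frame_duration(window_stride=0.01, subsampling_factor=8):
    """Seconds covered by one encoder output frame (assumed values, see above)."""
    return subsampling_factor * window_stride


def format_word_timestamps(word_timestamps, time_stride):
    """Turn offset dictionaries into 'start - end - word' strings."""
    lines = []
    for stamp in word_timestamps:
        start = stamp["start_offset"] * time_stride
        end = stamp["end_offset"] * time_stride
        # Some entries carry the text under 'char', others under 'word'.
        word = stamp["char"] if "char" in stamp else stamp["word"]
        lines.append(f"{start:0.2f} - {end:0.2f} - {word}")
    return lines


if __name__ == "__main__":
    # Hypothetical offsets, only to exercise the arithmetic.
    sample = [
        {"word": "hello", "start_offset": 5, "end_offset": 12},
        {"char": "world", "start_offset": 14, "end_offset": 20},
    ]
    for line in format_word_timestamps(sample, frame_duration()):
        print(line)
```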

Expected behavior

Inference code runs and it is possible to get the result.

Environment overview (please complete the following information)

Environment location: Vast.ai instance.
Method of NeMo install: pip install from source.

Environment details

If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:

OS version: Ubuntu 22
PyTorch version: 2.2.1
Python version: 3.11

Additional context

Also tried installing in WSL2 on Windows, but got the same bug.
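
One way to check whether the failure is tied to GPU decoding is to run the same call with the model on the CPU. This is a diagnostic sketch, not a confirmed fix: it assumes `transcribe()` runs on whatever device the model's parameters live on, which this thread does not verify, and `transcribe_on_cpu` is a hypothetical wrapper name. It relies only on the standard `torch.nn.Module` methods `to()` and `eval()`.

```python
def transcribe_on_cpu(asr_model, audio, **transcribe_kwargs):
    """Move the model to the CPU, switch to eval mode, then transcribe.

    Hypothetical wrapper; `asr_model` is any object exposing to(), eval()
    and transcribe(), such as the restored model in the script above.
    """
    asr_model = asr_model.to("cpu")  # nn.Module.to() returns the module itself
    asr_model.eval()
    return asr_model.transcribe(audio, **transcribe_kwargs)
```

In the reproduction script this would replace the direct call, for example `hypotheses = transcribe_on_cpu(asr_model, input_wav, return_hypotheses=True)`. If timestamps come back on the CPU but not on the GPU, that narrows the problem to device handling in the timestamp computation.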

nithinraok commented 3 months ago

@KunalDhawan pls have a look at this issue.

KunalDhawan commented 3 months ago

Hi @MarcisTU, Thank you for the detailed description! Could you please help me reproduce the issue?

I tried to replicate the issue on my end but I am able to run the code snippet you shared above without any errors. Let me describe my replication setup in detail:

Environment: I built a fresh conda env using NeMo main.

Reproducing the code:

>>> import copy
>>> import torch
>>> import librosa
>>> from omegaconf import OmegaConf, open_dict
>>> from nemo.collections.asr.models import EncDecHybridRNNTCTCBPEModel

>>> asr_model_path = "/models/stt_de_fastconformer_hybrid_large_pc.nemo"
>>> asr_model = EncDecHybridRNNTCTCBPEModel.restore_from(restore_path=asr_model_path)
.....
[NeMo I 2024-08-01 17:36:37 save_restore_connector:275] Model EncDecHybridRNNTCTCBPEModel was successfully restored from /models/stt_de_fastconformer_hybrid_large_pc.nemo.

>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> device
device(type='cuda')

>>> audio_path = "/data/ASR/en/librispeech/wav/test-clean/61-70968-0000.wav"
>>> input_wav, sr = librosa.load(audio_path, sr=16000)

>>> decoding_cfg = copy.deepcopy(asr_model.cfg.decoding)
>>> with open_dict(decoding_cfg):
...     decoding_cfg.preserve_alignments = True
...     decoding_cfg.compute_timestamps = True
...     asr_model.change_decoding_strategy(decoding_cfg)

[NeMo I 2024-08-01 17:38:17 rnnt_models:224] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2024-08-01 17:38:17 hybrid_rnnt_ctc_bpe_models:457] Changed decoding strategy of the RNNT decoder to 
    model_type: rnnt
    strategy: greedy_batch
    compute_hypothesis_token_set: false
    preserve_alignments: true
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      tdt_include_duration: false
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    fused_batch_size: null
    compute_timestamps: true
    compute_langs: false
    word_seperator: ' '
    rnnt_timestamp_type: all
    greedy:
      max_symbols_per_step: 10
      preserve_alignments: false
      preserve_frame_confidence: false
      tdt_include_duration_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
      loop_labels: true
      use_cuda_graph_decoder: true
      max_symbols: 10
    beam:
      beam_size: 2
      search_type: default
      score_norm: true
      return_best_hypothesis: false
      tsd_max_sym_exp_per_step: 50
      alsd_max_target_len: 2.0
      nsc_max_timesteps_expansion: 1
      nsc_prefix_alpha: 1
      maes_num_steps: 2
      maes_prefix_alpha: 1
      maes_expansion_gamma: 2.3
      maes_expansion_beta: 2
      language_model: null
      softmax_temperature: 1.0
      preserve_alignments: false
      ngram_lm_model: null
      ngram_lm_alpha: 0.0
      hat_subtract_ilm: false
      hat_ilm_weight: 0.0
      tsd_max_sym_exp: 50
    temperature: 1.0
    durations: []
    big_blank_durations: []

>>> hypotheses = asr_model.transcribe(input_wav, return_hypotheses=True)
Transcribing: 100%|██████████| 1/1 [00:01<00:00,  1.45s/it]

>>> hypotheses
([Hypothesis(score=-26.476362228393555, y_sequence=tensor([   5, 1004,  354,    6,    5,   91,  255,  255,   38,   21,  166,    8,
           5,  255,   22,   15,   90,   28,  621,    5,   91,  354,   38,   16,
           5,    8,  121,    5,  153,   43,   43,    9,    8,   50,   22,    5,
         482,    3,    8,  236,  199,   22,  114,    4,  551,   95,    8,   38,
          10,    5,    8,  121,    5,  247,  291,    2], device='cuda:0'), text='Ygan accinfust complant against the wizzertwo identisch boye Curtin und the loft .', dec_out=None, dec_state=(tensor([[[ 1.1207e-03,  2.4281e-03,  3.4973e-08, -7.2756e-01, -3.1125e-05,
          -3.0149e-04,  1.0904e-05, -9.7909e-06, -1.3529e-04,  5.1211e-05,
           2.7526e-04, -7.5015e-01, -7.4380e-01,  4.6821e-02,  3.1424e-10,
          -1.6009e-07,  7.6155e-01, -9.2109e-07,  6.8400e-05, -7.5959e-01,
           7.5943e-01, -7.5767e-01,  2.4640e-06,  2.4134e-05, -2.8190e-03,
           ........

I was able to transcribe with a FastConformer-Hybrid-Transducer-CTC-BPE model without any issues. Could you please share more details about your setup so we can identify where it differs from my replication?
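
A quick way to compare setups is to print the interpreter and package versions from the exact environment that runs the script; note that the traceback paths point at python3.10 while the report lists Python 3.11. The sketch below uses only the standard library. The distribution names in `packages` are assumptions (in particular, that NeMo is installed under the name `nemo_toolkit`); anything not found is reported as "not installed".

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("nemo_toolkit", "torch", "numpy", "numba")):
    """Return a dict of interpreter details and installed package versions."""
    report = {
        "python": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```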

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

NVIDIA/NeMo, issue #9598