NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

STT_EN_FASTCONFORMER_TRANSDUCER_XLARGE - Throws error for Tensor + List operation in confidence calculation #10066


Vladi-SmartAssets commented 1 month ago

Describe the bug

`TypeError: unsupported operand type(s) for +: 'Tensor' and 'list'` is raised when extracting confidence scores from the STT FastConformer model.

Steps/Code to reproduce bug

```python
# Imports needed to run the snippet (paths follow the current NeMo layout)
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTDecodingConfig
from nemo.collections.asr.parts.utils.asr_confidence_utils import ConfidenceConfig, ConfidenceMethodConfig


class NemoModel(HuggingFaceBaseModel):  # HuggingFaceBaseModel is our own wrapper base class, not part of NeMo

    def __init__(self, model_name, model_path):
        super().__init__(model_name)
        self.model_path = model_path
        self.model = None

    def load_model(self):
        self.model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(self.model_path, map_location="mps")

    def predict(self, input_paths):
        confidence_cfg = ConfidenceConfig(
            preserve_frame_confidence=True,  # Internally set to true if preserve_token_confidence == True
            # or preserve_word_confidence == True
            preserve_token_confidence=True,  # Internally set to true if preserve_word_confidence == True
            preserve_word_confidence=True,
            aggregation="prod",  # How to aggregate frame scores to token scores and token scores to word scores
            exclude_blank=False,  # If true, only non-blank emissions contribute to confidence scores
            tdt_include_duration=False,  # If true, calculate duration confidence for the TDT models
            method_cfg=ConfidenceMethodConfig(  # Config for per-frame scores calculation (before aggregation)
                name="max_prob",  # Or "entropy" (default), which usually works better
                entropy_type="gibbs",  # Used only for name == "entropy". Recommended: "tsallis" (default) or "renyi"
                alpha=0.5,  # Low values (<1) increase sensitivity, high values decrease sensitivity
                entropy_norm="lin",  # How to normalize (map to [0,1]) entropy. Default: "exp"
            ),
        )
        self.model.change_decoding_strategy(
            RNNTDecodingConfig(fused_batch_size=-1, strategy="greedy_batch", confidence_cfg=confidence_cfg)
        )

        transcriptions = self.model.transcribe(
            audio=input_paths, return_hypotheses=True
        )

        fastconformer_transcriptions = [x for x in transcriptions][0]

        return fastconformer_transcriptions
```
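For context, the wrapper is driven roughly as follows; the model path and audio file name here are hypothetical placeholders:

```python
# Hypothetical driver for the wrapper above; the .nemo path and audio file name are placeholders
model = NemoModel("stt_en_fastconformer_transducer_xlarge", "/path/to/model.nemo")
model.load_model()
result = model.predict(["sample_audio.wav"])  # raises the TypeError shown below
```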
When run, the `model.transcribe` call throws the following error:

`for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):`
`TypeError: unsupported operand type(s) for +: 'Tensor' and 'list'`

Expected behavior

The expected behaviour is for the `zip` call to handle the `torch.Tensor` rather than fail on the list concatenation, since `hyp.timestep` is a `Tensor` and `hyp.frame_confidence` is a list of tensors (`Tensor[float]`, `List[Tensor[float]]`).
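To make the mismatch concrete, here is a minimal sketch with stand-in values (the numbers are hypothetical, only the types match the report); converting the tensor slice to a list is shown purely as an illustration of a type-consistent pairing, not as the upstream fix:

```python
import torch

# Stand-ins for the relevant hypothesis fields (hypothetical values, real types)
hyp_timestep = torch.tensor([0, 3, 7])            # Tensor: frame indices of token emissions
hyp_frame_confidence = [torch.tensor(0.9)] * 10   # list of Tensors: per-frame confidence scores

# The failing expression concatenates a Tensor slice with a Python list:
#   hyp_timestep[1:] + [len(hyp_frame_confidence)]
# which, as reported above, raises:
#   TypeError: unsupported operand type(s) for +: 'Tensor' and 'list'

# With both operands as plain lists, the (start, end) pairing works as intended:
ends = hyp_timestep[1:].tolist() + [len(hyp_frame_confidence)]
for ts, te in zip(hyp_timestep.tolist(), ends):
    print(ts, te)  # prints the frame ranges: 0 3, 3 7, 7 10
```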


Additional context

Using MPS (Apple Silicon) as the device, i.e. `map_location="mps"` in `restore_from`.

Proposed solution: replace the following line 633 in `nemo/collections/asr/parts/submodules/rnnt_decoding.py`:

`for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):`

with

`for ts, te in zip(hyp.timestep, hyp.timestep[1:] + len(hyp.frame_confidence)):`
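For reference, with a `Tensor` on the left-hand side the unbracketed `len(...)` is broadcast-added to every element rather than appended as a final boundary; a minimal sketch with hypothetical values:

```python
import torch

hyp_timestep = torch.tensor([0, 3, 7])  # hypothetical frame indices
num_frames = 10                         # stands in for len(hyp.frame_confidence)

# Unbracketed addition broadcasts the scalar over the tensor slice:
print(hyp_timestep[1:] + num_frames)    # tensor([13, 17])
```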

GNroy commented 3 weeks ago

@Vladi-SmartAssets Hi, I cannot reproduce the issue in the latest main. What NeMo version are you using?
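For reference, the installed version can be printed with a one-liner, assuming the package exposes its version string as in standard builds:

```python
import nemo
print(nemo.__version__)  # prints the installed NeMo toolkit version string
```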