Describe the bug
A TypeError (unsupported operand type(s) for +: 'Tensor' and 'list') is raised when extracting confidence scores from the STT FastConformer model.
Steps/Code to reproduce bug
`import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTDecodingConfig
from nemo.collections.asr.parts.utils.asr_confidence_utils import ConfidenceConfig, ConfidenceMethodConfig

# HuggingFaceBaseModel is the reporter's own base class; its definition is not shown here.
class NemoModel(HuggingFaceBaseModel):
    def __init__(self, model_name, model_path):
        super().__init__(model_name)
        self.model_path = model_path
        self.model = None

    def load_model(self):
        self.model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(self.model_path, map_location="mps")

    def predict(self, input_paths):
        confidence_cfg = ConfidenceConfig(
            preserve_frame_confidence=True,  # Internally set to true if preserve_token_confidence == True
            # or preserve_word_confidence == True
            preserve_token_confidence=True,  # Internally set to true if preserve_word_confidence == True
            preserve_word_confidence=True,
            aggregation="prod",  # How to aggregate frame scores to token scores and token scores to word scores
            exclude_blank=False,  # If true, only non-blank emissions contribute to confidence scores
            tdt_include_duration=False,  # If true, calculate duration confidence for the TDT models
            method_cfg=ConfidenceMethodConfig(  # Config for per-frame scores calculation (before aggregation)
                name="max_prob",  # Or "entropy" (default), which usually works better
                entropy_type="gibbs",  # Used only for name == "entropy". Recommended: "tsallis" (default) or "renyi"
                alpha=0.5,  # Low values (<1) increase sensitivity, high values decrease sensitivity
                entropy_norm="lin",  # How to normalize (map to [0,1]) entropy. Default: "exp"
            ),
        )
        # Switch to batched greedy decoding with confidence estimation enabled.
        self.model.change_decoding_strategy(
            RNNTDecodingConfig(
                fused_batch_size=-1,
                strategy="greedy_batch",
                confidence_cfg=confidence_cfg,
            )
        )
        transcriptions = self.model.transcribe(
            audio=input_paths, return_hypotheses=True
        )
        fastconformer_transcriptions = [x for x in transcriptions][0]
        return fastconformer_transcriptions
`
Calling model.transcribe with this configuration raises the following error:
for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
TypeError: unsupported operand type(s) for +: 'Tensor' and 'list'
Expected behavior
The expected behaviour is for the expression passed to zip to handle a torch.Tensor rather than a list, because hyp.timestep is a Tensor while hyp.frame_confidence is a list of tensors (Tensor[float] and List[Tensor[float]], respectively).
Environment overview
Environment location: Docker
Method of NeMo install: pip install nemo
Environment details
OS version: MacOS 14.5 (23F79)
PyTorch version: 2.3.1
Python version: 3.10
Additional context
Using MPS (the model is restored with map_location="mps").
Proposed solution:
Replace line 633 of nemo/collections/asr/parts/submodules/rnnt_decoding.py:
for ts, te in zip(hyp.timestep, hyp.timestep[1:] + [len(hyp.frame_confidence)]):
with
for ts, te in zip(hyp.timestep, hyp.timestep[1:] + len(hyp.frame_confidence)):
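One caveat worth checking before applying the replacement above: in PyTorch, adding an int to a tensor is an elementwise operation, so hyp.timestep[1:] + len(hyp.frame_confidence) would shift every remaining timestep by the frame count instead of appending a final end frame. A minimal sketch of an alternative is to normalise the timesteps to a plain Python list first, so that + means concatenation. The helper name token_frame_spans and the sample values are hypothetical and not part of NeMo; the sketch only assumes the input is a list or exposes .tolist(), as a 1-D torch.Tensor does.

```python
from typing import List, Tuple


def token_frame_spans(timesteps, num_frames: int) -> List[Tuple[int, int]]:
    """Pair each token's start frame with the next token's start frame.

    `timesteps` may be a plain list or any object exposing `.tolist()`
    (for example a 1-D torch.Tensor). The last token's span ends at
    `num_frames`. Hypothetical helper, not part of NeMo.
    """
    # Normalise to a list of Python ints so that `+` below means list
    # concatenation and never elementwise tensor addition.
    if hasattr(timesteps, "tolist"):
        timesteps = timesteps.tolist()
    starts = [int(t) for t in timesteps]
    ends = starts[1:] + [num_frames]
    return list(zip(starts, ends))


# Made-up values: three tokens emitted at frames 0, 3 and 7 out of 10 frames.
spans = token_frame_spans([0, 3, 7], num_frames=10)
print(spans)  # [(0, 3), (3, 7), (7, 10)]
```

Under this assumption, the loop in rnnt_decoding.py could iterate over the pairs returned for hyp.timestep and len(hyp.frame_confidence) whether the timesteps arrive as a list or as a tensor.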