NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[SD + ASR] Can't use a local ASR model to produce word timestamps #5546

Closed · triumph9989 closed this issue 1 year ago

triumph9989 commented 1 year ago

Hi, I want to plug my own ASR model into offline_diar_with_asr_infer.py and use asr_based_vad, but the following error appears.

Traceback (most recent call last):
  File "offline_diar_with_asr_infer.py", line 55, in main
    word_hyp, word_ts_hyp = asr_decoder_ts.run_ASR(asr_model)
TypeError: 'NoneType' object is not callable

If cfg.diarizer.asr.model_path=???, it shows another error: ValueError: `cfg` must have `tokenizer` config to create a tokenizer !

I think this is caused by an inconsistency between my ASR model class and this task, but I'm not sure. Is there something I have missed?

Steps/Code to reproduce the bug: offline_diar_with_asr_infer.py

from omegaconf import OmegaConf

from nemo.collections.asr.parts.utils.decoder_timestamps_utils import ASRDecoderTimeStamps
from nemo.collections.asr.parts.utils.diarization_utils import OfflineDiarWithASR
from nemo.core.config import hydra_runner
from nemo.utils import logging

import nemo.collections.asr as nemo_asr
@hydra_runner(config_path="../conf/inference", config_name="diar_infer_meeting.yaml")
def main(cfg):

    logging.info(f'Hydra config: {OmegaConf.to_yaml(cfg)}')

    # ASR inference for words and word timestamps
    asr_decoder_ts = ASRDecoderTimeStamps(cfg.diarizer)
    asr_model = nemo_asr.models.EncDecCTCModel.restore_from('/home/face/NeMo/examples/asr/exp/Conformer-CTC-Char-Aishell-100ep-lr0.9/2022-12-01_09-43-05/checkpoints/Conformer-CTC-Char-Aishell-100ep-lr0.9.nemo')
    #asr_model = asr_decoder_ts.set_asr_model()
    word_hyp, word_ts_hyp = asr_decoder_ts.run_ASR(asr_model)

    # Create a class instance for matching ASR and diarization results
    asr_diar_offline = OfflineDiarWithASR(cfg.diarizer)
    asr_diar_offline.word_ts_anchor_offset = asr_decoder_ts.word_ts_anchor_offset

    # Diarization inference for speaker labels
    diar_hyp, diar_score = asr_diar_offline.run_diarization(cfg, word_ts_hyp)
    trans_info_dict = asr_diar_offline.get_transcript_with_speaker_labels(diar_hyp, word_hyp, word_ts_hyp)

    # If an RTTM file is provided, run DER evaluation
    if diar_score is not None:
        metric, mapping_dict, _ = diar_score

        # Get session-level diarization error rate and speaker counting error
        der_results = OfflineDiarWithASR.gather_eval_results(
            diar_score=diar_score,
            audio_rttm_map_dict=asr_diar_offline.AUDIO_RTTM_MAP,
            trans_info_dict=trans_info_dict,
            root_path=asr_diar_offline.root_path,
        )

        # Calculate WER and cpWER if reference CTM files exist
        wer_results = OfflineDiarWithASR.evaluate(
            hyp_trans_info_dict=trans_info_dict,
            audio_file_list=asr_diar_offline.audio_file_list,
            ref_ctm_file_list=asr_diar_offline.ctm_file_list,
        )

        # Print average DER, WER and cpWER
        OfflineDiarWithASR.print_errors(der_results=der_results, wer_results=wer_results)

        # Save detailed session-level evaluation results in `root_path`.
        OfflineDiarWithASR.write_session_level_result_in_csv(
            der_results=der_results,
            wer_results=wer_results,
            root_path=asr_diar_offline.root_path,
            csv_columns=asr_diar_offline.csv_columns,
        )

if __name__ == '__main__':
    main()

diar_infer_meeting.yaml

name: &name "ClusterDiarizer"

num_workers: 4
sample_rate: 16000
batch_size: 64

diarizer:
  manifest_filepath: /home/face/NeMo/examples/speaker_tasks/diarization/manifest_Ali_far.json
  out_dir: output_Ali_far
  oracle_vad: False # If True, uses RTTM files provided in the manifest file to get speech activity (VAD) timestamps
  collar: 0.25 # Collar value for scoring
  ignore_overlap: True # Consider or ignore overlap segments while scoring

  vad:
    model_path:  vad_multilingual_marblenet # .nemo local model path or pretrained VAD model name 
    external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. Only one of model_path or external_vad_manifest should be set

    parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) 
      window_length_in_sec: 0.63  # Window length in sec for VAD context input 
      shift_length_in_sec: 0.01 # Shift length in sec for generate frame level VAD prediction
      smoothing: False # False or type of smoothing method (eg: median)
      overlap: 0.5 # Overlap ratio for overlapped mean/median smoothing filter
      onset: 0.9 # Onset threshold for detecting the beginning and end of a speech 
      offset: 0.5 # Offset threshold for detecting the end of a speech
      pad_onset: 0 # Adding durations before each speech segment 
      pad_offset: 0 # Adding durations after each speech segment 
      min_duration_on: 0 # Threshold for small non_speech deletion
      min_duration_off: 0.6 # Threshold for short speech segment deletion
      filter_speech_first: True 

  speaker_embeddings:
    model_path: /home/face/NeMo/examples/speaker_tasks/diarization/spk_model/TitaNet-Finetune-encoder-cn2-300-adjust.nemo # .nemo local model path or pretrained model name (titanet_large, ecapa_tdnn or speakerverification_speakernet)
    parameters:
      window_length_in_sec: [3.0,2.5,2.0,1.5,1.0,0.5] # Window length(s) in sec (floating-point number). either a number or a list. ex) 1.5 or [1.5,1.0,0.5]
      shift_length_in_sec: [1.5,1.25,1.0,0.75,0.5,0.25] # Shift length(s) in sec (floating-point number). either a number or a list. ex) 0.75 or [0.75,0.5,0.25]
      multiscale_weights: [1,1,1,1,1,1] # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. ex) [0.33,0.33,0.33]
      save_embeddings: True # If True, save speaker embeddings in pickle format. This should be True if clustering result is used for other models, such as `msdd_model`.

  clustering:
    parameters:
      oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
      max_num_speakers: 8 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
      enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
      max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. 
      sparse_search_volume: 30 # The higher the number, the more values will be examined with more time. 
      maj_vote_spk_count: False  # If True, take a majority vote on multiple p-values to estimate the number of speakers.

  msdd_model:
    model_path: ??? # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
    parameters:
      use_speaker_model_from_ckpt: True # If True, use speaker embedding model in checkpoint. If False, the provided speaker embedding model in config will be used.
      infer_batch_size: 25 # Batch size for MSDD inference. 
      sigmoid_threshold: [0.7] # Sigmoid threshold for generating binarized speaker labels. The smaller the more generous on detecting overlaps.
      seq_eval_mode: False # If True, use oracle number of speaker and evaluate F1 score for the given speaker sequences. Default is False.
      split_infer: True # If True, break the input audio clip to short sequences and calculate cluster average embeddings for inference.
      diar_window_length: 50 # The length of split short sequence when split_infer is True.
      overlap_infer_spk_limit: 5 # If the estimated number of speakers are larger than this number, overlap speech is not estimated.

  asr:
    model_path: stt_en_conformer_ctc_large # Provide NGC cloud ASR model name. stt_en_conformer_ctc_* models are recommended for diarization purposes.
    parameters:
      asr_based_vad: True # if True, speech segmentation for diarization is based on word-timestamps from ASR inference.
      asr_based_vad_threshold: 100 # Threshold (in sec) that caps the gap between two words when generating VAD timestamps using ASR based VAD.
      asr_batch_size: null # Batch size can be dependent on each ASR model. Default batch sizes are applied if set to null.
      decoder_delay_in_sec: null # Native decoder delay. null is recommended to use the default values for each ASR model.
      word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05  0.2]. 
      word_ts_anchor_pos: "start" # Select which part of the word timestamp we want to use. The options are: 'start', 'end', 'mid'.
      fix_word_ts_with_VAD: False # Fix the word timestamp using VAD output. You must provide a VAD model to use this feature.
      colored_text: False # If True, use colored text to distinguish speakers in the output transcript.
      print_time: True # If True, the start and end time of each speaker turn is printed in the output transcript.
      break_lines: False # If True, the output transcript breaks the line to fix the line width (default is 90 chars)

    ctc_decoder_parameters: # Optional beam search decoder (pyctcdecode)
      pretrained_language_model: null # KenLM model file: .arpa model file or .bin binary file.
      beam_width: 32
      alpha: 0.5
      beta: 2.5

    realigning_lm_parameters: # Experimental feature
      arpa_language_model: null # Provide a KenLM language model in .arpa format.
      min_number_of_words: 3 # Min number of words for the left context.
      max_number_of_words: 10 # Max number of words for the right context.
      logprob_diff_threshold: 1.2  # The threshold for the difference between two log probability values from two hypotheses.

Additional context

GPU: GeForce RTX 2080 Ti

tango4j commented 1 year ago

An asr_decoder_ts instance must be created to perform ASR with diarization; it should not be None. For this, you should provide a NeMo-based ASR model. It currently supports QuartzNet, Citrinet, and Conformer-CTC based ASR models. Providing "???" to asr.model_path throws an error because no ASR model is given. Please train or download a NeMo-based ASR model from NGC and provide the .nemo file path in asr.model_path.
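
For reference, a minimal sketch (using the standard NeMo `from_pretrained` and `save_to` methods; the output filename is just an example) of downloading a pretrained model from NGC and saving it as a .nemo file that asr.model_path can point to:

import nemo.collections.asr as nemo_asr

# Download stt_en_conformer_ctc_large from NGC (cached locally on first call)
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# Write a .nemo file and point cfg.diarizer.asr.model_path at this path
asr_model.save_to("stt_en_conformer_ctc_large.nemo")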

triumph9989 commented 1 year ago

@tango4j Thanks for looking into my issue. I added my Conformer-CTC model (EncDecCTCModel) directly in diar_infer_meeting.yaml:

  asr:
    model_path: /home/face/NeMo/examples/asr/exp/Conformer-CTC-Char-Aishell-100ep-lr0.9/2022-12-01_09-43-05/checkpoints/Conformer-CTC-Char-Aishell-100ep-lr0.9.nemo

Error log:

[NeMo I 2022-12-07 14:03:52 speaker_utils:92] Number of files to diarize: 1
[NeMo E 2022-12-07 14:04:05 common:505] Model instantiation failed!
    Target class:       nemo.collections.asr.models.ctc_models.EncDecCTCModel
    Error(s):   `cfg` must have `tokenizer` config to create a tokenizer !
    Traceback (most recent call last):
      File "/home/face/NeMo/nemo/core/classes/common.py", line 484, in from_config_dict
        instance = imported_cls(cfg=config, trainer=trainer)
      File "/home/face/NeMo/nemo/collections/asr/models/ctc_bpe_models.py", line 44, in __init__
        raise ValueError("`cfg` must have `tokenizer` config to create a tokenizer !")
    ValueError: `cfg` must have `tokenizer` config to create a tokenizer !

Error executing job with overrides: []
Traceback (most recent call last):
  File "offline_diar_with_asr_infer.py", line 54, in main
    asr_model = asr_decoder_ts.set_asr_model()
  File "/home/face/NeMo/nemo/collections/asr/parts/utils/decoder_timestamps_utils.py", line 358, in set_asr_model
    asr_model = self.encdec_class.restore_from(restore_path=self.ASR_model_name)
  File "/home/face/NeMo/nemo/core/classes/modelPT.py", line 316, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/home/face/NeMo/nemo/core/connectors/save_restore_connector.py", line 235, in restore_from
    loaded_params = self.load_config_and_state_dict(
  File "/home/face/NeMo/nemo/core/connectors/save_restore_connector.py", line 158, in load_config_and_state_dict
    instance = calling_cls.from_config_dict(config=conf, trainer=trainer)
  File "/home/face/NeMo/nemo/core/classes/common.py", line 506, in from_config_dict
    raise e
  File "/home/face/NeMo/nemo/core/classes/common.py", line 498, in from_config_dict
    instance = cls(cfg=config, trainer=trainer)
  File "/home/face/NeMo/nemo/collections/asr/models/ctc_bpe_models.py", line 44, in __init__
    raise ValueError("`cfg` must have `tokenizer` config to create a tokenizer !")
ValueError: `cfg` must have `tokenizer` config to create a tokenizer !

Also, I tried the pre-trained model stt_en_conformer_ctc_large from NGC and another model, /home/face/NeMo/examples/asr/exp/QuartzNet15x5-lr2.2-ep100.nemo, that I trained before, and neither showed this bug.
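
A minimal sketch of the contrast (the path is a placeholder): restoring the char Conformer directly with EncDecCTCModel succeeds, while the traceback above appears to show set_asr_model routing the checkpoint through ctc_bpe_models.py (EncDecCTCModelBPE), which requires a tokenizer config:

from nemo.collections.asr.models import EncDecCTCModel

# Direct restore with the char-based class works for this checkpoint;
# set_asr_model instead restored it as a BPE model and failed on the missing tokenizer
model = EncDecCTCModel.restore_from("/path/to/Conformer-CTC-Char.nemo")
print(model.cfg.labels)  # the character vocabulary of a char-based model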

tango4j commented 1 year ago

@triumph9989 Your model does not have a tokenizer. decoder_timestamps_utils cannot function without a tokenizer in your model. Add a tokenizer to your model and check if it works.

Make sure your model has the same class structure as the NeMo ASR model classes. Otherwise, your ASR model won't work with NeMo modules.
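
If it helps, a rough sketch of the tokenizer-training step (assuming a plain-text transcript corpus at corpus.txt; NeMo also ships scripts/tokenizers/process_asr_text_tokenizer.py, which wraps SentencePiece):

import sentencepiece as spm

# Train a BPE tokenizer from one-transcript-per-line text; this writes
# tokenizer.model / tokenizer.vocab for the model's `tokenizer` config
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=1024,
    model_type="bpe",
)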

triumph9989 commented 1 year ago

@tango4j Thank you for being so helpful. I'm training with a tokenizer now. But according to NeMo's documentation, using a tokenizer means using sub-word encoding. So decoder_timestamps_utils can't use a character-based ASR model?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

tango4j commented 1 year ago

@triumph9989 Since decoder_timestamps_utils can use QuartzNet, I suppose a char-based tokenizer can be used. However, be careful when you replace the NeMo ASR model class, since the diarization + ASR framework assumes a NeMo ASR class is being used.
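
As a quick sanity check before handing a checkpoint to the diarizer, a small diagnostic sketch (the path is a placeholder): restoring through the ASRModel base class instantiates whatever class was saved, and you can then see whether it carries a tokenizer config:

import nemo.collections.asr as nemo_asr

# Restore via the base class; NeMo instantiates the class recorded in the checkpoint
model = nemo_asr.models.ASRModel.restore_from("/path/to/your_model.nemo")

print(type(model).__name__)      # e.g. EncDecCTCModel (char) or EncDecCTCModelBPE (sub-word)
print("tokenizer" in model.cfg)  # True only for sub-word (tokenizer-based) models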