NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[ASR_with_SpeakerDiarization] Fine-Tuned TitaNet predicts only one speaker #5249

Closed Klassikcat closed 1 year ago

Klassikcat commented 1 year ago

Describe the bug

I fine-tuned the TitaNet-Large model for 10 epochs on a Korean dataset of 1,000,000 samples from 60 speakers, because the NGC TitaNet (trained on English) predicted only one speaker. The loss decreased in both the training and validation steps (the minimum loss was 0.0079).

However, the fine-tuned model also predicts only one speaker, speaker_0:

{'sample': ['1.8 2.16 speaker_0',
  '2.32 2.52 speaker_0',
  '2.92 3.4 speaker_0',
  '4.16 5.0 speaker_0',
  '7.44 8.08 speaker_0',
  '8.88 9.16 speaker_0',
  '9.48 9.84 speaker_0',
  '12.76 13.08 speaker_0',
  '13.48 13.68 speaker_0',
  '14.76 15.24 speaker_0',
  '15.8 16.400000000000002 speaker_0',
  '16.88 17.04 speaker_0',
  '17.2 18.24 speaker_0',
  '20.4 20.72 speaker_0',
  '20.88 21.24 speaker_0',
  '25.2 25.68 speaker_0',
...]
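
A quick way to confirm this from the output is to count the distinct speaker labels per session. The sketch below assumes the "start end speaker" string format shown above; `count_speakers` is a hypothetical helper, not part of NeMo.

```python
from collections import Counter


def count_speakers(diar_hyp):
    """Count segments per speaker label in a {session: ["start end speaker", ...]} dict."""
    counts = {}
    for session, segments in diar_hyp.items():
        # The speaker label is the last whitespace-separated field of each entry.
        counts[session] = Counter(seg.split()[-1] for seg in segments)
    return counts


# First three entries of the hypothesis shown above.
diar_hyp = {'sample': ['1.8 2.16 speaker_0', '2.32 2.52 speaker_0', '2.92 3.4 speaker_0']}
speaker_counts = count_speakers(diar_hyp)
print(speaker_counts)
```

A session with a single key in its counter contains only one predicted speaker.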

Since the label in the manifest file is an integer, this looks like a label-related issue. However, there are several possible causes:

  1. As suspected above, it is a label-related issue.
  2. 1,000,000 samples from 60 speakers are too few for fine-tuning TitaNet.
  3. There are problems in the configuration YAML file.
  4. The model is overfitted to the training data.
  5. The model was not initialized from the pre-trained weights.
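
The first hypothesis can be checked by comparing the labels in the training manifest with the `labels` list in the config, including their types. This is a minimal sketch, assuming one JSON object per manifest line; `check_labels` and the inline sample data are hypothetical.

```python
import json


def check_labels(manifest_lines, config_labels):
    """Return manifest labels missing from the config, compared exactly and as strings."""
    manifest_labels = {json.loads(line)["label"] for line in manifest_lines if line.strip()}
    # Exact comparison is type-sensitive: 3544 (int) is not equal to "3544" (str).
    missing_exact = manifest_labels - set(config_labels)
    # Comparing string forms ignores the int/str distinction.
    missing_as_str = {str(l) for l in manifest_labels} - {str(l) for l in config_labels}
    return missing_exact, missing_as_str


# Hypothetical data: an integer label in the manifest, string labels in the config.
lines = ['{"audio_filepath": "a.wav", "label": 3544, "offset": 0, "duration": 1.38}']
missing_exact, missing_as_str = check_labels(lines, ["3544", "8"])
print(missing_exact, missing_as_str)
```

If the first set is non-empty while the second is empty, the labels agree in value but differ in type.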

+ Additional Information

The diarization score is None in the code below (from ASR_with_SpeakerDiarization.ipynb):

diar_hyp, diar_score = asr_diar_offline.run_diarization(cfg, word_ts_hyp)

Steps/Code to reproduce bug

Example of manifest file for training:

{"audio_filepath":"\/home\/me\/datas\/speaker_tasks\/datas\/1.Training\/original\/call\/2021-12-16\/3544\/A0210-3544M2010-11020010-06897809.wav","label":3544,"offset":0,"duration":1.38}
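
The escaped slashes in this line are valid JSON and decode to an ordinary path, and the label is read as an integer. A small sketch using only the Python standard library to show how the line parses:

```python
import json

# The manifest line from above; "\/" is a valid JSON escape for "/".
line = r'{"audio_filepath":"\/home\/me\/datas\/speaker_tasks\/datas\/1.Training\/original\/call\/2021-12-16\/3544\/A0210-3544M2010-11020010-06897809.wav","label":3544,"offset":0,"duration":1.38}'

entry = json.loads(line)
print(entry["audio_filepath"])                       # path with plain forward slashes
print(type(entry["label"]).__name__, entry["label"])  # label type and value
```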

YAML Configuration for training

cfg:
  train_ds:
    manifest_filepath: train_speakers.json
    sample_rate: 16000
    labels:
    - 8
    - 191
    - 236
    - 366
    - 423
    - 432
    - 575
    - 624
    - 776
    - 922
    - 1053
    - 1057
    - 1123
    - 1136
    - 1221
    - 1330
    - 1347
    - 1379
    - 1462
    - 1553
    - 1625
    - 1667
    - 1718
    - 1964
    - 1989
    - 2062
    - 2104
    - 2122
    - 2131
    - 2177
    - 2254
    - 2357
    - 2556
    - 2615
    - 2662
    - 2674
    - 2704
    - 2758
    - 2805
    - 2987
    - 3007
    - 3070
    - 3093
    - 3159
    - 3499
    - 3544
    - 3548
    - 3562
    - 3573
    - 3580
    - 3604
    - 3664
    - 3715
    - 3740
    - 3775
    - 3791
    - 3858
    - 3988
    - 4055
    - 9036
    batch_size: 32
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      speed:
        prob: 0.3
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05
  validation_ds:
    manifest_filepath: eval_speakers.json
    sample_rate: 16000
    labels: null
    batch_size: 32
    shuffle: false
  model_defaults:
    filters: 1024
    repeat: 3
    dropout: 0.1
    separable: true
    se: true
    se_context_size: -1
    kernel_size_factor: 1.0
  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: per_feature
    window_size: 0.025
    sample_rate: 16000
    window_stride: 0.01
    window: hann
    features: 80
    n_fft: 512
    frame_splicing: 1
    dither: 1.0e-05
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 0
    freq_width: 4
    time_masks: 0
    time_width: 0.03
  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: 80
    activation: relu
    conv_mask: true
    jasper:
    - filters: 1024
      repeat: 1
      kernel:
      - 3
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: false
      separable: true
      se: true
      se_context_size: -1
    - filters: 1024
      repeat: 3
      kernel:
      - 7
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.1
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 1024
      repeat: 3
      kernel:
      - 11
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.1
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 1024
      repeat: 3
      kernel:
      - 15
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.1
      residual: true
      separable: true
      se: true
      se_context_size: -1
    - filters: 3072
      repeat: 1
      kernel:
      - 1
      stride:
      - 1
      dilation:
      - 1
      dropout: 0.0
      residual: false
      separable: true
      se: true
      se_context_size: -1
  decoder:
    _target_: nemo.collections.asr.modules.SpeakerDecoder
    feat_in: 3072
    num_classes: 60
    pool_mode: attention
    emb_sizes: 192
    angular: false
  loss:
    scale: 30
    margin: 0.2
  optim:
    name: sgd
    lr: 0.006
    weight_decay: 0.001
    sched:
      name: CosineAnnealing
      warmup_ratio: 0.1
      min_lr: 0.0
  target: nemo.collections.asr.models.label_models.EncDecSpeakerLabelModel
  nemo_version: 1.12.0

YAML Configuration for prediction

# This YAML file is created for all types of offline speaker diarization inference tasks in `<NeMo git root>/example/speaker_tasks/diarization` folder.
# The inference parameters for VAD, speaker embedding extractor, clustering module, MSDD module, ASR decoder are all included in this YAML file. 
# All the keys under `diarizer` key (`vad`, `speaker_embeddings`, `clustering`, `msdd_model`, `asr`) can be selectively used for its own purpose and also can be ignored if the module is not used.
# The configurations in this YAML file is suitable for telephone recordings involving 2~8 speakers in a session and may not show the best performance on the other types of acoustic conditions or dialogues.
# An example line in an input manifest file (`.json` format):
# {"audio_filepath": "/path/to/audio_file", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/file", "uem_filepath": "/path/to/uem/file"}
name: &name "ClusterDiarizer"

num_workers: 4
sample_rate: 16000
batch_size: 64

diarizer:
  manifest_filepath: '/home/me/codes/NeMo/input_manifest.json'
  out_dir: ???
  oracle_vad: False # If True, uses RTTM files provided in the manifest file to get speech activity (VAD) timestamps
  collar: 0.25 # Collar value for scoring
  ignore_overlap: True # Consider or ignore overlap segments while scoring

  vad:
    model_path: ??? # .nemo local model path or pretrained VAD model name 
    external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. Only one of model_path or external_vad_manifest should be set

    parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) 
      window_length_in_sec: 0.15  # Window length in sec for VAD context input 
      shift_length_in_sec: 0.01 # Shift length in sec for generate frame level VAD prediction
      smoothing: "median" # False or type of smoothing method (eg: median)
      overlap: 0.5 # Overlap ratio for overlapped mean/median smoothing filter
      onset: 0.1 # Onset threshold for detecting the beginning and end of a speech 
      offset: 0.1 # Offset threshold for detecting the end of a speech
      pad_onset: 0.1 # Adding durations before each speech segment 
      pad_offset: 0 # Adding durations after each speech segment 
      min_duration_on: 0 # Threshold for small non_speech deletion
      min_duration_off: 0.2 # Threshold for short speech segment deletion
      filter_speech_first: True 

  speaker_embeddings:
    model_path: '/home/insutil/codes/NeMo/nemo_experiments/TitaNet/2022-10-17_19-48-25/checkpoints/TitaNet.nemo' # .nemo local model path or pretrained model name (titanet_large, ecapa_tdnn or speakerverification_speakernet)
    parameters:
      window_length_in_sec: [1.5,1.25,1.0,0.75,0.5] # Window length(s) in sec (floating-point number). either a number or a list. ex) 1.5 or [1.5,1.0,0.5]
      shift_length_in_sec: [0.75,0.625,0.5,0.375,0.25] # Shift length(s) in sec (floating-point number). either a number or a list. ex) 0.75 or [0.75,0.5,0.25]
      multiscale_weights: [1,1,1,1,1] # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. ex) [0.33,0.33,0.33]
      save_embeddings: True # If True, save speaker embeddings in pickle format. This should be True if clustering result is used for other models, such as `msdd_model`.

  clustering:
    parameters:
      oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
      max_num_speakers: 2 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
      enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
      max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. 
      sparse_search_volume: 30 # The higher the number, the more values will be examined with more time. 
      maj_vote_spk_count: False  # If True, take a majority vote on multiple p-values to estimate the number of speakers.

  msdd_model:
    model_path: ??? # .nemo local model path or pretrained model name for multiscale diarization decoder (MSDD)
    parameters:
      use_speaker_model_from_ckpt: True # If True, use speaker embedding model in checkpoint. If False, the provided speaker embedding model in config will be used.
      infer_batch_size: 25 # Batch size for MSDD inference. 
      sigmoid_threshold: [0.7] # Sigmoid threshold for generating binarized speaker labels. The smaller the more generous on detecting overlaps.
      seq_eval_mode: False # If True, use oracle number of speaker and evaluate F1 score for the given speaker sequences. Default is False.
      split_infer: True # If True, break the input audio clip to short sequences and calculate cluster average embeddings for inference.
      diar_window_length: 50 # The length of split short sequence when split_infer is True.
      overlap_infer_spk_limit: 5 # If the estimated number of speakers are larger than this number, overlap speech is not estimated.

  asr:
    model_path: /home/me/codes/NeMo/checkpoints/conformer/Conformer-CTC-BPE.nemo # Provide NGC cloud ASR model name. stt_en_conformer_ctc_* models are recommended for diarization purposes.
    parameters:
      asr_based_vad: False # if True, speech segmentation for diarization is based on word-timestamps from ASR inference.
      asr_based_vad_threshold: 0.05 # Threshold (in sec) that caps the gap between two words when generating VAD timestamps using ASR based VAD.
      asr_batch_size: null # Batch size can be dependent on each ASR model. Default batch sizes are applied if set to null.
      lenient_overlap_WDER: True # If true, when a word falls into speaker-overlapped regions, consider the word as a correctly diarized word.
      decoder_delay_in_sec: null # Native decoder delay. null is recommended to use the default values for each ASR model.
      word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05  0.2]. 
      word_ts_anchor_pos: "start" # Select which part of the word timestamp we want to use. The options are: 'start', 'end', 'mid'.
      fix_word_ts_with_VAD: False # Fix the word timestamp using VAD output. You must provide a VAD model to use this feature.
      colored_text: False # If True, use colored text to distinguish speakers in the output transcript.
      print_time: True # If True, the start and end time of each speaker turn is printed in the output transcript.
      break_lines: False # If True, the output transcript breaks the line to fix the line width (default is 90 chars)

    ctc_decoder_parameters: # Optional beam search decoder (pyctcdecode)
      pretrained_language_model: null # KenLM model file: .arpa model file or .bin binary file.
      beam_width: 32
      alpha: 0.5
      beta: 2.5

    realigning_lm_parameters: # Experimental feature
      arpa_language_model: null # Provide a KenLM language model in .arpa format.
      min_number_of_words: 3 # Min number of words for the left context.
      max_number_of_words: 10 # Max number of words for the right context.
      logprob_diff_threshold: 1.2  # The threshold for the difference between two log probability values from two hypotheses.

### Training code
```python

from omegaconf import OmegaConf

import pytorch_lightning as pl

try:
    from ruamel.yaml import YAML
except ModuleNotFoundError:
    from ruamel_yaml import YAML

from nemo.core.config import hydra_runner
from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager
import nemo.collections.asr as nemo_asr

@hydra_runner(config_path="../../model_cards/", config_name='config')
def __main__(cfg):
    """
    Train a Speaker embedding model.
    Args:
        cfg: DictConfig contains model configurations.
    """

    logging.info(f"hydra config: {OmegaConf.to_yaml(cfg)}")

    trainer = pl.Trainer(**cfg.trainer)
    exp_manager(trainer, cfg.get('exp_manager', None))
    model = nemo_asr.models.label_models.EncDecSpeakerLabelModel(cfg=cfg)

    model.setup_training_data(cfg.model.train_ds)
    model.setup_validation_data(cfg.model.validation_ds)
    model.setup_test_data(cfg.model.test_ds)

    model.maybe_init_from_pretrained_checkpoint(cfg.init_from_pretrained_model)

    trainer.fit(model)

    if hasattr(cfg.model, 'test_ds') and cfg.model.test_ds.manifest_filepath is not None:
        if model.prepare_test(trainer):
            trainer.test(model)
```

Prediction: uses the ASR_with_SpeakerDiarization.ipynb notebook in tutorials/speaker_tasks.

Expected behavior

Environment overview (please complete the following information)

Environment details

nithinraok commented 1 year ago

Did you use VAD from ASR or native nemo VAD?

Klassikcat commented 1 year ago

@nithinraok I've used VAD from ASR(Conformer-CTC-BPE).

nithinraok commented 1 year ago

Can you switch to the native VAD, vad_multilingual_marblenet?

Or if you would like to use VAD from ASR ... you could try changing asr_based_vad_threshold to 1.0
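
As a rough illustration of why this threshold matters: the config comment describes it as a cap on the gap between two words when building VAD timestamps. The sketch below is a simplified stand-in for that idea, not NeMo's actual implementation; `words_to_speech_segments` is a hypothetical helper and the timestamps are taken from the output above.

```python
def words_to_speech_segments(word_ts, gap_threshold):
    """Merge consecutive [start, end] word timestamps whose gap is at most gap_threshold seconds."""
    segments = []
    for start, end in word_ts:
        if segments and start - segments[-1][1] <= gap_threshold:
            # Gap is small enough: extend the current speech segment.
            segments[-1][1] = end
        else:
            # Gap is too large (or this is the first word): start a new segment.
            segments.append([start, end])
    return segments


word_ts = [[1.8, 2.16], [2.32, 2.52], [2.92, 3.4], [4.16, 5.0]]
short_gap = words_to_speech_segments(word_ts, gap_threshold=0.05)
long_gap = words_to_speech_segments(word_ts, gap_threshold=1.0)
print(len(short_gap), len(long_gap))
```

In this sketch, a threshold of 0.05 leaves every word as its own short segment, while 1.0 merges them into one longer segment.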

Klassikcat commented 1 year ago

@nithinraok Thanks for your help, but neither of them worked.

However, I found something interesting in the speaker_outputs folder (the output directory for speaker embeddings):

In every subsegments_scale.json file, all labels were "UNK" and all uniq_id values were null.

Could this be related to the problem I'm facing?

Codes

1. Switching to the native VAD (vad_multilingual_marblenet)

import os

from omegaconf import OmegaConf

pretrained_vad = 'vad_multilingual_marblenet'
pretrained_speaker_model = os.path.join(os.getcwd(), 'nemo_experiments', 'TitaNet', '2022-10-17_19-48-25', 'checkpoints', 'TitaNet.nemo')

cfg = OmegaConf.load(os.path.join(os.getcwd(), 'model_cards', 'diarization', 'diar_infer_telephonic.yaml'))

cfg.num_workers = 1
cfg.diarizer.manifest_filepath = os.path.join(os.getcwd(), 'input_manifest.json')
cfg.diarizer.out_dir = 'data/' # Directory to store intermediate files and prediction outputs

cfg.diarizer.speaker_embeddings.model_path = pretrained_speaker_model
cfg.diarizer.oracle_vad = False # compute VAD provided with model_path to vad config
cfg.diarizer.clustering.parameters.oracle_num_speakers=False

#Here we use our inhouse pretrained NeMo VAD
cfg.diarizer.vad.model_path = pretrained_vad
cfg.diarizer.vad.parameters.onset = 0.8
cfg.diarizer.vad.parameters.offset = 0.6
cfg.diarizer.vad.parameters.pad_offset = -0.05

from nemo.collections.asr.models import ClusteringDiarizer
sd_model = ClusteringDiarizer(cfg=cfg)

sd_model.diarize()

2. asr_based_vad_threshold = 1.0

import os

from omegaconf import OmegaConf
cfg = OmegaConf.load(os.path.join(os.getcwd(), 'model_cards', 'diarization', 'diar_infer_telephonic.yaml'))
cfg.diarizer.manifest_filepath = os.path.join(os.getcwd(), 'input_manifest.json')
cfg.diarizer.speaker_embeddings.model_path = os.path.join(os.getcwd(), 'nemo_experiments', 'TitaNet', '2022-10-17_19-48-25', 'checkpoints', 'TitaNet.nemo')
cfg.diarizer.clustering.parameters.max_num_speakers = 8
cfg.diarizer.asr.model_path = os.path.join(os.getcwd(), 'checkpoints', 'conformer', 'Conformer-CTC-BPE.nemo')
cfg.diarizer.out_dir = os.getcwd()

cfg.diarizer.asr.asr_based_vad_threshold = 1.0

3. An excerpt from subsegment_scale4.json

{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 239.87, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.12, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.37, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.62, "duration": 0.5, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.87, "duration": 0.28999999999999204, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 242.14, "duration": 0.37999999999999545, "label": "UNK", "uniq_id": null}
{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 243.34, "duration": 0.30000000000001137, "label": "UNK", "uniq_id": null}
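
The excerpt can be compared against the prediction config: the 0.5 s durations and 0.25 s offset steps match the last entries of `window_length_in_sec` and `shift_length_in_sec`. Below is a minimal sketch of that check using the first four lines above; `summarize_subsegments` is a hypothetical helper, not a NeMo function.

```python
import json


def summarize_subsegments(lines):
    """Return the steps between consecutive offsets and the set of labels in a subsegment manifest."""
    entries = [json.loads(line) for line in lines if line.strip()]
    offsets = [e["offset"] for e in entries]
    # Round to two decimals to hide floating-point noise in the differences.
    steps = [round(b - a, 2) for a, b in zip(offsets, offsets[1:])]
    labels = {e["label"] for e in entries}
    return steps, labels


lines = [
    '{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 239.87, "duration": 0.5, "label": "UNK", "uniq_id": null}',
    '{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.12, "duration": 0.5, "label": "UNK", "uniq_id": null}',
    '{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.37, "duration": 0.5, "label": "UNK", "uniq_id": null}',
    '{"audio_filepath": "/home/insutil/codes/NeMo/data/sample.wav", "offset": 240.62, "duration": 0.5, "label": "UNK", "uniq_id": null}',
]
steps, labels = summarize_subsegments(lines)
print(steps, labels)
```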

Klassikcat commented 1 year ago

Train config:
    manifest_filepath: train_speakers.json
    sample_rate: 16000
    labels:
    - 8
    - 191
    - 236
    - 366
    - 423
    - 432
    - 575
    - 624
    - 776
    - 922
    - 1053
    - 1057
    - 1123
    - 1136
    - 1221
    - 1330
    - 1347
    - 1379
    - 1462
    - 1553
    - 1625
    - 1667
    - 1718
    - 1964
    - 1989
    - 2062
    - 2104
    - 2122
    - 2131
    - 2177
    - 2254
    - 2357
    - 2556
    - 2615
    - 2662
    - 2674
    - 2704
    - 2758
    - 2805
    - 2987
    - 3007
    - 3070
    - 3093
    - 3159
    - 3499
    - 3544
    - 3548
    - 3562
    - 3573
    - 3580
    - 3604
    - 3664
    - 3715
    - 3740
    - 3775
    - 3791
    - 3858
    - 3988
    - 4055
    - 9036
    batch_size: 32
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      speed:
        prob: 0.3
        sr: 16000
        resample_type: kaiser_fast
        min_speed_rate: 0.95
        max_speed_rate: 1.05

Validation config:
    manifest_filepath: eval_speakers.json
    sample_rate: 16000
    labels: null
    batch_size: 32
    shuffle: false

Klassikcat commented 1 year ago

It turns out that there were some problems with the weights or settings in my YAML. Using the NGC TitaNet with a fine-tuned Conformer works fine when cfg.diarizer.asr.parameters.asr_based_vad_threshold = 1.0, even for Korean. Thanks for your help, @nithinraok.

Sreeni1204 commented 1 year ago

Hello @Klassikcat @nithinraok, could you give me an update on how this issue was solved? I am facing the same issue with my speaker recognition model. I fine-tuned the TitaNet model, and with batch inference on multiple audio files in the test set the accuracy is around 95%, but the model returns a single label for every audio file when each one is tested separately using the get_label() function.

Klassikcat commented 1 year ago

@Sreeni1204

  1. Judging from the issue I faced before, yours looks like a YAML configuration issue. I think it is related to the loss and the VAD threshold. For the threshold, set it to 1.0. For the loss, there is a comment describing how to set the loss for speaker embeddings; the default setting is for speaker verification.

  2. If that doesn't work, try the titanet-large-en weights from NGC without fine-tuning. They worked for Korean speech with a fine-tuned Conformer, so they should work for Japanese, English, and other languages as well.