NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[Speaker Diarization] Confusion Error becomes worse after fine-tuning titanet-large with Chinese data #5269

Closed triumph9989 closed 1 year ago

triumph9989 commented 2 years ago

Hi, I used the pretrained titanet-large on one audio file of Alimeeting_Eval_far (Chinese), and its Confusion Error Rate (CER) was 0.0283. So I thought that fine-tuning titanet-large with Chinese data (CN-Celeb2) would give a better result. However, the first fine-tuned model's CER is 0.1473 (learning rate: 1e-4), and the second one's is 0.2120 (learning rate: 1e-6). Do you have any suggestions?

titanet-finetune.yaml

name: &name "TitaNet-Finetune-cn2-300-30ep-lr1-e5"
sample_rate: &sample_rate 16000

init_from_pretrained_model:
  speaker_tasks:
    name: 'titanet_large'
    include: ["preprocessor","encoder"]
    exclude: ["decoder.final"] # Add specific layer names here to exlude or just ["decoder"] if to exclude all of decoder pretrained weights

model:
  train_ds:
    manifest_filepath: /home/ec5017b/media-lab/nemo/NeMo/examples/speaker_tasks/recognition/data/train.json
    sample_rate: 16000
    labels: null
    batch_size: 64
    shuffle: True
    is_tarred: False
    tarred_audio_filepaths: null
    tarred_shard_strategy: "scatter"
    augmentor:
      speed:
        prob: 0.3
        sr: *sample_rate
        resample_type: 'kaiser_fast'
        min_speed_rate: 0.95
        max_speed_rate: 1.05

  validation_ds:
    manifest_filepath: /home/ec5017b/media-lab/nemo/NeMo/examples/speaker_tasks/recognition/data/dev.json
    sample_rate: 16000
    labels: null
    batch_size: 128
    shuffle: False

  model_defaults:
    filters: 1024
    repeat: 3
    dropout: 0.1
    separable: true
    se: true
    se_context_size: -1
    kernel_size_factor: 1.0

  preprocessor:
    _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor
    normalize: "per_feature"
    window_size: 0.025
    sample_rate: *sample_rate
    window_stride: 0.01
    window: "hann"
    features: &n_mels 80
    n_fft: 512
    frame_splicing: 1
    dither: 0.00001

  encoder:
    _target_: nemo.collections.asr.modules.ConvASREncoder
    feat_in: *n_mels
    activation: relu
    conv_mask: true

    jasper:
      -   filters: ${model.model_defaults.filters}
          repeat: 1
          kernel: [3]
          stride: [1]
          dilation: [1]
          dropout: 0.0
          residual: false
          separable: ${model.model_defaults.separable}
          se: ${model.model_defaults.se}
          se_context_size: ${model.model_defaults.se_context_size}

      -   filters: ${model.model_defaults.filters}
          repeat:  ${model.model_defaults.repeat}
          kernel: [7]
          stride: [1]
          dilation: [1]
          dropout: ${model.model_defaults.dropout}
          residual: true
          separable: ${model.model_defaults.separable}
          se: ${model.model_defaults.se}
          se_context_size: ${model.model_defaults.se_context_size}

      -   filters: ${model.model_defaults.filters}
          repeat: ${model.model_defaults.repeat}
          kernel: [11]
          stride: [1]
          dilation: [1]
          dropout: ${model.model_defaults.dropout}
          residual: true
          separable: ${model.model_defaults.separable}
          se: ${model.model_defaults.se}
          se_context_size: ${model.model_defaults.se_context_size}

      -   filters: ${model.model_defaults.filters}
          repeat: ${model.model_defaults.repeat}
          kernel: [15]
          stride: [1]
          dilation: [1]
          dropout: ${model.model_defaults.dropout}
          residual: true
          separable: ${model.model_defaults.separable}
          se: ${model.model_defaults.se}
          se_context_size: ${model.model_defaults.se_context_size}

      -   filters: &enc_feat_out 3072
          repeat: 1
          kernel: [1]
          stride: [1]
          dilation: [1]
          dropout: 0.0
          residual: false
          separable: ${model.model_defaults.separable}
          se: ${model.model_defaults.se}
          se_context_size: ${model.model_defaults.se_context_size}

  decoder:
    _target_: nemo.collections.asr.modules.SpeakerDecoder
    feat_in: *enc_feat_out
    num_classes: 299
    pool_mode: 'attention'
    emb_sizes: 192
    angular: True

  loss:
    scale: 30
    margin: 0.2

  optim:
    name: adamw
    lr: 0.0001 #(original titanet-large was trained with 0.08 lr) 
    weight_decay: 0.0002

    # scheduler setup
    sched:
      name: CosineAnnealing
      warmup_ratio: 0.1
      min_lr: 0.0

trainer:
  devices: 1 # number of gpus (original titanet-large was trained on 4 nodes with 8 gpus each)
  max_epochs: 30
  max_steps: -1 # computed at runtime if not set
  num_nodes: 1
  accelerator: gpu
  strategy: ddp
  deterministic: True
  enable_checkpointing: False
  logger: False
  log_every_n_steps: 1  # Interval of logging.
  val_check_interval: 1.0  # Set to 0.25 to check 4 times per epoch, or an int for number of iterations

exp_manager:
  exp_dir: null
  name: *name
  create_tensorboard_logger: True
  create_checkpoint_callback: True
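
Below is a minimal launch sketch for the config above, assuming the standard NeMo fine-tuning flow (EncDecSpeakerLabelModel plus maybe_init_from_pretrained_checkpoint, which is roughly what the speaker recognition fine-tuning example script does); the config filename and paths are placeholders to adjust for your setup.

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.asr.models import EncDecSpeakerLabelModel
from nemo.utils.exp_manager import exp_manager

# Load the fine-tuning config shown above (filename is an assumption).
cfg = OmegaConf.load("titanet-finetune.yaml")

trainer = pl.Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))

# Build the model from cfg.model, then pull in the pretrained preprocessor/encoder
# weights according to the init_from_pretrained_model section (decoder.final excluded).
model = EncDecSpeakerLabelModel(cfg=cfg.model, trainer=trainer)
model.maybe_init_from_pretrained_checkpoint(cfg)

trainer.fit(model)

The learning rates mentioned above (1e-4 vs. 1e-6) can be swept by editing model.optim.lr in the YAML, or by overriding it on the command line when the Hydra-based example script is used.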

GPU: GV100GL [Tesla V100 PCIe 16GB] (rev a1)

tango4j commented 1 year ago

Hi, @triumph9989. Thank you for the feedback. Please make sure you do not apply the "bug" label to this type of topic, since the software is not showing any signs of malfunctioning. NeMo contributors need to distinguish actual bugs from topics that are open for discussion.

Can you also share the data preparation steps? Especially the segment lengths and the size of the fine-tuning data.

Also, please share your evaluation setup. The default is collar=0.25, ignore_overlap=True.
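
For context on what those two settings do, here is an illustrative scoring snippet with pyannote.metrics, which NeMo's DER scoring builds on; the segment times are made up, and the exact collar convention passed to pyannote is an assumption worth checking against your NeMo version.

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy reference and hypothesis annotations (made-up times).
reference = Annotation()
reference[Segment(0.0, 5.0)] = "spk_A"
reference[Segment(5.0, 9.0)] = "spk_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 5.2)] = "1"
hypothesis[Segment(5.2, 9.0)] = "2"

# collar is a no-score window around reference boundaries; skip_overlap drops
# overlapped speech from scoring (the ignore_overlap=True case).
metric = DiarizationErrorRate(collar=0.5, skip_overlap=True)
components = metric(reference, hypothesis, detailed=True)
print(components["diarization error rate"], components["confusion"])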

triumph9989 commented 1 year ago

@tango4j Sorry about not changing the label, and thanks for looking into my issue. The following are my checkpoints:

'TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1640-epoch=9-last.ckpt'  
'TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1668-epoch=6.ckpt'
'TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1640-epoch=9.ckpt'        
TitaNet-Finetune-encoder-cn2-300-adjust.nemo
'TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1657-epoch=8.ckpt'

I used TitaNet-Finetune-encoder-cn2-300-adjust.nemo and it got the worse Confusion Error Rate. Afterwards, I used 'TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1640-epoch=9.ckpt' and the Confusion Error Rate was better than with the pretrained titanet-large, but I don't know why.
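
One possible source of the difference is simply which weights each artifact holds: the .nemo file is a packaged config-plus-weights archive written by exp_manager (exactly which epoch's weights it contains depends on the checkpoint-callback settings), while each .ckpt is a Lightning snapshot from a specific epoch. A rough sketch for loading both side by side (paths taken from the list above); re-exporting the chosen checkpoint with save_to also makes it easy to pass to the diarizer's model_path, which expects a .nemo file or a pretrained model name.

from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Packaged archive produced by exp_manager (which epoch's weights it holds
# depends on the checkpointing settings used during training).
model_nemo = EncDecSpeakerLabelModel.restore_from(
    "TitaNet-Finetune-encoder-cn2-300-adjust.nemo"
)

# Lightning checkpoint from epoch 9 (best val_loss in the list above).
model_ckpt = EncDecSpeakerLabelModel.load_from_checkpoint(
    "TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1640-epoch=9.ckpt"
)

# Re-export the epoch-9 weights as .nemo so the diarizer can load them via model_path
# (hypothetical output filename).
model_ckpt.save_to("titanet_finetuned_epoch9.nemo")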

offline_diarization_with_asr.yaml

name: &name "ClusterDiarizer"

num_workers: 4
sample_rate: 16000
batch_size: 64

diarizer:
  manifest_filepath: /workspace/nemo/examples/speaker_tasks/diarization/manifest_Ali_far.json #manifest_Ali_far.json
  out_dir: output_SD_ASR_VAD_TH_100_ali_far_trimmed_test
  oracle_vad: False # If True, uses RTTM files provided in the manifest file to get speech activity (VAD) timestamps
  collar: 0.25 # Collar value for scoring
  ignore_overlap: True # Consider or ignore overlap segments while scoring

  vad:
    model_path: null # .nemo local model path or pretrained model name or none
    external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. Only one of model_path or external_vad_manifest should be set

    parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) 
      window_length_in_sec: 0.15  # Window length in sec for VAD context input 
      shift_length_in_sec: 0.01 # Shift length in sec for generate frame level VAD prediction
      smoothing: "median" # False or type of smoothing method (eg: median)
      overlap: 0.875 # Overlap ratio for overlapped mean/median smoothing filter
      onset: 0.4 # Onset threshold for detecting the beginning and end of a speech 
      offset: 0.7 # Offset threshold for detecting the end of a speech
      pad_onset: 0.05 # Adding durations before each speech segment 
      pad_offset: -0.1 # Adding durations after each speech segment 
      min_duration_on: 0.2 # Threshold for small non_speech deletion
      min_duration_off: 0.2 # Threshold for short speech segment deletion
      filter_speech_first: True

  speaker_embeddings:
    model_path: titanet_large # .nemo local model path or pretrained model name (titanet_large, ecapa_tdnn or speakerverification_speakernet) TitaNet-Finetune-cn2-300-30ep-lr1-e5.nemo, /workspace/nemo/examples/speaker_tasks/recognition/exp/TitaNet-Finetune-encoder-cn2-300-adjust/2022-10-28_05-48-05/checkpoints/TitaNet-Finetune-encoder-cn2-300-adjust--val_loss=0.1640-epoch=9-last.ckpt
    parameters:
      window_length_in_sec: 1.5 # Window length(s) in sec (floating-point number). Either a number or a list. Ex) 1.5 or [1.5,1.0,0.5]
      shift_length_in_sec: 0.75 # Shift length(s) in sec (floating-point number). Either a number or a list. Ex) 0.75 or [0.75,0.5,0.25]
      multiscale_weights: null # Weight for each scale. should be null (for single scale) or a list matched with window/shift scale count. Ex) [0.33,0.33,0.33]
      save_embeddings: False # Save speaker embeddings in pickle format.

  clustering:
    parameters:
      oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
      max_num_speakers: 20 # Max number of speakers for each recording. If an oracle number of speakers is passed, this value is ignored.
      enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
      max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. 0.25 0.10
      sparse_search_volume: 30 # The higher the number, the more values will be examined with more time. 

  asr:
    model_path: stt_en_conformer_ctc_large # Provide NGC cloud ASR model name. stt_en_conformer_ctc_* models are recommended for diarization purposes.
    parameters:
      asr_based_vad: True # if True, speech segmentation for diarization is based on word-timestamps from ASR inference.
      asr_based_vad_threshold: 100 # Threshold (in sec) that caps the gap between two words when generating VAD timestamps using ASR based VAD.
      asr_batch_size: null # Batch size can be dependent on each ASR model. Default batch sizes are applied if set to null.
      lenient_overlap_WDER: True # If True, when a word falls into speaker-overlapped regions, consider the word as a correctly diarized word.
      decoder_delay_in_sec: null # Native decoder delay. null is recommended to use the default values for each ASR model.
      word_ts_anchor_offset: null # Offset to set a reference point from the start of the word. Recommended range of values is [-0.05  0.2]. null
      word_ts_anchor_pos: "start" # Select which part of the word timestamp we want to use. The options are: 'start', 'end', 'mid'.
      fix_word_ts_with_VAD: False # Fix the word timestamp using VAD output. You must provide a VAD model to use this feature.
      colored_text: False # If True, use colored text to distinguish speakers in the output transcript.
      print_time: True # If True, the start and end time of each speaker turn is printed in the output transcript.
      break_lines: False # If True, the output transcript breaks the line to fix the line width (default is 90 chars)

    ctc_decoder_parameters: # Optional beam search decoder (pyctcdecode)
      pretrained_language_model: null # KenLM model file: .arpa model file or .bin binary file.
      beam_width: 32
      alpha: 0.5
      beta: 2.5

    realigning_lm_parameters: # Experimental feature
      arpa_language_model: null # Provide a KenLM language model in .arpa format.
      min_number_of_words: 3 # Min number of words for the left context.
      max_number_of_words: 10 # Max number of words for the right context.
      logprob_diff_threshold: 1.2  # The threshold for the difference between two log probability values from two hypotheses.

# json manifest line example 
# {"audio_filepath": "/path/to/audio_file", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/file", "uem_filepathh
": "/path/to/uem/file"}

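As a sanity check on the embedding model alone, the clustering part of this config can be driven directly from Python with ClusteringDiarizer; the ASR-coupled pipeline in this YAML is normally launched through the diarization-with-ASR example script, so treat the sketch below as an approximation. The .nemo path is the hypothetical re-export from the earlier comment, and oracle VAD is switched on (using the manifest's RTTM files) because this config otherwise relies on ASR-based VAD.

from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

cfg = OmegaConf.load("offline_diarization_with_asr.yaml")

# Use the fine-tuned embeddings instead of the pretrained titanet_large.
cfg.diarizer.speaker_embeddings.model_path = "titanet_finetuned_epoch9.nemo"
# This YAML relies on ASR-based VAD; for a pure clustering run, fall back to the
# reference RTTMs in the manifest as oracle speech activity.
cfg.diarizer.oracle_vad = True

diarizer = ClusteringDiarizer(cfg=cfg)
diarizer.diarize()  # writes RTTM predictions under out_dir and scores them when reference RTTMs are given
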
Data preparation: the script below segments all of the original audio into 3-second chunks (--create_segments).

import os
NEMO_ROOT = os.getcwd()
print(NEMO_ROOT)
import glob
import subprocess
import tarfile
import wget

# data_dir = os.path.join('/home/ec5017b/media-lab/nemo/NeMo/examples/speaker_tasks/diarization/','data')
data_dir = '/workspace/nemo/examples/speaker_tasks/recognition/data'
nemo_root = '/workspace/nemo'
wav_dir = '/CN-celeb2-300/data/'
dest_dir = '/CN-celeb2-300/'

# Collect every CN-Celeb2 wav file into a filelist.
os.system('find {}{} -iname "*.wav" > {}{}train_all.txt'.format(data_dir, wav_dir, data_dir, dest_dir))
# Build the manifest: cut each file into 3 s segments (--create_segments), derive the
# speaker label from the path (--id -2), and split it into train/dev manifests (--split).
os.system('python {}/scripts/speaker_tasks/filelist_to_manifest.py --create_segments --filelist {}{}train_all.txt --id -2 --out {}{}all_manifest.json --split'.format(nemo_root, data_dir, dest_dir, data_dir, dest_dir))
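
A quick, illustrative follow-up check on the manifest this script produces (field names assume the usual filelist_to_manifest.py output of audio_filepath/duration/label): count the speakers and confirm that segments really come out near 3 seconds, since segment length and the amount of data per speaker are exactly what was asked about above.

import json
from collections import Counter

# Same path construction as in the script above.
manifest_path = data_dir + dest_dir + 'all_manifest.json'

durations, labels = [], Counter()
with open(manifest_path) as f:
    for line in f:
        entry = json.loads(line)
        durations.append(entry.get('duration') or 0.0)
        labels[entry['label']] += 1

print('{} segments, {} speakers'.format(len(durations), len(labels)))
print('min/mean/max duration: {:.2f} / {:.2f} / {:.2f} s'.format(
    min(durations), sum(durations) / len(durations), max(durations)))
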
github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.