NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[Speaker diarization] Can't finish the inference #4793

Closed · triumph9989 closed this issue 2 years ago

triumph9989 commented 2 years ago

Describe the bug

Hi, I'm new to the NeMo system. I have no idea why my program stops at "Generating predictions with overlapping input segments".

[NeMo I 2022-08-23 15:08:22 features:200] PADDING: 16
[NeMo I 2022-08-23 15:08:23 label_models:100] loss is Angular Softmax
[NeMo I 2022-08-23 15:08:23 save_restore_connector:243] Model EncDecSpeakerLabelModel was successfully restored from /home/ec5017b/.cache/torch/NeMo/NeMo_1.10.0/titanet-l/492c0ab8416139171dc18c21879a9e45/titanet-l.nemo.
[NeMo I 2022-08-23 15:08:23 speaker_utils:82] Number of files to diarize: 1
[NeMo I 2022-08-23 15:08:23 clustering_diarizer:293] Split long audio file to avoid CUDA memory issue
100%|████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.32it/s]
[NeMo I 2022-08-23 15:08:24 vad_utils:89] The prepared manifest file exists. Overwriting!
[NeMo I 2022-08-23 15:08:24 classification_models:244] Perform streaming frame-level VAD
[NeMo I 2022-08-23 15:08:24 collections:289] Filtered duration for loading collection is 0.000000.
[NeMo I 2022-08-23 15:08:24 collections:293] # 4 files loaded accounting to # 1 labels
100%|████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.09it/s]
[NeMo I 2022-08-23 15:08:28 clustering_diarizer:242] Generating predictions with overlapping input segments
  0%|                                                                            | 0/1 [00:00<?, ?it/s]

Steps/Code to reproduce bug

python offline_diarization.py

my manifest.json:

{"audio_filepath": "/home/ec5017b/media-lab/nemo/NeMo/examples/speaker_tasks/diarization/data/voxconverse/voxconverse_test_wav/voxconverse_test_wav/aepyx.wav", "offset": 0, "duration": null, "label": "infer", "text": "-"}
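A minimal sketch of writing such a manifest from Python, in case it helps reproduction (the audio path and file name are placeholders, not the exact ones from this report):

```python
import json

# One manifest entry per audio file; the fields follow the manifest line
# example quoted at the bottom of the config below (extra fields such as
# "num_speakers" and "rttm_filepath" are optional for inference).
entry = {
    "audio_filepath": "/path/to/aepyx.wav",  # placeholder; use your own WAV
    "offset": 0,
    "duration": None,  # serialized as JSON null -> use the full file
    "label": "infer",
    "text": "-",
}

# NeMo manifests are JSON Lines: one JSON object per line, no surrounding list.
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")
```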

offline_diarization.yaml

name: &name "ClusterDiarizer"

num_workers: 4
sample_rate: 16000
batch_size: 64

diarizer:
  manifest_filepath: /home/ec5017b/media-lab/nemo/NeMo/examples/speaker_tasks/diarization/manifest.json
  out_dir: /home/ec5017b/media-lab/nemo/NeMo/examples/speaker_tasks/diarization/output
  oracle_vad: False # If True, uses RTTM files provided in manifest file to get speech activity (VAD) timestamps
  collar: 0.25 # Collar value for scoring
  ignore_overlap: True # Consider or ignore overlap segments while scoring

  vad:
    model_path: vad_marblenet # .nemo local model path or pretrained model name or none
    external_vad_manifest: null # This option is provided to use external vad and provide its speech activity labels for speaker embeddings extraction. Only one of model_path or external_vad_manifest should be set

    parameters: # Tuned parameters for CH109 (using the 11 multi-speaker sessions as dev set) 
      window_length_in_sec: 0.15  # Window length in sec for VAD context input 
      shift_length_in_sec: 0.01 # Shift length in sec for generating frame-level VAD predictions
      smoothing: "median" # False or type of smoothing method (eg: median)
      overlap: 0.875 # Overlap ratio for overlapped mean/median smoothing filter
      onset: 0.4 # Onset threshold for detecting the beginning of speech
      offset: 0.7 # Offset threshold for detecting the end of speech
      pad_onset: 0.05 # Duration added before each speech segment
      pad_offset: -0.1 # Duration added after each speech segment
      min_duration_on: 0.2 # Threshold for deleting short non-speech segments
      min_duration_off: 0.2 # Threshold for deleting short speech segments
      filter_speech_first: True 

  speaker_embeddings:
    model_path: titanet_large # .nemo local model path or pretrained model name (titanet_large, ecapa_tdnn or speakerverification_speakernet)
    parameters:
      window_length_in_sec: 1.5 # Window length(s) in sec (floating-point number). Either a number or a list. Ex) 1.5 or [1.5,1.0,0.5]
      shift_length_in_sec: 0.75 # Shift length(s) in sec (floating-point number). Either a number or a list. Ex) 0.75 or [0.75,0.5,0.25]
      multiscale_weights: null # Weights for each scale. Should be null (for single scale) or a list matching the window/shift scale count. Ex) [0.33,0.33,0.33]
      save_embeddings: False # Save embeddings as pickle file for each audio input.

  clustering:
    parameters:
      oracle_num_speakers: False # If True, use num of speakers value provided in manifest file.
      max_num_speakers: 20 # Max number of speakers for each recording. If oracle num speakers is passed, this value is ignored.
      enhanced_count_thres: 80 # If the number of segments is lower than this number, enhanced speaker counting is activated.
      max_rp_threshold: 0.25 # Determines the range of p-value search: 0 < p <= max_rp_threshold. 
      sparse_search_volume: 30 # The higher the number, the more p-values are examined, at the cost of longer runtime.
      maj_vote_spk_count: False  # If True, take a majority vote on multiple p-values to estimate the number of speakers.

# json manifest line example
# {"audio_filepath": "/path/to/audio_file", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": null, "rttm_filepath": "/path/to/rttm/file", "uem_filepath": "/path/to/uem/filepath"}

Expected behavior

I expect the inference to finish instead of hanging at 0%.

Environment overview

Environment details

Additional context

I have not set up Megatron GPT & numba. Also, I only use aepyx.wav from the VoxConverse test set.

alamnasim commented 2 years ago

Please refer to this issue: https://github.com/NVIDIA/NeMo/issues/4157. I also faced this issue; it is a VAD model issue. If you use VAD_Telephony_marblenet.nemo, it will work without any error.

When I run from the base environment (no virtual environment) on some machines, it runs well even with marblenet only.

You can run the notebook tutorial; it will work: https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb
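Applying the suggested model swap programmatically is a one-line config change; a minimal sketch (vad_telephony_marblenet is assumed here as the pretrained-model name corresponding to the .nemo file mentioned above):

```python
from omegaconf import OmegaConf

config = OmegaConf.load("offline_diarization.yaml")
# Swap in the telephony VAD model suggested above; NeMo downloads
# pretrained models by name, so no local .nemo file is required.
config.diarizer.vad.model_path = "vad_telephony_marblenet"
```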

triumph9989 commented 2 years ago

@alamnasim Yes, I replaced the VAD model with Telephony_marblenet and it works for me. I really appreciate your help :)

nithinraok commented 2 years ago

@fayejf FYI

skanda1005 commented 2 years ago

I used VAD_Telephony_marblenet, but it still gets stuck at generating predictions. Any solutions?

nithinraok commented 2 years ago

As a temporary fix, change num_workers to 1.
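With the config quoted earlier in this issue, that workaround looks like this; a minimal sketch, assuming the root-level num_workers key from the config above:

```python
from omegaconf import OmegaConf

from nemo.collections.asr.models import ClusteringDiarizer

config = OmegaConf.load("offline_diarization.yaml")
# Workaround for the hang: multi-worker dataloading can deadlock in some
# environments, so fall back to loading data in the main process.
config.num_workers = 1

ClusteringDiarizer(cfg=config).diarize()
```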

Anna-Pinewood commented 9 months ago

I tried running it on Ubuntu 22.04.3 LTS with 8 CPU cores. When I changed config.num_workers = 8 to config.num_workers = 1, the freeze went away and it worked. I did not change the VAD model; it still works with pretrained_vad = 'vad_multilingual_marblenet'.