Hi, thank you so much for working on and sharing such a good project. Could you please help me use it?
This is the error I experienced:
> python diarize_parallel.py -a audio.mp4 --whisper-model large-v3 --language ko --batch-size 16
/root/.cache/pypoetry/virtualenvs/whisper-diarization-u5dY2iB5-py3.10/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v1.9.4. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.2+cu121. Bad things might happen unless you revert torch to 1.x.
[NeMo W 2024-07-05 15:16:14 nemo_logging:349] /root/.cache/pypoetry/virtualenvs/whisper-diarization-u5dY2iB5-py3.10/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
[NeMo I 2024-07-05 15:16:15 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-07-05 15:16:15 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-07-05 15:16:15 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-07-05 15:16:15 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-07-05 15:16:16 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true
[NeMo W 2024-07-05 15:16:16 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
[NeMo W 2024-07-05 15:16:16 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false
[NeMo I 2024-07-05 15:16:16 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:16 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:16 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-07-05 15:16:16 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:17 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-07-05 15:16:17 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-07-05 15:16:17 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-07-05 15:16:17 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-07-05 15:16:17 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: true
is_tarred: false
tarred_audio_filepaths: null
tarred_shard_strategy: scatter
augmentor:
shift:
prob: 0.5
min_shift_ms: -10.0
max_shift_ms: 10.0
white_noise:
prob: 0.5
min_level: -90
max_level: -46
norm: true
noise:
prob: 0.5
manifest_path: /manifests/noise_0_1_musan_fs.json
min_snr_db: 0
max_snr_db: 30
max_gain_db: 300.0
norm: true
gain:
prob: 0.5
min_gain_dbfs: -10.0
max_gain_dbfs: 10.0
norm: true
num_workers: 16
pin_memory: true
[NeMo W 2024-07-05 15:16:17 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:
- background
- speech
batch_size: 256
shuffle: false
val_loss_idx: 0
num_workers: 16
pin_memory: true
[NeMo W 2024-07-05 15:16:17 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:
- background
- speech
batch_size: 128
shuffle: false
test_loss_idx: 0
[NeMo I 2024-07-05 15:16:17 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:17 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-07-05 15:16:17 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-07-05 15:16:17 msdd_models:865] Clustering Parameters: {
"oracle_num_speakers": false,
"max_num_speakers": 8,
"enhanced_count_thres": 80,
"max_rp_threshold": 0.25,
"sparse_search_volume": 30,
"maj_vote_spk_count": false,
"chunk_cluster_count": 50,
"embeddings_per_chunk": 10000
}
[NeMo I 2024-07-05 15:16:17 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-07-05 15:16:17 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.21it/s]
[NeMo I 2024-07-05 15:16:18 classification_models:272] Perform streaming frame-level VAD
[NeMo I 2024-07-05 15:16:18 collections:301] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-07-05 15:16:18 collections:302] Dataset loaded with 6 items, total duration of 0.08 hours.
[NeMo I 2024-07-05 15:16:18 collections:304] # 6 files loaded accounting to # 1 labels
vad: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6.77it/s]
[NeMo I 2024-07-05 15:16:19 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.86it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 125 items, total duration of 0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 125 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.00it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 128 items, total duration of 0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 128 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.58it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 141 items, total duration of 0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 141 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 18.06it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 157 items, total duration of 0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 157 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 19.14it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 209 items, total duration of 0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 209 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.91it/s]
[NeMo I 2024-07-05 15:16:21 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.77it/s]
[NeMo I 2024-07-05 15:16:21 clustering_diarizer:464] Outputs are saved in /code/ihyungsuk/whisper-diarization/temp_outputs directory
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:0 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:1 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:2 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:3 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:4 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:938] Loading cluster label file from /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2024-07-05 15:16:21 collections:617] Filtered duration for loading collection is 0.000000.
[NeMo I 2024-07-05 15:16:21 collections:620] Total 3 session files loaded accounting to # 3 audio clips
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.72it/s]
[NeMo I 2024-07-05 15:16:21 msdd_models:1403] [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 msdd_models:1431]
WARNING:root:Punctuation restoration is not available for ko language. Using the original punctuation.