MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Is diarization not supported for Korean? #200

Closed. Dhyungsuk closed this issue 4 months ago

Dhyungsuk commented 4 months ago

Hi, thank you so much for working on and sharing such a good project. Could you please help me use it?

This is the error I ran into when I ran the command below.

> python diarize_parallel.py -a audio.mp4 --whisper-model large-v3 --language ko --batch-size 16

/root/.cache/pypoetry/virtualenvs/whisper-diarization-u5dY2iB5-py3.10/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v1.9.4. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.2+cu121. Bad things might happen unless you revert torch to 1.x.
[NeMo W 2024-07-05 15:16:14 nemo_logging:349] /root/.cache/pypoetry/virtualenvs/whisper-diarization-u5dY2iB5-py3.10/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
      torchaudio.set_audio_backend("soundfile")

torchvision is not available - cannot save figures
[NeMo I 2024-07-05 15:16:15 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-07-05 15:16:15 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-07-05 15:16:15 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-07-05 15:16:15 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-07-05 15:16:16 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2024-07-05 15:16:16 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2024-07-05 15:16:16 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2024-07-05 15:16:16 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:16 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:16 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-07-05 15:16:16 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:17 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-07-05 15:16:17 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-07-05 15:16:17 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-07-05 15:16:17 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-07-05 15:16:17 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2024-07-05 15:16:17 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2024-07-05 15:16:17 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2024-07-05 15:16:17 features:289] PADDING: 16
[NeMo I 2024-07-05 15:16:17 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-07-05 15:16:17 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-07-05 15:16:17 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo I 2024-07-05 15:16:17 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-07-05 15:16:17 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.21it/s]
[NeMo I 2024-07-05 15:16:18 classification_models:272] Perform streaming frame-level VAD
[NeMo I 2024-07-05 15:16:18 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-07-05 15:16:18 collections:302] Dataset loaded with 6 items, total duration of  0.08 hours.
[NeMo I 2024-07-05 15:16:18 collections:304] # 6 files loaded accounting to # 1 labels
vad: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  6.77it/s]
[NeMo I 2024-07-05 15:16:19 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.                                                                                                                                                                
creating speech segments: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.86it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 125 items, total duration of  0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 125 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 10.00it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 128 items, total duration of  0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 128 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.58it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 141 items, total duration of  0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 141 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 18.06it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 157 items, total duration of  0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 157 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 19.14it/s]
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-07-05 15:16:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-07-05 15:16:20 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2024-07-05 15:16:20 collections:302] Dataset loaded with 209 items, total duration of  0.02 hours.
[NeMo I 2024-07-05 15:16:20 collections:304] # 209 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 23.91it/s]
[NeMo I 2024-07-05 15:16:21 clustering_diarizer:389] Saved embedding files to /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.77it/s]
[NeMo I 2024-07-05 15:16:21 clustering_diarizer:464] Outputs are saved in /code/ihyungsuk/whisper-diarization/temp_outputs directory
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:0 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:1 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:2 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:3 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:960] Loading embedding pickle file of scale:4 at /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-07-05 15:16:21 msdd_models:938] Loading cluster label file from /code/ihyungsuk/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2024-07-05 15:16:21 collections:617] Filtered duration for loading collection is 0.000000.
[NeMo I 2024-07-05 15:16:21 collections:620] Total 3 session files loaded accounting to # 3 audio clips
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.72it/s]
[NeMo I 2024-07-05 15:16:21 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-07-05 15:16:21 msdd_models:1431]   

WARNING:root:Punctuation restoration is not available for ko language. Using the original punctuation.

MahmoudAshraf97 commented 4 months ago

Hi, the program finished successfully. I see no errors in the output; most of the messages are warnings that you can ignore.
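For anyone who wants to check a long log like this one mechanically, here is a minimal sketch that sorts console lines into info, warning, and error. It relies on the `[NeMo I ...]` and `[NeMo W ...]` prefixes and the `WARNING:root:` prefix visible in the log above. The `[NeMo E ...]` prefix for errors is an assumption by analogy, and the function names `classify_line` and `summarize` are hypothetical helpers, not part of this project or of NeMo.

```python
import re
from collections import Counter

# Matches NeMo prefixes such as "[NeMo W 2024-07-05 15:16:14 nemo_logging:349]".
# The single letter after "NeMo" is treated as the severity level.
NEMO_PREFIX = re.compile(r"^\[NeMo ([A-Z]) ")

# "I" and "W" appear in the log above; "E" for errors is an assumption.
NEMO_LEVELS = {"I": "info", "W": "warning", "E": "error"}


def classify_line(line: str) -> str:
    """Return a coarse severity label for one line of console output."""
    match = NEMO_PREFIX.match(line)
    if match:
        return NEMO_LEVELS.get(match.group(1), "other")
    # An uncaught Python exception starts with this exact header line.
    if line.startswith("Traceback (most recent call last):"):
        return "error"
    # Standard logging output ("WARNING:root:...") and warnings-module output.
    if line.startswith("WARNING:") or "UserWarning" in line:
        return "warning"
    return "other"


def summarize(log_text: str) -> Counter:
    """Count how many lines fall into each severity label."""
    return Counter(classify_line(line) for line in log_text.splitlines())


if __name__ == "__main__":
    sample = "\n".join([
        "[NeMo I 2024-07-05 15:16:15 common:913] Instantiating model from pre-trained checkpoint",
        "[NeMo W 2024-07-05 15:16:21 der:185] Check if each ground truth RTTMs were present in the provided manifest file.",
        "WARNING:root:Punctuation restoration is not available for ko language. Using the original punctuation.",
    ])
    counts = summarize(sample)
    print(dict(counts))
    print("errors found:", counts["error"])
```

If the error count is zero and no traceback appears, the run most likely completed, which matches what the log above shows. This is only a rough filter: a library that reports failures in some other format would slip past it.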

Dhyungsuk commented 4 months ago

Thank you so much!