MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

msdd_model.diarize() RuntimeError: shape '[138, 50, 16, 192]' is invalid for input of size 84787200 #130

Open Ko8rah opened 11 months ago

Ko8rah commented 11 months ago

Hello,

I have an issue while running the notebook with the msdd_model.diarize() method:

[NeMo I 2023-11-24 10:01:36 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-11-24 10:01:36 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-11-24 10:01:36 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2023-11-24 10:01:36 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-11-24 10:01:38 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-11-24 10:01:38 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-11-24 10:01:38 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-11-24 10:01:38 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:38 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:39 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-11-24 10:01:39 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:40 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-11-24 10:01:40 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-11-24 10:01:40 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2023-11-24 10:01:40 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-11-24 10:01:40 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-11-24 10:01:40 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-11-24 10:01:40 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-11-24 10:01:40 features:289] PADDING: 16
[NeMo I 2023-11-24 10:01:40 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.20.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-11-24 10:01:40 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1, 1]
[NeMo I 2023-11-24 10:01:40 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false,
        "chunk_cluster_count": 50,
        "embeddings_per_chunk": 10000
    }
[NeMo W 2023-11-24 10:01:40 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-11-24 10:01:40 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-11-24 10:01:40 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|██████████| 1/1 [00:00<00:00,  1.88it/s][NeMo I 2023-11-24 10:01:41 vad_utils:107] The prepared manifest file exists. Overwriting!
[NeMo I 2023-11-24 10:01:41 classification_models:272] Perform streaming frame-level VAD
[NeMo I 2023-11-24 10:01:41 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:41 collections:302] Dataset loaded with 12 items, total duration of  0.16 hours.
[NeMo I 2023-11-24 10:01:41 collections:304] # 12 files loaded accounting to # 1 labels

vad: 100%|██████████| 12/12 [00:04<00:00,  2.46it/s][NeMo I 2023-11-24 10:01:46 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.

creating speech segments: 100%|██████████| 1/1 [00:00<00:00,  2.43it/s][NeMo I 2023-11-24 10:01:46 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2023-11-24 10:01:46 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:46 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:46 collections:302] Dataset loaded with 381 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:46 collections:304] # 381 files loaded accounting to # 1 labels

[1/6] extract embeddings: 100%|██████████| 6/6 [00:01<00:00,  3.13it/s][NeMo I 2023-11-24 10:01:48 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:48 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2023-11-24 10:01:48 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:48 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:48 collections:302] Dataset loaded with 457 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:48 collections:304] # 457 files loaded accounting to # 1 labels

[2/6] extract embeddings: 100%|██████████| 8/8 [00:02<00:00,  3.86it/s][NeMo I 2023-11-24 10:01:50 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:50 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2023-11-24 10:01:50 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:50 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:50 collections:302] Dataset loaded with 574 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:50 collections:304] # 574 files loaded accounting to # 1 labels

[3/6] extract embeddings: 100%|██████████| 9/9 [00:02<00:00,  3.30it/s][NeMo I 2023-11-24 10:01:53 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:53 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2023-11-24 10:01:53 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:53 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:53 collections:302] Dataset loaded with 764 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:53 collections:304] # 764 files loaded accounting to # 1 labels

[4/6] extract embeddings: 100%|██████████| 12/12 [00:03<00:00,  3.92it/s][NeMo I 2023-11-24 10:01:56 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:01:56 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2023-11-24 10:01:56 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:01:56 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:01:56 collections:302] Dataset loaded with 1148 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:01:56 collections:304] # 1148 files loaded accounting to # 1 labels

[5/6] extract embeddings: 100%|██████████| 18/18 [00:03<00:00,  4.69it/s][NeMo I 2023-11-24 10:02:00 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-11-24 10:02:00 clustering_diarizer:287] Subsegmentation for embedding extraction: scale5, /content/temp_outputs/speaker_outputs/subsegments_scale5.json
[NeMo I 2023-11-24 10:02:00 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-11-24 10:02:00 collections:301] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-11-24 10:02:00 collections:302] Dataset loaded with 2296 items, total duration of  0.32 hours.
[NeMo I 2023-11-24 10:02:00 collections:304] # 2296 files loaded accounting to # 1 labels

[6/6] extract embeddings: 100%|██████████| 36/36 [00:05<00:00,  6.98it/s]
[NeMo I 2023-11-24 10:02:06 clustering_diarizer:389] Saved embedding files to /content/temp_outputs/speaker_outputs/embeddings
clustering: 100%|██████████| 1/1 [00:00<00:00,  1.37it/s][NeMo I 2023-11-24 10:02:06 clustering_diarizer:464] Outputs are saved in /content/temp_outputs directory

[NeMo W 2023-11-24 10:02:06 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:0 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:1 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:2 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:3 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:4 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:960] Loading embedding pickle file of scale:5 at /content/temp_outputs/speaker_outputs/embeddings/subsegments_scale5_embeddings.pkl
[NeMo I 2023-11-24 10:02:07 msdd_models:938] Loading cluster label file from /content/temp_outputs/speaker_outputs/subsegments_scale5_cluster.label
[NeMo I 2023-11-24 10:02:07 collections:617] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-11-24 10:02:07 collections:620] Total 3 session files loaded accounting to # 3 audio clips
  0%|          | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-13-8cafa8c83657> in <cell line: 3>()
      1 # Initialize NeMo MSDD diarization model
      2 msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cuda")
----> 3 msdd_model.diarize()
      4 
      5 del msdd_model

12 frames
/usr/local/lib/python3.10/dist-packages/nemo/collections/asr/modules/msdd_diarizer.py in conv_forward(self, conv_input, conv_module, bn_module, first_layer)
    417         conv_out = conv_module(conv_input)
    418         conv_out = conv_out.permute(0, 2, 1, 3) if not first_layer else conv_out
--> 419         conv_out = conv_out.reshape(self.batch_size, self.length, self.cnn_output_ch, self.emb_dim)
    420         conv_out = conv_out.unsqueeze(2).flatten(0, 1)
    421         conv_out = bn_module(conv_out.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)

RuntimeError: shape '[138, 50, 16, 192]' is invalid for input of size 84787200

Do you have any hints on how to solve this issue?

MahmoudAshraf97 commented 11 months ago

Can you upload the audio you are using so I can reproduce it?

Ko8rah commented 11 months ago

Yes, of course. Thank you for the quick response. The audio is a French podcast, and I'm running the notebook on Colab with the free T4 GPU.

podcast.mp3.zip

mjsteele12 commented 5 months ago

@Ko8rah did you ever find a solution? I am having the same problem.

MahmoudAshraf97 commented 5 months ago

The problem is caused by using the meeting or general config. Neither is supported for now, so stick to telephonic.
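For anyone wondering why a config mismatch shows up as this particular reshape error, the numbers in the message can be checked by hand. The sketch below is only an illustration. The shape and element count come from the traceback, and the six scales come from the log (`Multiscale Weights: [1, 1, 1, 1, 1, 1]`). The helper `conv_output_rows`, the assumption that the telephonic checkpoint was trained with five scales, and the assumed kernel height `scale_n * (1 + num_spks)` are not confirmed anywhere in this thread.

```python
import math

# Values copied from the RuntimeError in the traceback.
target_shape = (138, 50, 16, 192)  # (batch_size, length, cnn_output_ch, emb_dim)
actual_numel = 84_787_200          # number of elements reshape() actually received

expected_numel = math.prod(target_shape)
factor, remainder = divmod(actual_numel, expected_numel)
print(expected_numel, factor, remainder)


def conv_output_rows(trained_scales: int, inference_scales: int, num_spks: int = 2) -> int:
    """Hypothetical helper: rows left after an unpadded conv along the scale axis.

    ASSUMPTION (not confirmed in this thread): the first MSDD conv layer has a
    kernel of height trained_scales * (1 + num_spks), while the input has
    inference_scales * (1 + num_spks) rows.
    """
    kernel_height = trained_scales * (1 + num_spks)
    input_rows = inference_scales * (1 + num_spks)
    return input_rows - kernel_height + 1


# ASSUMPTION: the telephonic checkpoint was trained with 5 scales.
print(conv_output_rows(trained_scales=5, inference_scales=5))  # matching config
print(conv_output_rows(trained_scales=5, inference_scales=6))  # 6 scales, as in the log
```

The tensor holds exactly four times as many elements as the target shape. Under the assumptions above, that is what six inference scales fed into a five-scale model would produce, which fits the advice to stay on the telephonic config.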