MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Colab error: RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size #79

Closed: TheGermanEngie closed this issue 4 months ago

TheGermanEngie commented 1 year ago

I wanted to make sure I got this error on another account before I reported this.

In the wav2vec section, it will sometimes fail and throw this error:

RuntimeError                              Traceback (most recent call last)

<ipython-input-10-dde0cc33e474> in <cell line: 1>()
      4         language_code=info.language, device=device
      5     )
----> 6     result_aligned = whisperx.align(
      7         whisper_results, alignment_model, metadata, vocal_target, device
      8     )

9 frames

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    307                             weight, bias, self.stride,
    308                             _single(0), self.dilation, self.groups)
--> 309         return F.conv1d(input, weight, bias, self.stride,
    310                         self.padding, self.dilation, self.groups)
    311 

RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size

At first I thought it was some sort of memory issue or a free Colab usage limit, so I switched accounts and loaded the same file, and the notebook worked as it should; the error above is from my second account, which also uses free Colab.
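
My current guess from the traceback is that whisperx.align is being handed a segment that is too short for the wav2vec2 convolution kernel (padded input size 1 vs. kernel size 2). As a sketch of what I mean, filtering such segments before aligning would look roughly like this; the variable names come from the cell above, and the threshold is just a guess on my part:

# Workaround sketch, not the notebook's original code: skip segments that are too
# short to align. whisper_results, alignment_model, metadata, vocal_target and device
# are the variables from the cell shown above; MIN_SEGMENT_SEC is an assumed threshold.
MIN_SEGMENT_SEC = 0.02

filtered_results = [
    seg for seg in whisper_results
    if (seg["end"] - seg["start"]) >= MIN_SEGMENT_SEC
]

result_aligned = whisperx.align(
    filtered_results, alignment_model, metadata, vocal_target, device
)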

MahmoudAshraf97 commented 1 year ago

So just to understand, you couldn't reproduce the error using another colab account?

TheGermanEngie commented 1 year ago

No, I was able to reproduce it. Both of my accounts use the free version of Colab.

TheGermanEngie commented 1 year ago

An update: this is no longer just a Colab issue. My third computer (a budget GTX 1660 Ti laptop) running CUDA 11.8 successfully completes a 1-hour clip, while one 27:33 clip fails with the RuntimeError above and one 26:05 clip runs successfully.

The 1-hour clip is an .mp3, the failed clip is an .mp3, and the other, shorter successful clip is an .m4a.

I have no idea what the problem could be; I'm at a loss. The full console output from all three runs is below:

(diarize) omen@omen-PC:~/AI/whisper-diarization$ python diarize.py -a tucker1hr.mp3 --no-stem --whisper-model medium.en --device cuda
[NeMo W 2023-08-21 19:53:47 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Estimating duration from bitrate, this may be inaccurate
[NeMo I 2023-08-21 20:05:41 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-08-21 20:05:41 cloud:58] Found existing object /home/omen/.cache/torch/NeMo/NeMo_1.19.1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-08-21 20:05:41 cloud:64] Re-using file from: /home/omen/.cache/torch/NeMo/NeMo_1.19.1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2023-08-21 20:05:41 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-08-21 20:05:42 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-08-21 20:05:42 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-08-21 20:05:42 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-08-21 20:05:42 features:291] PADDING: 16
[NeMo I 2023-08-21 20:05:42 features:291] PADDING: 16
[NeMo I 2023-08-21 20:05:43 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/omen/.cache/torch/NeMo/NeMo_1.19.1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-08-21 20:05:43 features:291] PADDING: 16
[NeMo I 2023-08-21 20:05:43 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-08-21 20:05:43 cloud:58] Found existing object /home/omen/.cache/torch/NeMo/NeMo_1.19.1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-08-21 20:05:43 cloud:64] Re-using file from: /home/omen/.cache/torch/NeMo/NeMo_1.19.1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2023-08-21 20:05:43 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-08-21 20:05:43 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-08-21 20:05:43 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-08-21 20:05:43 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-08-21 20:05:43 features:291] PADDING: 16
[NeMo I 2023-08-21 20:05:43 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/omen/.cache/torch/NeMo/NeMo_1.19.1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-08-21 20:05:43 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-08-21 20:05:43 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false
    }
[NeMo W 2023-08-21 20:05:43 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-08-21 20:05:43 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-08-21 20:05:43 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|██████████████████████████████████████| 1/1 [00:01<00:00,  1.77s/it]
[NeMo I 2023-08-21 20:05:45 vad_utils:101] The prepared manifest file exists. Overwriting!
[NeMo I 2023-08-21 20:05:45 classification_models:268] Perform streaming frame-level VAD
[NeMo I 2023-08-21 20:05:45 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:05:45 collections:299] Dataset loaded with 73 items, total duration of  1.01 hours.
[NeMo I 2023-08-21 20:05:45 collections:301] # 73 files loaded accounting to # 1 labels
vad: 100%|███████████████████████████████████████████████████| 73/73 [00:15<00:00,  4.77it/s]
[NeMo I 2023-08-21 20:06:00 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2023-08-21 20:06:28 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|████████████████████████████████| 1/1 [00:04<00:00,  4.34s/it]
[NeMo I 2023-08-21 20:06:32 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2023-08-21 20:06:33 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:06:33 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:06:33 collections:299] Dataset loaded with 3103 items, total duration of  1.05 hours.
[NeMo I 2023-08-21 20:06:33 collections:301] # 3103 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|██████████████████████████████| 49/49 [00:36<00:00,  1.36it/s]
[NeMo I 2023-08-21 20:07:09 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:07:09 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2023-08-21 20:07:09 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:07:09 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:07:09 collections:299] Dataset loaded with 3714 items, total duration of  1.10 hours.
[NeMo I 2023-08-21 20:07:09 collections:301] # 3714 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|██████████████████████████████| 59/59 [00:30<00:00,  1.93it/s]
[NeMo I 2023-08-21 20:07:39 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:07:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2023-08-21 20:07:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:07:40 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:07:40 collections:299] Dataset loaded with 4605 items, total duration of  1.14 hours.
[NeMo I 2023-08-21 20:07:40 collections:301] # 4605 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|██████████████████████████████| 72/72 [00:37<00:00,  1.92it/s]
[NeMo I 2023-08-21 20:08:17 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:08:17 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2023-08-21 20:08:17 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:08:17 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:08:17 collections:299] Dataset loaded with 6180 items, total duration of  1.19 hours.
[NeMo I 2023-08-21 20:08:17 collections:301] # 6180 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|██████████████████████████████| 97/97 [00:22<00:00,  4.33it/s]
[NeMo I 2023-08-21 20:08:40 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:08:40 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2023-08-21 20:08:40 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:08:40 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:08:40 collections:299] Dataset loaded with 9403 items, total duration of  1.25 hours.
[NeMo I 2023-08-21 20:08:40 collections:301] # 9403 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|████████████████████████████| 147/147 [00:27<00:00,  5.39it/s]
[NeMo I 2023-08-21 20:09:08 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|██████████████████████████████████████████████| 1/1 [00:06<00:00,  6.21s/it]
[NeMo I 2023-08-21 20:09:15 clustering_diarizer:464] Outputs are saved in /home/omen/AI/whisper-diarization/temp_outputs directory
[NeMo W 2023-08-21 20:09:15 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:09:15 msdd_models:960] Loading embedding pickle file of scale:0 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2023-08-21 20:09:15 msdd_models:960] Loading embedding pickle file of scale:1 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2023-08-21 20:09:15 msdd_models:960] Loading embedding pickle file of scale:2 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2023-08-21 20:09:15 msdd_models:960] Loading embedding pickle file of scale:3 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2023-08-21 20:09:15 msdd_models:960] Loading embedding pickle file of scale:4 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2023-08-21 20:09:15 msdd_models:938] Loading cluster label file from /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2023-08-21 20:09:16 collections:614] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-08-21 20:09:16 collections:617] Total 1 session files loaded accounting to # 1 audio clips
100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.80it/s]
[NeMo I 2023-08-21 20:09:17 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2023-08-21 20:09:17 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-08-21 20:09:17 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-08-21 20:09:17 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:09:17 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-08-21 20:09:17 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:09:17 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-08-21 20:09:18 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:09:18 msdd_models:1431]   

[NeMo W 2023-08-21 20:09:20 nemo_logging:349] /home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/transformers/pipelines/token_classification.py:169: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
      warnings.warn(

[NeMo W 2023-08-21 20:09:21 nemo_logging:349] /home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/transformers/pipelines/base.py:1083: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
      warnings.warn(
(diarize) omen@omen-PC:~/AI/whisper-diarization$ python diarize.py -a tucker_two_small.mp3 --no-stem --whisper-model medium.en --device cuda
[NeMo W 2023-08-21 20:09:47 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Estimating duration from bitrate, this may be inaccurate
Traceback (most recent call last):
  File "/home/omen/AI/whisper-diarization/diarize.py", line 89, in <module>
    result_aligned = whisperx.align(
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/whisperx/alignment.py", line 224, in align
    emissions, _ = model(waveform_segment.to(device))
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torchaudio/models/wav2vec2/model.py", line 116, in forward
    x, lengths = self.feature_extractor(waveforms, lengths)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torchaudio/models/wav2vec2/components.py", line 141, in forward
    x, length = layer(x, length)  # (batch, feature, frame)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torchaudio/models/wav2vec2/components.py", line 90, in forward
    x = self.conv(x)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (1). Kernel size: (2). Kernel size can't be greater than actual input size
(diarize) omen@omen-PC:~/AI/whisper-diarization$ python diarize.py -a tucker_largest_solo.m4a --no-stem --whisper-model medium.en --device cuda
[NeMo W 2023-08-21 20:15:53 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-08-21 20:20:56 nemo_logging:349] /home/omen/AI/whisper-diarization/diarize.py:104: UserWarning: PySoundFile failed. Trying audioread instead.
      signal, sample_rate = librosa.load(vocal_target, sr=None)

[NeMo W 2023-08-21 20:20:56 nemo_logging:349] /home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/librosa/core/audio.py:183: FutureWarning: librosa.core.audio.__audioread_load
        Deprecated as of librosa version 0.10.0.
        It will be removed in librosa version 1.0.
      y, sr_native = __audioread_load(path, offset, duration, dtype)

100% [.........................................................................] 7336 / 7336
[NeMo I 2023-08-21 20:21:00 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-08-21 20:21:00 cloud:58] Found existing object /home/omen/.cache/torch/NeMo/NeMo_1.19.1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-08-21 20:21:00 cloud:64] Re-using file from: /home/omen/.cache/torch/NeMo/NeMo_1.19.1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2023-08-21 20:21:00 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-08-21 20:21:01 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-08-21 20:21:01 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-08-21 20:21:01 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-08-21 20:21:01 features:291] PADDING: 16
[NeMo I 2023-08-21 20:21:01 features:291] PADDING: 16
[NeMo I 2023-08-21 20:21:01 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/omen/.cache/torch/NeMo/NeMo_1.19.1/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2023-08-21 20:21:01 features:291] PADDING: 16
[NeMo I 2023-08-21 20:21:01 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-08-21 20:21:01 cloud:58] Found existing object /home/omen/.cache/torch/NeMo/NeMo_1.19.1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-08-21 20:21:01 cloud:64] Re-using file from: /home/omen/.cache/torch/NeMo/NeMo_1.19.1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2023-08-21 20:21:01 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-08-21 20:21:02 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-08-21 20:21:02 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-08-21 20:21:02 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config : 
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-08-21 20:21:02 features:291] PADDING: 16
[NeMo I 2023-08-21 20:21:02 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/omen/.cache/torch/NeMo/NeMo_1.19.1/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2023-08-21 20:21:02 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-08-21 20:21:02 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false
    }
[NeMo I 2023-08-21 20:21:02 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-08-21 20:21:02 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|██████████████████████████████████████| 1/1 [00:00<00:00,  1.20it/s]
[NeMo I 2023-08-21 20:21:02 classification_models:268] Perform streaming frame-level VAD
[NeMo I 2023-08-21 20:21:02 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:21:02 collections:299] Dataset loaded with 32 items, total duration of  0.44 hours.
[NeMo I 2023-08-21 20:21:02 collections:301] # 32 files loaded accounting to # 1 labels
vad: 100%|███████████████████████████████████████████████████| 32/32 [00:06<00:00,  4.71it/s]
[NeMo I 2023-08-21 20:21:09 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2023-08-21 20:21:22 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|████████████████████████████████| 1/1 [00:01<00:00,  1.87s/it]
[NeMo I 2023-08-21 20:21:24 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2023-08-21 20:21:24 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:21:24 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:21:24 collections:299] Dataset loaded with 1299 items, total duration of  0.46 hours.
[NeMo I 2023-08-21 20:21:24 collections:301] # 1299 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100%|██████████████████████████████| 21/21 [00:15<00:00,  1.37it/s]
[NeMo I 2023-08-21 20:21:39 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:21:39 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2023-08-21 20:21:39 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:21:39 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:21:39 collections:299] Dataset loaded with 1563 items, total duration of  0.48 hours.
[NeMo I 2023-08-21 20:21:39 collections:301] # 1563 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100%|██████████████████████████████| 25/25 [00:12<00:00,  1.94it/s]
[NeMo I 2023-08-21 20:21:52 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:21:52 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2023-08-21 20:21:52 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:21:52 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:21:52 collections:299] Dataset loaded with 1959 items, total duration of  0.50 hours.
[NeMo I 2023-08-21 20:21:52 collections:301] # 1959 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100%|██████████████████████████████| 31/31 [00:15<00:00,  1.94it/s]
[NeMo I 2023-08-21 20:22:08 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:22:08 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2023-08-21 20:22:08 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:22:08 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:22:08 collections:299] Dataset loaded with 2647 items, total duration of  0.52 hours.
[NeMo I 2023-08-21 20:22:08 collections:301] # 2647 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100%|██████████████████████████████| 42/42 [00:09<00:00,  4.30it/s]
[NeMo I 2023-08-21 20:22:18 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2023-08-21 20:22:18 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2023-08-21 20:22:18 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2023-08-21 20:22:18 collections:298] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2023-08-21 20:22:18 collections:299] Dataset loaded with 4047 items, total duration of  0.54 hours.
[NeMo I 2023-08-21 20:22:18 collections:301] # 4047 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100%|██████████████████████████████| 64/64 [00:11<00:00,  5.38it/s]
[NeMo I 2023-08-21 20:22:30 clustering_diarizer:389] Saved embedding files to /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100%|██████████████████████████████████████████████| 1/1 [00:01<00:00,  1.30s/it]
[NeMo I 2023-08-21 20:22:31 clustering_diarizer:464] Outputs are saved in /home/omen/AI/whisper-diarization/temp_outputs directory
[NeMo W 2023-08-21 20:22:31 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:22:31 msdd_models:960] Loading embedding pickle file of scale:0 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2023-08-21 20:22:31 msdd_models:960] Loading embedding pickle file of scale:1 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2023-08-21 20:22:31 msdd_models:960] Loading embedding pickle file of scale:2 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2023-08-21 20:22:31 msdd_models:960] Loading embedding pickle file of scale:3 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2023-08-21 20:22:31 msdd_models:960] Loading embedding pickle file of scale:4 at /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2023-08-21 20:22:31 msdd_models:938] Loading cluster label file from /home/omen/AI/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2023-08-21 20:22:32 collections:614] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-08-21 20:22:32 collections:617] Total 1 session files loaded accounting to # 1 audio clips
100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.67it/s]
[NeMo I 2023-08-21 20:22:32 msdd_models:1403]      [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2023-08-21 20:22:32 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-08-21 20:22:32 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-08-21 20:22:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:22:32 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-08-21 20:22:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:22:32 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2023-08-21 20:22:32 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2023-08-21 20:22:32 msdd_models:1431]   

[NeMo W 2023-08-21 20:22:35 nemo_logging:349] /home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/transformers/pipelines/token_classification.py:169: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0, defaulted to `aggregation_strategy="none"` instead.
      warnings.warn(

[NeMo W 2023-08-21 20:22:36 nemo_logging:349] /home/omen/miniconda3/envs/diarize/lib/python3.9/site-packages/transformers/pipelines/base.py:1083: UserWarning: You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
      warnings.warn(
TheGermanEngie commented 1 year ago

I don't know why there are strikethroughs in part of the comment.

cateyelow commented 1 year ago

Let's upgrade whisperx:

pip install git+https://github.com/m-bain/whisperx.git --upgrade

Then, in diarize.py, change beam_size from 1 to 7:

segments, info = whisper_model.transcribe(
    vocal_target,
    beam_size=7,
    word_timestamps=True,
    language=info.language,
)

MahmoudAshraf97 commented 1 year ago

I guess this can be solved by upgrading whisperx as @cateyelow suggests. The problem is that older versions of whisperx fall back to the original timings when backtracking fails, whereas the newer versions omit the segment entirely, which causes errors further down the line.
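
Roughly, the old behaviour amounts to something like this (a sketch only, using the variable names from diarize.py and assuming whisperx.align returns a dict with a "segments" list, not the actual whisperx implementation):

# Hedged sketch of the "fall back to original timings" behaviour.
aligned_segments = []
for seg in whisper_results:
    try:
        out = whisperx.align([seg], alignment_model, metadata, vocal_target, device)
        aligned_segments.extend(out["segments"])
    except RuntimeError:
        # e.g. the conv1d kernel-size error above: keep the unaligned segment with
        # Whisper's original start/end instead of dropping it.
        aligned_segments.append(seg)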

TheGermanEngie commented 1 year ago

Over a series of back-and-forth questions, GPT-4's code interpreter suggested that the shorter clip failed while the longer one worked possibly because of how the model batches the file. If you want me to paste our conversation, I can.

The hour-long clip works consistently on your Colab, which supports my theory.

TheGermanEngie commented 1 year ago

More updates. As part of a school project, I'm using audio in roughly 2-hour chunks from a talk show host; the audio is very clear, and one can easily tell when the host or the guest is speaking. On my local machine I have two clips whose lengths are within one second of each other: the first completes successfully while the second throws the kernel error. The clips aren't exactly 2 hours each, since some are maybe 2-6 minutes shorter or longer than others (the variation averages around 10 minutes), and anything longer than 2 hours doesn't fit into memory. I also had one file throw a CUDA memory error, but it was the only file that did. Any ideas on how this could be resolved, maybe a WhisperX update or something else?

TheGermanEngie commented 1 year ago

Also, on some of the files that did complete successfully, diarization fails completely: what should be Speaker 1's text is just merged into the same speaker's stream, like so:

Normal:

Speaker 0: Good evening and welcome.
Speaker 1: Thanks for having me.

What I get instead:

Speaker 0: Good evening and welcome. Thanks for having me.

I've also had some files where diarization works, but the speaker label after 0 jumps to something high like 3 or 7.

All files are .mp3 at a 192 kbps bitrate.
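
Would constraining the NeMo clustering config help here? The clustering parameters in the logs above allow up to 8 speakers, so for clips known to contain exactly two people, something like the sketch below might cap the labels. This is only a guess on my part, assuming config is the OmegaConf object that diarize.py passes to the NeMo NeuralDiarizer:

# Hedged sketch: constrain the speaker count in the NeMo diarizer config.
# The parameter names mirror the "Clustering Parameters" log above.
config.diarizer.clustering.parameters.max_num_speakers = 2   # the log shows a default of 8
# Alternatively, if the manifest entry provides "num_speakers", the oracle count could be used:
# config.diarizer.clustering.parameters.oracle_num_speakers = True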

TheGermanEngie commented 11 months ago

Hi, I noticed you updated the NeMo toolkit. Is there a chance this could solve the error I've been having?

MahmoudAshraf97 commented 11 months ago

Hi all, this error will be fixed after this PR is merged: https://github.com/m-bain/whisperX/pull/510