MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

"SoundFile" object has no attribute "frames" && Source splitting failed #63

Closed labspicsprod closed 4 months ago

labspicsprod commented 1 year ago

Hello! I get the error below. How can I fix it? Is it because I am using a .wav file?


[NeMo W 2023-07-04 12:07:24 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
WARNING:root:Source splitting failed, using original audio file. Use --no-stem argument to disable it.
[NeMo I 2023-07-04 12:09:09 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2023-07-04 12:09:09 cloud:58] Found existing object C:\Users\STAGIAIRE\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-04 12:09:09 cloud:64] Re-using file from: C:\Users\STAGIAIRE\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo
[NeMo I 2023-07-04 12:09:09 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-04 12:09:10 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2023-07-04 12:09:10 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2023-07-04 12:09:10 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).        
    Test config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2023-07-04 12:09:10 features:287] PADDING: 16
[NeMo I 2023-07-04 12:09:10 features:287] PADDING: 16
[NeMo I 2023-07-04 12:09:10 save_restore_connector:247] Model EncDecDiarLabelModel was successfully restored from C:\Users\STAGIAIRE\.cache\torch\NeMo\NeMo_1.17.0\diar_msdd_telephonic\3c3697a0a46f945574fa407149975a13\diar_msdd_telephonic.nemo.
[NeMo I 2023-07-04 12:09:10 features:287] PADDING: 16
[NeMo I 2023-07-04 12:09:10 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2023-07-04 12:09:10 cloud:58] Found existing object C:\Users\STAGIAIRE\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-04 12:09:10 cloud:64] Re-using file from: C:\Users\STAGIAIRE\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo
[NeMo I 2023-07-04 12:09:11 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2023-07-04 12:09:11 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    tarred_shard_strategy: scatter
    augmentor:
      shift:
        prob: 0.5
        min_shift_ms: -10.0
        max_shift_ms: 10.0
      white_noise:
        prob: 0.5
        min_level: -90
        max_level: -46
        norm: true
      noise:
        prob: 0.5
        manifest_path: /manifests/noise_0_1_musan_fs.json
        min_snr_db: 0
        max_snr_db: 30
        max_gain_db: 300.0
        norm: true
      gain:
        prob: 0.5
        min_gain_dbfs: -10.0
        max_gain_dbfs: 10.0
        norm: true
    num_workers: 16
    pin_memory: true

[NeMo W 2023-07-04 12:09:11 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 256
    shuffle: false
    val_loss_idx: 0
    num_workers: 16
    pin_memory: true

[NeMo W 2023-07-04 12:09:11 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).        
    Test config :
    manifest_filepath: null
    sample_rate: 16000
    labels:
    - background
    - speech
    batch_size: 128
    shuffle: false
    test_loss_idx: 0

[NeMo I 2023-07-04 12:09:11 features:287] PADDING: 16
[NeMo I 2023-07-04 12:09:11 save_restore_connector:247] Model EncDecClassificationModel was successfully restored from C:\Users\STAGIAIRE\.cache\torch\NeMo\NeMo_1.17.0\vad_multilingual_marblenet\670f425c7f186060b7a7268ba6dfacb2\vad_multilingual_marblenet.nemo.
[NeMo I 2023-07-04 12:09:11 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2023-07-04 12:09:11 msdd_models:865] Clustering Parameters: {
        "oracle_num_speakers": false,
        "max_num_speakers": 8,
        "enhanced_count_thres": 80,
        "max_rp_threshold": 0.25,
        "sparse_search_volume": 30,
        "maj_vote_spk_count": false
    }
[NeMo W 2023-07-04 12:09:11 clustering_diarizer:411] Deleting previous clustering diarizer outputs.
[NeMo I 2023-07-04 12:09:11 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2023-07-04 12:09:11 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.19s/it] 
[NeMo I 2023-07-04 12:09:13 vad_utils:101] The prepared manifest file exists. Overwriting!
[NeMo I 2023-07-04 12:09:13 classification_models:263] Perform streaming frame-level VAD
[NeMo I 2023-07-04 12:09:13 collections:298] Filtered duration for loading collection is 0.000000.
[NeMo I 2023-07-04 12:09:13 collections:301] Dataset loaded with 3 items, total duration of  0.03 hours.
[NeMo I 2023-07-04 12:09:13 collections:303] # 3 files loaded accounting to # 1 labels
vad:   0%|                                                                                                                                                                               | 0/3 [00:00<?, ?it/s]
[NeMo W 2023-07-04 12:09:14 nemo_logging:349] C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\amp\autocast_mode.py:204: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
      warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

vad: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:04<00:00,  1.63s/it]
[NeMo I 2023-07-04 12:09:18 clustering_diarizer:250] Generating predictions with overlapping input segments
[NeMo I 2023-07-04 12:09:19 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.36it/s] 
Traceback (most recent call last):
  File "D:\Yvan\Perso\PRODUCTIVITE\Diarization\diarize.py", line 112, in <module>
    msdd_model.diarize()
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 1180, in diarize
    self.clustering_embedding.prepare_cluster_embs_infer()
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 699, in prepare_cluster_embs_infer
    self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\models\msdd_models.py", line 866, in run_clustering_diarizer
    scores = self.clus_diar_model.diarize(batch_size=self.cfg_diar_infer.batch_size)
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 437, in diarize
    self._perform_speech_activity_detection()
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 325, in _perform_speech_activity_detection
    self._run_vad(manifest_vad_input)
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\models\clustering_diarizer.py", line 281, in _run_vad
    write_rttm2manifest(AUDIO_VAD_RTTM_MAP, self._vad_out_file)
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\parts\utils\speaker_utils.py", line 867, in write_rttm2manifest
    offset, duration = get_offset_and_duration(AUDIO_RTTM_MAP, uniq_id, decimals)
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\nemo\collections\asr\parts\utils\speaker_utils.py", line 565, in get_offset_and_duration
    duration = sound.frames / sound.samplerate
  File "C:\Users\STAGIAIRE\AppData\Local\Programs\Python\Python310\lib\site-packages\soundfile.py", line 822, in __getattr__
    raise AttributeError(
AttributeError: 'SoundFile' object has no attribute 'frames'

MahmoudAshraf97 commented 1 year ago

Hi, you are passing cuda as the device, but you don't have a GPU available. Please fix this and try again.
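
For anyone hitting the same "CUDA is not available" warning, one way to avoid requesting a GPU that is not there is to check before choosing the device. This is only a minimal sketch, not code from this repository: `resolve_device` is a hypothetical helper, and how the resulting string gets passed to diarize.py is left as an assumption. The only library call used is `torch.cuda.is_available()`.

```python
from typing import Optional


def resolve_device(requested: str, cuda_available: Optional[bool] = None) -> str:
    """Return `requested`, or "cpu" when CUDA was requested but is not usable.

    Hypothetical helper for illustration; it is not part of whisper-diarization.
    """
    if cuda_available is None:
        try:
            import torch  # imported lazily so the helper also works without torch

            cuda_available = torch.cuda.is_available()
        except ImportError:
            # Without torch there is no CUDA backend to use.
            cuda_available = False

    # Matches "cuda" as well as indexed forms such as "cuda:0".
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested


if __name__ == "__main__":
    # On a machine without a GPU (as in the log above) this prints "cpu".
    print(resolve_device("cuda"))
```

The second argument exists only so the fallback can be exercised without a GPU or even without torch installed; in normal use you would call `resolve_device("cuda")` and use the returned string wherever the device is configured.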