MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

Audio only partly transcribed, and a different part each time? #163

Open Psarpei opened 4 months ago

Psarpei commented 4 months ago

When transcribing a 3-minute audio file with basic parameters and no stem, the resulting .srt file only contains part of the original audio. Sometimes it's the start, sometimes the end, and sometimes something in between.

Does anyone have an idea what's wrong here?

transcriptionstream commented 4 months ago

Any other detail? Version of python in use? Errors?
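One way to gather those details in one go is a short script like the sketch below. The distribution names in `PACKAGES` are assumptions about what is installed in the environment, and `collect_versions` is a hypothetical helper, not part of whisper-diarization; adjust the list as needed.

```python
import platform
import sys
from importlib import metadata

# Distribution names are assumptions; edit to match your environment.
PACKAGES = ["torch", "torchaudio", "faster-whisper", "nemo_toolkit", "pyannote.audio"]


def collect_versions(packages=PACKAGES):
    """Return a dict with the interpreter, OS, and installed package versions."""
    info = {"python": sys.version.split()[0], "platform": platform.platform()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing, so the output is always complete.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed lines together with the exact command and its console output usually makes an issue like this easier to reproduce.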

Psarpei commented 3 months ago

Hey @transcriptionstream thanks for your reply!

Python 3.10. I don't get any errors.

60
00:05:28,056 --> 00:05:32,820
Speaker 1: Right now we spend the same amount of compute on each token, a dumb one, or like figuring out some complicated math.

61
00:05:32,820 --> 00:05:33,700
Speaker 1: !

62
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Up to cue 60 everything is transcribed accurately, but a lot of the spoken text after that is missing. The transcript resumes with the part of the audio in cue 62, so the section in between is skipped.

When I repeat the run, the skipped part of the audio differs in length:

47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.

Now the skipped part is much longer, but the last sentence is still there.
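The two runs can be compared mechanically by looking for cues that contain no actual words. This is a minimal sketch, assuming the file follows the usual three-line SRT cue layout (index, time range, text) and that a cue is suspicious when only punctuation follows the speaker label. The function names are hypothetical and not part of whisper-diarization; the sample data is the two excerpts quoted in this thread.

```python
import re

TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def to_ms(stamp):
    """Convert an SRT timestamp such as 00:05:33,700 to integer milliseconds."""
    h, m, s, ms = map(int, TIME_RE.fullmatch(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Return a list of (index, start_ms, end_ms, text) tuples, one per cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete cue
        start, end = (to_ms(part) for part in lines[1].split("-->"))
        cues.append((int(lines[0]), start, end, " ".join(lines[2:])))
    return cues


def suspicious_cues(cues):
    """Return (index, duration_ms) for cues with no letters or digits after the speaker label."""
    flagged = []
    for index, start, end, text in cues:
        spoken = text.split(":", 1)[-1]  # drop a leading "Speaker N:" label if present
        if not any(ch.isalnum() for ch in spoken):
            flagged.append((index, end - start))
    return flagged


RUN_1 = """60
00:05:28,056 --> 00:05:32,820
Speaker 1: Right now we spend the same amount of compute on each token, a dumb one, or like figuring out some complicated math.

61
00:05:32,820 --> 00:05:33,700
Speaker 1: !

62
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.
"""

RUN_2 = """47
00:04:57,496 --> 00:05:02,519
Speaker 0: So, you know, to generate every new word, it's essentially doing the same thing.

48
00:05:02,519 --> 00:05:33,700
Speaker 0: !

49
00:05:33,700 --> 00:05:36,383
Speaker 0: Subscribe to Unconfuse Me wherever you listen to podcasts.
"""

if __name__ == "__main__":
    print(suspicious_cues(parse_srt(RUN_1)))  # [(61, 880)]
    print(suspicious_cues(parse_srt(RUN_2)))  # [(48, 31181)]
```

In both runs the only flagged cue is the lone "!", lasting 0.88 s in the first run and about 31.2 s in the second, and both end at 00:05:33,700.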

Psarpei commented 3 months ago

That's the full log:

python diarize.py -a /home/pascal/code/video_translator/data/sent_lvl_sd/bgates_saltmann2/audio_file_enh.wav --whisper-model large-v3 --suppress_numerals --device cuda --language en
/home/pascal/anaconda3/envs/whisper_diar_inf/lib/python3.10/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
[NeMo W 2024-03-27 17:19:05 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
Failed to align segment ("!"): no characters in this segment found in model dictionary, resorting to original...
[NeMo I 2024-03-27 17:20:14 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-03-27 17:20:14 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:14 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-03-27 17:20:14 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: true

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
emb_dir: null
sample_rate: 16000
num_spks: 2
soft_label_thres: 0.5
labels: null
batch_size: 15
emb_batch_size: 0
shuffle: false
seq_eval_mode: false

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16
[NeMo I 2024-03-27 17:20:15 audio_preprocessing:517] Numba CUDA SpecAugment kernel is being used
[NeMo I 2024-03-27 17:20:15 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-03-27 17:20:15 cloud:58] Found existing object /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-03-27 17:20:15 cloud:64] Re-using file from: /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-03-27 17:20:15 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-03-27 17:20:15 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
Train config :
manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
sample_rate: 16000
labels:

[NeMo W 2024-03-27 17:20:15 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
Validation config :
manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
sample_rate: 16000
labels:

[NeMo W 2024-03-27 17:20:15 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
Test config :
manifest_filepath: null
sample_rate: 16000
labels:

[NeMo I 2024-03-27 17:20:15 features:289] PADDING: 16 [NeMo I 2024-03-27 17:20:15 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /home/pascal/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo. [NeMo I 2024-03-27 17:20:15 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1] [NeMo I 2024-03-27 17:20:15 msdd_models:865] Clustering Parameters: { "oracle_num_speakers": false, "max_num_speakers": 8, "enhanced_count_thres": 80, "max_rp_threshold": 0.25, "sparse_search_volume": 30, "maj_vote_spk_count": false } [NeMo I 2024-03-27 17:20:15 speaker_utils:93] Number of files to diarize: 1 [NeMo I 2024-03-27 17:20:15 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue splitting manifest: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.29it/s] [NeMo I 2024-03-27 17:20:16 classification_models:273] Perform streaming frame-level VAD [NeMo I 2024-03-27 17:20:16 collections:445] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-03-27 17:20:16 collections:446] Dataset loaded with 8 items, total duration of 0.10 hours. 
[NeMo I 2024-03-27 17:20:16 collections:448] # 8 files loaded accounting to # 1 labels vad: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 7.35it/s] [NeMo I 2024-03-27 17:20:17 clustering_diarizer:250] Generating predictions with overlapping input segments [NeMo I 2024-03-27 17:20:18 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6.57it/s] [NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale0.json [NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 343 items, total duration of 0.13 hours. [NeMo I 2024-03-27 17:20:19 collections:448] # 343 files loaded accounting to # 1 labels [1/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 10.06it/s] [NeMo I 2024-03-27 17:20:19 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings [NeMo I 2024-03-27 17:20:19 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale1.json [NeMo I 2024-03-27 17:20:19 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-03-27 17:20:19 collections:445] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-03-27 17:20:19 collections:446] Dataset loaded with 420 items, total duration of 0.13 hours. 
[NeMo I 2024-03-27 17:20:19 collections:448] # 420 files loaded accounting to # 1 labels [2/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 12.25it/s] [NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings [NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale2.json [NeMo I 2024-03-27 17:20:20 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-03-27 17:20:20 collections:445] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-03-27 17:20:20 collections:446] Dataset loaded with 535 items, total duration of 0.14 hours. 
[NeMo I 2024-03-27 17:20:20 collections:448] # 535 files loaded accounting to # 1 labels [3/5] extract embeddings: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 13.27it/s] [NeMo I 2024-03-27 17:20:20 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings [NeMo I 2024-03-27 17:20:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale3.json [NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 722 items, total duration of 0.14 hours. 
[NeMo I 2024-03-27 17:20:21 collections:448] # 722 files loaded accounting to # 1 labels [4/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 15.12it/s] [NeMo I 2024-03-27 17:20:21 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings [NeMo I 2024-03-27 17:20:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4.json [NeMo I 2024-03-27 17:20:21 clustering_diarizer:343] Extracting embeddings for Diarization [NeMo I 2024-03-27 17:20:21 collections:445] Filtered duration for loading collection is 0.00 hours. [NeMo I 2024-03-27 17:20:21 collections:446] Dataset loaded with 1106 items, total duration of 0.15 hours. 
[NeMo I 2024-03-27 17:20:21 collections:448] # 1106 files loaded accounting to # 1 labels [5/5] extract embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 18.48it/s] [NeMo I 2024-03-27 17:20:22 clustering_diarizer:389] Saved embedding files to /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings clustering: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.52it/s] [NeMo I 2024-03-27 17:20:23 clustering_diarizer:464] Outputs are saved in /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs directory [NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. 
Skipping calculation of Diariazation Error Rate [NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:0 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl [NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:1 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl [NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:2 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl [NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:3 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl [NeMo I 2024-03-27 17:20:23 msdd_models:960] Loading embedding pickle file of scale:4 at /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl [NeMo I 2024-03-27 17:20:23 msdd_models:938] Loading cluster label file from /home/pascal/code/chamelaion_inference/speaker_diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label [NeMo I 2024-03-27 17:20:23 collections:761] Filtered duration for loading collection is 0.000000. 
[NeMo I 2024-03-27 17:20:23 collections:764] Total 1 session files loaded accounting to # 1 audio clips 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.66it/s] [NeMo I 2024-03-27 17:20:23 msdd_models:1403] [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50] [NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1 [NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1 [NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate [NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1 [NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate [NeMo I 2024-03-27 17:20:23 speaker_utils:93] Number of files to diarize: 1 [NeMo W 2024-03-27 17:20:23 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate [NeMo I 2024-03-27 17:20:23 msdd_models:1431]