Transcripts4All / tools4all

A curated collection of tools to aid transcriptionists and subtitlers.
https://transcripts4all.github.io

whisper-diarization isn't working for me #1

Closed MHA11 closed 6 months ago

MHA11 commented 6 months ago

Good day everyone.

I’ve been trying to get the latest version of Whisper AI running, to no avail. I followed the instructions, got a Restart Session pop-up, waited for everything to install, and then clicked Restart Session.

After adding the file name to the 2nd command, I get the error in the 2nd image: something about a zip file not being created.

Now, I don't know anything about programming, so please keep that in mind. I've been at this for 2 days. On Reddit, a kind gentleman recommended changing this line:

!python diarize_parallel.py --whisper-model large-v3 -a "$audioFile"

To this line:

!python diarize_parallel.py --whisper-model large-v3 --batch_size 0 -a "$audioFile"

But it didn’t work.

One last bit of info: the language I'm transcribing is Bengali (abbreviated bn when specifying it), which might be where we're having an issue. But that's just a theory.

[Screenshots: Google Colab errors]

Any assistance would be greatly appreciated.

ScriptTiger commented 6 months ago

In the image you've posted, the document is titled "Untitled1.ipynb". That means you've either copied the notebook or just copied the code segments into a different notebook. It's difficult to troubleshoot if we have no idea what your runtime settings and environment are. If you're just copying and pasting, you're not actually replicating the project 100%, because the notebook also carries runtime and environment settings, such as the version of Python, the type of processor, and so on. If you'd like help, please run the official notebook and we can troubleshoot issues from there.
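
A quick way to capture those details is to run a few info commands in a notebook cell. These are generic Colab/Jupyter commands, not part of this project, and nvidia-smi only produces output if a GPU runtime is attached:

!python --version
!nvidia-smi
!pip show torch

The output of those three commands is usually enough to tell whether a copied notebook is running the same Python, GPU, and library versions as the official one.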

ScriptTiger commented 6 months ago

@MHA11 Could you also drop a link to some sample audio you're having this problem with so we can test it out on our end?

MHA11 commented 6 months ago

ScriptTiger, are you the same guy I was asking on Reddit?! I was going to ask around before bothering you again. By official notebook, you mean like this one: official notebook? I just ran it again and got the same error. I've copied/pasted the entire process below.

How do I find the information you requested? Do the attached images help? [Screenshots: runtime type, runtime log]

Here is a sample audio: https://drive.google.com/drive/folders/1V-kIy3al2jV9VkbfPubqxkhnpJx5sGgU?usp=drive_link

........................

Step 3 process:

2024-04-25 15:05:33.659041: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 15:05:33.659087: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 15:05:33.762822: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 15:05:35.555381: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
[NeMo W 2024-04-25 15:05:44 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.
Downloading: "https://dl.fbaipublicfiles.com/demucs/hybrid_transformer/955717e8-8726e21a.th" to /root/.cache/torch/hub/checkpoints/955717e8-8726e21a.th
100% 80.2M/80.2M [00:01<00:00, 42.3MB/s]
Selected model is a bag of 1 models. You will see that many progress bars per track.
Separated tracks will be stored in /content/whisper-diarization/temp_outputs/htdemucs
Separating track /content/Nayem 16.04.2024.mp3
100%|██████████████████████████████████████████████████████████████████████| 520.65/520.65 [00:22<00:00, 22.99seconds/s]
vocabulary.json: 0% 0.00/1.07M [00:00<?, ?B/s]
preprocessor_config.json: 100% 340/340 [00:00<00:00, 1.96MB/s]

tokenizer.json: 0% 0.00/2.48M [00:00<?, ?B/s]

config.json: 100% 2.39k/2.39k [00:00<00:00, 9.72MB/s]
vocabulary.json: 100% 1.07M/1.07M [00:00<00:00, 5.40MB/s]
tokenizer.json: 100% 2.48M/2.48M [00:00<00:00, 9.72MB/s]
model.bin: 0% 0.00/3.09G [00:00<?, ?B/s]
2024-04-25 15:06:27.842214: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-25 15:06:27.842273: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-25 15:06:27.844085: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-25 15:06:29.131906: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
[NeMo W 2024-04-25 15:06:40 transformer_bpe_models:59] Could not import NeMo NLP collection which is required for speech translation model.
[NeMo I 2024-04-25 15:06:40 msdd_models:1092] Loading pretrained diar_msdd_telephonic model from NGC
[NeMo I 2024-04-25 15:06:40 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/diar_msdd_telephonic/versions/1.0.1/files/diar_msdd_telephonic.nemo to /root/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo
[NeMo I 2024-04-25 15:06:46 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-04-25 15:06:48 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: true

[NeMo W 2024-04-25 15:06:48 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false

[NeMo W 2024-04-25 15:06:48 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    emb_dir: null
    sample_rate: 16000
    num_spks: 2
    soft_label_thres: 0.5
    labels: null
    batch_size: 15
    emb_batch_size: 0
    shuffle: false
    seq_eval_mode: false

[NeMo I 2024-04-25 15:06:48 features:289] PADDING: 16
[NeMo I 2024-04-25 15:06:48 features:289] PADDING: 16
[NeMo I 2024-04-25 15:06:51 save_restore_connector:249] Model EncDecDiarLabelModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.22.0/diar_msdd_telephonic/3c3697a0a46f945574fa407149975a13/diar_msdd_telephonic.nemo.
[NeMo I 2024-04-25 15:06:51 features:289] PADDING: 16
model.bin: 100% 3.09G/3.09G [00:30<00:00, 102MB/s]
[NeMo I 2024-04-25 15:06:52 clustering_diarizer:127] Loading pretrained vad_multilingual_marblenet model from NGC
[NeMo I 2024-04-25 15:06:52 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/vad_multilingual_marblenet/versions/1.10.0/files/vad_multilingual_marblenet.nemo to /root/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo
[NeMo I 2024-04-25 15:06:53 common:913] Instantiating model from pre-trained checkpoint
[NeMo W 2024-04-25 15:06:53 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config :
    manifest_filepath: /manifests/ami_train_0.63.json,/manifests/freesound_background_train.json,/manifests/freesound_laughter_train.json,/manifests/fisher_2004_background.json,/manifests/fisher_2004_speech_sampled.json,/manifests/google_train_manifest.json,/manifests/icsi_all_0.63.json,/manifests/musan_freesound_train.json,/manifests/musan_music_train.json,/manifests/musan_soundbible_train.json,/manifests/mandarin_train_sample.json,/manifests/german_train_sample.json,/manifests/spanish_train_sample.json,/manifests/french_train_sample.json,/manifests/russian_train_sample.json
    sample_rate: 16000
    labels:

[NeMo W 2024-04-25 15:06:53 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s).
    Validation config :
    manifest_filepath: /manifests/ami_dev_0.63.json,/manifests/freesound_background_dev.json,/manifests/freesound_laughter_dev.json,/manifests/ch120_moved_0.63.json,/manifests/fisher_2005_500_speech_sampled.json,/manifests/google_dev_manifest.json,/manifests/musan_music_dev.json,/manifests/mandarin_dev.json,/manifests/german_dev.json,/manifests/spanish_dev.json,/manifests/french_dev.json,/manifests/russian_dev.json
    sample_rate: 16000
    labels:

[NeMo W 2024-04-25 15:06:53 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a valid configuration file to setup the test data loader(s).
    Test config :
    manifest_filepath: null
    sample_rate: 16000
    labels:

[NeMo I 2024-04-25 15:06:53 features:289] PADDING: 16
[NeMo I 2024-04-25 15:06:53 save_restore_connector:249] Model EncDecClassificationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.22.0/vad_multilingual_marblenet/670f425c7f186060b7a7268ba6dfacb2/vad_multilingual_marblenet.nemo.
[NeMo I 2024-04-25 15:06:54 msdd_models:864] Multiscale Weights: [1, 1, 1, 1, 1]
[NeMo I 2024-04-25 15:06:54 msdd_models:865] Clustering Parameters: {
    "oracle_num_speakers": false,
    "max_num_speakers": 8,
    "enhanced_count_thres": 80,
    "max_rp_threshold": 0.25,
    "sparse_search_volume": 30,
    "maj_vote_spk_count": false
}
[NeMo I 2024-04-25 15:06:54 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-04-25 15:06:54 clustering_diarizer:309] Split long audio file to avoid CUDA memory issue
splitting manifest: 0% 0/1 [00:00<?, ?it/s]
No language specified, language will be first be detected for each audio file (increases inference time).
100%|█████████████████████████████████████| 16.9M/16.9M [00:01<00:00, 10.1MiB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.7. To apply the upgrade to your files permanently, run python -m pytorch_lightning.utilities.upgrade_checkpoint --file ../../root/.cache/torch/whisperx-vad-segmentation.bin
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.1+cu121. Bad things might happen unless you revert torch to 1.x.
Detected language: ms (0.15) in first 30s of audio...
splitting manifest: 100% 1/1 [00:12<00:00, 12.74s/it]
[NeMo I 2024-04-25 15:07:06 classification_models:273] Perform streaming frame-level VAD
[NeMo I 2024-04-25 15:07:06 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-04-25 15:07:06 collections:446] Dataset loaded with 11 items, total duration of 0.14 hours.
[NeMo I 2024-04-25 15:07:06 collections:448] # 11 files loaded accounting to # 1 labels
vad: 100% 11/11 [00:05<00:00, 1.94it/s]
[NeMo I 2024-04-25 15:07:12 clustering_diarizer:250] Generating predictions with overlapping input segments
generating preds: 0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/whisper-diarization/diarize_parallel.py", line 137, in <module>
    args.batch_size == 0  # TODO: add a better check for word timestamps existence
AssertionError: Unsupported language: ms, use --batch_size to 0 to generate word timestamps using whisper directly and fix this error.
[NeMo I 2024-04-25 15:07:19 clustering_diarizer:262] Converting frame level prediction to speech/no-speech segment in start and end times format.
creating speech segments: 100% 1/1 [00:00<00:00, 2.16it/s]
[NeMo I 2024-04-25 15:07:20 clustering_diarizer:287] Subsegmentation for embedding extraction: scale0, /content/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale0.json
[NeMo I 2024-04-25 15:07:20 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-04-25 15:07:20 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-04-25 15:07:20 collections:446] Dataset loaded with 237 items, total duration of 0.07 hours.
[NeMo I 2024-04-25 15:07:20 collections:448] # 237 files loaded accounting to # 1 labels
[1/5] extract embeddings: 100% 4/4 [00:01<00:00, 3.56it/s]
[NeMo I 2024-04-25 15:07:21 clustering_diarizer:389] Saved embedding files to /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-04-25 15:07:21 clustering_diarizer:287] Subsegmentation for embedding extraction: scale1, /content/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale1.json
[NeMo I 2024-04-25 15:07:21 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-04-25 15:07:21 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-04-25 15:07:21 collections:446] Dataset loaded with 279 items, total duration of 0.08 hours.
[NeMo I 2024-04-25 15:07:21 collections:448] # 279 files loaded accounting to # 1 labels
[2/5] extract embeddings: 100% 5/5 [00:01<00:00, 4.79it/s]
[NeMo I 2024-04-25 15:07:22 clustering_diarizer:389] Saved embedding files to /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-04-25 15:07:22 clustering_diarizer:287] Subsegmentation for embedding extraction: scale2, /content/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale2.json
[NeMo I 2024-04-25 15:07:22 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-04-25 15:07:22 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-04-25 15:07:22 collections:446] Dataset loaded with 345 items, total duration of 0.08 hours.
[NeMo I 2024-04-25 15:07:22 collections:448] # 345 files loaded accounting to # 1 labels
[3/5] extract embeddings: 100% 6/6 [00:00<00:00, 6.60it/s]
[NeMo I 2024-04-25 15:07:23 clustering_diarizer:389] Saved embedding files to /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-04-25 15:07:23 clustering_diarizer:287] Subsegmentation for embedding extraction: scale3, /content/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale3.json
[NeMo I 2024-04-25 15:07:23 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-04-25 15:07:23 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-04-25 15:07:23 collections:446] Dataset loaded with 457 items, total duration of 0.09 hours.
[NeMo I 2024-04-25 15:07:23 collections:448] # 457 files loaded accounting to # 1 labels
[4/5] extract embeddings: 100% 8/8 [00:00<00:00, 8.19it/s]
[NeMo I 2024-04-25 15:07:24 clustering_diarizer:389] Saved embedding files to /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings
[NeMo I 2024-04-25 15:07:24 clustering_diarizer:287] Subsegmentation for embedding extraction: scale4, /content/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4.json
[NeMo I 2024-04-25 15:07:24 clustering_diarizer:343] Extracting embeddings for Diarization
[NeMo I 2024-04-25 15:07:24 collections:445] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2024-04-25 15:07:24 collections:446] Dataset loaded with 693 items, total duration of 0.09 hours.
[NeMo I 2024-04-25 15:07:24 collections:448] # 693 files loaded accounting to # 1 labels
[5/5] extract embeddings: 100% 11/11 [00:01<00:00, 8.89it/s]
[NeMo I 2024-04-25 15:07:25 clustering_diarizer:389] Saved embedding files to /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings
clustering: 100% 1/1 [00:01<00:00, 1.00s/it]
[NeMo I 2024-04-25 15:07:26 clustering_diarizer:464] Outputs are saved in /content/whisper-diarization/temp_outputs directory
[NeMo W 2024-04-25 15:07:26 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-04-25 15:07:26 msdd_models:960] Loading embedding pickle file of scale:0 at /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl
[NeMo I 2024-04-25 15:07:26 msdd_models:960] Loading embedding pickle file of scale:1 at /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale1_embeddings.pkl
[NeMo I 2024-04-25 15:07:26 msdd_models:960] Loading embedding pickle file of scale:2 at /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale2_embeddings.pkl
[NeMo I 2024-04-25 15:07:26 msdd_models:960] Loading embedding pickle file of scale:3 at /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale3_embeddings.pkl
[NeMo I 2024-04-25 15:07:26 msdd_models:960] Loading embedding pickle file of scale:4 at /content/whisper-diarization/temp_outputs/speaker_outputs/embeddings/subsegments_scale4_embeddings.pkl
[NeMo I 2024-04-25 15:07:26 msdd_models:938] Loading cluster label file from /content/whisper-diarization/temp_outputs/speaker_outputs/subsegments_scale4_cluster.label
[NeMo I 2024-04-25 15:07:26 collections:761] Filtered duration for loading collection is 0.000000.
[NeMo I 2024-04-25 15:07:26 collections:764] Total 1 session files loaded accounting to # 1 audio clips
100% 1/1 [00:00<00:00, 11.01it/s]
[NeMo I 2024-04-25 15:07:26 msdd_models:1403] [Threshold: 0.7000] [use_clus_as_main=False] [diar_window=50]
[NeMo I 2024-04-25 15:07:26 speaker_utils:93] Number of files to diarize: 1
[NeMo I 2024-04-25 15:07:26 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-04-25 15:07:27 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-04-25 15:07:27 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-04-25 15:07:27 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-04-25 15:07:27 speaker_utils:93] Number of files to diarize: 1
[NeMo W 2024-04-25 15:07:27 der:185] Check if each ground truth RTTMs were present in the provided manifest file. Skipping calculation of Diariazation Error Rate
[NeMo I 2024-04-25 15:07:27 msdd_models:1431]

zip warning: name not matched: /content/Nayem 16.04.2024.srt
zip warning: name not matched: /content/Nayem 16.04.2024.txt

zip error: Nothing to do! (/content/Nayem 16.04.2024.zip)
rm: cannot remove '/content/Nayem 16.04.2024.srt': No such file or directory
rm: cannot remove '/content/Nayem 16.04.2024.txt': No such file or directory

ScriptTiger commented 6 months ago

Detected language: ms (0.15) in first 30s of audio...

The language being detected and used is Malay, not Bengali. You will need to force it to use Bengali.

Try this line instead:

!python diarize_parallel.py --whisper-model large-v3 --language bn --batch_size 0 -a "$audioFile"

MHA11 commented 6 months ago

I started the process from the beginning again. Same thing, except this time the Step 3 output is much shorter. [Screenshots]

ScriptTiger commented 6 months ago

@MHA11 Can you change the permissions on the audio file you linked so "anyone with the link" has "view" permissions?

MHA11 commented 6 months ago

Done.

ScriptTiger commented 6 months ago

Now, I don't know anything about programming, so please keep that in mind. I've been at this for 2 days. On Reddit, a kind gentleman recommended changing this line:

!python diarize_parallel.py --whisper-model large-v3 -a "$audioFile"

To this line:

!python diarize_parallel.py --whisper-model large-v3 --batch_size 0 -a "$audioFile"

@MHA11 My apologies, I didn't catch it before, but "--batch_size" should actually be "--batch-size": a hyphen ("-"), not an underscore ("_"). Sorry about that. I just tested it and it works fine.

!python diarize_parallel.py --whisper-model large-v3 --language bn --batch-size 0 -a "$audioFile"
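
For anyone curious why that single character matters, here is a minimal sketch of how this kind of flag typically behaves (a hypothetical argparse parser for illustration, not the project's actual code):

import argparse

# The option string is registered with a hyphen on the command line...
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=8)

# ...but argparse exposes it in Python as args.batch_size, which is why
# the underscore form shows up inside scripts and error messages.
args = parser.parse_args(["--batch-size", "0"])
print(args.batch_size)  # prints: 0

# parser.parse_args(["--batch_size", "0"]) would exit with
# "error: unrecognized arguments: --batch_size 0"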

MHA11 commented 6 months ago

No need to apologize. Thank you, it finally worked! A few points:

The subtitles are not distributed like the other Whisper method. It shows 00:00:01,020 --> 00:08:36,370 and then just a wall of text.

The voice-to-text recognition is still bad. Not catching a certain sound is perfectly understandable, but I feel like it's just making things up. A man in a blue shirt was never mentioned. Although everything is wrong, it is, in its own disturbing way... coherent. Not that it made sense, but it was following a theme. In my last attempt, using an enhanced file, the conversation was so dark I felt I was reading a thriller novel.

After finishing the original file, I made a 2nd attempt with a 'cleaner' file: I removed as much extra noise as possible (pots and pans, hmms, etc.). The result was mostly the same.

After that I tried Adobe's Enhance Speech tool. You get an hour/day for free. The result of the enhancement was mediocre, and the Whisper result was even worse.

Is this the end of the line? Are there any settings or something I can do to get better results? Or is it, like you said, a niche language the AI isn't trained on? I'm not expecting stellar results, but this was just sad. I wish it were close enough to make the file searchable.

ScriptTiger commented 6 months ago

The subtitles are not distributed like the other Whisper method. It shows 00:00:01,020 --> 00:08:36,370 and then just a wall of text.

There is some kind of issue where it can't accurately timestamp Bengali; that's why you need the "batch-size" argument to basically disable timestamps to get it to work at all. The model just isn't trained well enough to accurately detect the beginning and end of certain words and phrases. This may or may not be due to the way Bengali conjugates words and changes base words into other forms, but I can't say for sure since I don't personally know Bengali.
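
For comparison, a healthy SRT file spreads the transcript across many short, individually timed cues, roughly like this (timings and text invented purely for illustration):

1
00:00:01,020 --> 00:00:04,500
First short line of dialogue.

2
00:00:04,600 --> 00:00:08,250
Second short line of dialogue.

With word-level timestamps disabled, everything collapses into one cue spanning the whole file, which is the wall of text you're seeing.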

The voice-to-text recognition is still bad. Not catching a certain sound is perfectly understandable, but I feel like it's just making things up. A man in a blue shirt was never mentioned. Although everything is wrong, it is, in its own disturbing way... coherent. Not that it made sense, but it was following a theme. In my last attempt, using an enhanced file, the conversation was so dark I felt I was reading a thriller novel.

This is actually a common problem with all generative AI, known as "hallucination": if the model can't fully interpret the data, it fills in the gaps with whatever it thinks is the closest thing that might make sense, based on what it learned during training. In my work as an audio engineer, these hallucinations take the form of random audio "artifacts," completely random sounds that were not in the original audio. You may have run into such artifacts when you tried Adobe Podcast Enhance Speech.

After finishing the original file, I made a 2nd attempt with a 'cleaner' file: I removed as much extra noise as possible (pots and pans, hmms, etc.). The result was mostly the same.

After that I tried Adobe's Enhance Speech tool. You get an hour/day for free. The result of the enhancement was mediocre, and the Whisper result was even worse.

These extra steps were mostly unnecessary. The project already runs every file through Demucs first, an audio source-separation tool that isolates the vocals; that isolated vocal track, not the original audio file, is what actually gets transcribed. However, Demucs sometimes fails, in which case the original audio is used instead.
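
If you want to hear that isolated vocal track for yourself, Demucs can also be run standalone. A rough invocation (the --two-stems flag is standard Demucs CLI; the exact output path depends on the model used):

!python -m demucs --two-stems=vocals "/content/Nayem 16.04.2024.mp3"

That writes a vocals track and a no_vocals track under a separated/ directory, so you can judge the separation quality before blaming the transcription step.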

Is this the end of the line? Are there any settings or something I can do to get better results? Or is it, like you said, a niche language the AI isn't trained on? I'm not expecting stellar results, but this was just sad. I wish it were close enough to make the file searchable.

I can't speak for every AI transcription tool out there, but there's nothing more this particular project can offer you. Unless you train a new model on Bengali yourself, this is pretty much the end of the line for this particular model. Services like Google Translate use their own proprietary models internally, so you will get different results from them, and the same goes for other proprietary services. So you could try shopping around for services that offer models fine-tuned specifically for Bengali.
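
If you do want to experiment beyond this notebook, one low-effort option is to run a Whisper checkpoint directly through the Hugging Face transformers pipeline with the language pinned to Bengali. A minimal sketch (generic Hugging Face usage, not part of this project; the model name and file name are placeholders you can swap for any Bengali fine-tune on the Hub):

from transformers import pipeline

# Plain Whisper large-v3, no diarization; forcing "bengali" avoids the
# misdetection (Malay) seen earlier in this thread.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0,  # GPU; use device=-1 for CPU
)
result = asr(
    "audio.mp3",  # placeholder file name
    return_timestamps=True,
    generate_kwargs={"language": "bengali", "task": "transcribe"},
)
print(result["text"])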

MHA11 commented 6 months ago

Thank you, ScriptTiger. One last question: do you have any experience with Meta's Massively Multilingual Speech (MMS) model? The claim is that it's way better than Whisper, but knowing Meta... I tried using it once and will do so again later tonight.

ScriptTiger commented 6 months ago

I haven't done anything with MMS yet, although I may add a project for it later. Every project is more advanced in some areas and less advanced in others, and each has its own unique limitations, such as the duration of audio it can handle. My projects mostly came out of a demand for specific formatting needs, improved punctuation, good accuracy, diarization, and so on, which is why they have a lot of things bolted on to provide for those. With MMS, I'd have to rebuild a lot of that functionality from the ground up. But I may still add it later, if only for comparison's sake; and if it really does turn out to be better, that's good too. I'm just not in a rush, since there isn't much demand and my current stuff is working pretty well for most people. This is all mostly a hobby project outside of work, so there are time and resource restrictions on the development side as well.

ScriptTiger commented 6 months ago

@MHA11 Just a quick update: I checked out MMS briefly, and the results seem worse than Whisper, or at least than the modified Whisper we use (WhisperX plus some other things) with the large-v3 model. It seems to have trouble with tenses/conjugations, at least in the English results I saw. I only saw a very short sample, but there were a lot of mistakes our Whisper would just never make. I'll do more testing later when I have time and see whether it's at all promising to include here.

ScriptTiger commented 6 months ago

@MHA11 Just one final follow-up on this: I won't be pursuing MMS. There are several ways to run it, with varying accuracy. Running it the "fast" way is terrible and unusable; running it with all the bells and whistles is better, but still not as accurate as the Colab here. So there's no reason for me to go any further with it at the moment, but I'll check back later and see if they improve it.