abdeladim-s / subsai

🎞️ Subtitles generation tool (Web-UI + CLI + Python package) powered by OpenAI's Whisper and its variants 🎞️
https://abdeladim-s.github.io/subsai/
GNU General Public License v3.0

[bug] whisperX word segmentation fails: KeyError: 'start' #53

Closed: ProducerMatt closed this issue 1 year ago

ProducerMatt commented 1 year ago

Running the latest version from the docker image.

Version: 1.1.1
===================================

[-] Model name: m-bain/whisperX
[-] Model configs: {'model_type': 'large-v2', 'segment_type': 'word', 'language': 'en', 'device': 'cpu'}
---
[+] Initializing the model
[2023-07-19 14:10:15.381] [ctranslate2] [thread 7] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
[14:10:15] INFO     Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.0.5. To apply the upgrade to your files      utils.py:128
                    permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint --file
                    ../root/.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 2.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1. Bad things might happen unless you revert torch to 1.x.
[+] Processing file: /media_files/Marble Hornets Season 1.mp4
Traceback (most recent call last):
  File "/opt/conda/bin/subsai", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/subsai/cli.py", line 143, in main
    run(media_file_arg=args.media_file,
  File "/opt/conda/lib/python3.10/site-packages/subsai/cli.py", line 87, in run
    subs = subs_ai.transcribe(file, model)
  File "/opt/conda/lib/python3.10/site-packages/subsai/main.py", line 114, in transcribe
    return stt_model.transcribe(media_file)
  File "/opt/conda/lib/python3.10/site-packages/subsai/models/whisperX_model.py", line 146, in transcribe
    event = SSAEvent(start=pysubs2.make_time(s=word["start"]), end=pysubs2.make_time(s=word["end"]))
KeyError: 'start'

It ran for the expected length of time, so it probably finished the encoding and died at the end.

abdeladim-s commented 1 year ago

@ProducerMatt, it seems the resulting segments do not have timings. Was it a silent video, maybe? Could you please provide the media file you are using so I can debug the issue? The media files I am using for testing on my end work without any problem.

ProducerMatt commented 1 year ago

@abdeladim-s Thanks for the response.

The media isn't silent, and I got a great srt out of it when I used sentence splitting. You can find the media file here under Marble Hornets Season 1.mp4

This bug occurs regardless of model size, so for testing you can use tiny.en.

abdeladim-s commented 1 year ago

Thanks @ProducerMatt for providing the file. The problem is that the file is 90 minutes long, so the transcription will never finish on my decent machine :sweat_smile: I have tested some random short segments and they seem fine.

For now, I catch the error, and you will get a warning whenever the program reaches such a word. Please rebuild the docker image with the new update and test it on your end.

You will now get the resulting srt file. Just let me know which fragment of the media file causes this error, so we can investigate why!
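For readers following along, the workaround described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in whisperX_model.py: the function name `words_to_events` and the tuple output are hypothetical, and only the word-dict shape (`word`, `start`, `end`) and the warning text are taken from the traceback and logs in this thread.

```python
import logging

def words_to_events(words):
    """Convert word dicts to (start_ms, end_ms, text) tuples.

    Words missing a 'start' or 'end' key are skipped with a warning
    instead of raising KeyError. (Hypothetical helper for illustration.)
    """
    events = []
    for word in words:
        try:
            start_ms = int(round(word["start"] * 1000))
            end_ms = int(round(word["end"] * 1000))
        except KeyError as e:
            # Mirrors the two warnings shown in the logs of this thread.
            logging.warning("Something wrong with %s", word)
            logging.warning("%s", e)
            continue
        events.append((start_ms, end_ms, word["word"]))
    return events

# Made-up sample data: the middle word has no timings.
sample = [
    {"word": "entry", "start": 1.0, "end": 1.4},
    {"word": "20"},
    {"word": "begins", "start": 2.0, "end": 2.5},
]
events = words_to_events(sample)
```

With this approach the untimed word is simply dropped from the output, so the subtitle file is still written but may be missing a few words.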

ProducerMatt commented 1 year ago

You rock, thanks so much for the quick support. Here's what happened when I ran it:

[+] Processing file: /media_files/Marble Hornets Season 1.mp4
[22:04:14] WARNING  Something wrong with {'word': '20'}                                                                           whisperX_model.py:151
           WARNING  'start'                                                                                                       whisperX_model.py:152
           WARNING  Something wrong with {'word': '15'}                                                                           whisperX_model.py:151
           WARNING  'start'                                                                                                       whisperX_model.py:152
           WARNING  Something wrong with {'word': '20'}                                                                           whisperX_model.py:151
           WARNING  'start'                                                                                                       whisperX_model.py:152
           WARNING  Something wrong with {'word': '10.'}                                                                          whisperX_model.py:151
           WARNING  'start'                                                                                                       whisperX_model.py:152
           WARNING  Something wrong with {'word': '12'}                                                                           whisperX_model.py:151
           WARNING  'start'                                                                                                       whisperX_model.py:152
[+] Subtitles file saved to: /media_files/Marble Hornets Season 1.srt

Here's a .tar.gz of the .srt, which looks completely fine to me. https://clbin.com/Zmxv05

EDIT: sorry, the uploaded archive is truncated. If you can't open it, I'll have to figure out some other way to host it. I could also send it with croc if you have that.

abdeladim-s commented 1 year ago

It's OK, the srt file won't give much detail about the cause of the problem. I should have printed the list to see where those "bad words" are located, so we could extract those segments. But from the warnings, it seems whisperX is generating words without timings (probably the words consisting only of numbers), which I think is a bug on their end!

Anyway, I think the bug is handled gracefully for now. Let me know if you find any other issues or need help with anything else :)
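The idea of locating the "bad words" could look something like the sketch below. It assumes the aligned words are available as a flat list of dicts shaped like the one in the warnings (`{'word': '20'}`); the helper name `find_untimed_words` and the report fields are hypothetical, not part of subsai or whisperX.

```python
def find_untimed_words(words, context=1):
    """Report each word lacking timings, with the nearest known
    timestamps among its neighbours, so the audio fragment can be found.
    (Hypothetical helper for illustration.)
    """
    reports = []
    for i, word in enumerate(words):
        if "start" in word and "end" in word:
            continue
        before = words[max(0, i - context):i]
        after = words[i + 1:i + 1 + context]
        # Closest timed neighbour on each side within the context window.
        prev_end = next((w["end"] for w in reversed(before) if "end" in w), None)
        next_start = next((w["start"] for w in after if "start" in w), None)
        reports.append({
            "index": i,
            "word": word["word"],
            "after_s": prev_end,     # the word occurs after this time
            "before_s": next_start,  # and before this time
        })
    return reports

# Made-up sample data for illustration.
sample = [
    {"word": "entry", "start": 1.0, "end": 1.4},
    {"word": "20"},
    {"word": "begins", "start": 2.0, "end": 2.5},
]
reports = find_untimed_words(sample)
```

Each report brackets the untimed word between two timestamps, which is enough to cut the corresponding fragment out of a long media file.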

ProducerMatt commented 1 year ago

Thanks for your help! This is probably the issue: https://github.com/m-bain/whisperX/issues/349

abdeladim-s commented 1 year ago

Yes, you are right, it is the same issue. Do you think it is worth using the proposed solutions to fix the issue, or should we just leave it as it is until the WhisperX maintainers fix it?
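The proposals in the linked issue are not reproduced here. As an illustration of what a downstream fix might look like, one possible workaround is to fill in missing timings from the neighbouring words. The sketch below is an assumption, not the whisperX fix: `fill_missing_timings` is a hypothetical helper that spreads each run of untimed words evenly across the gap between the previous word's end and the next word's start.

```python
def fill_missing_timings(words):
    """Return a copy of `words` where untimed words get timings
    interpolated from their timed neighbours. (Hypothetical helper.)
    """
    def timed(w):
        return "start" in w and "end" in w

    filled = [dict(w) for w in words]  # do not mutate the input
    n = len(filled)
    i = 0
    while i < n:
        if timed(filled[i]):
            i += 1
            continue
        # Find the end of this run of untimed words.
        j = i
        while j < n and not timed(filled[j]):
            j += 1
        gap_start = filled[i - 1]["end"] if i > 0 else 0.0
        # With no timed word afterwards, the run collapses to zero length.
        gap_end = filled[j]["start"] if j < n else gap_start
        step = (gap_end - gap_start) / (j - i)
        for k in range(i, j):
            filled[k]["start"] = gap_start + (k - i) * step
            filled[k]["end"] = gap_start + (k - i + 1) * step
        i = j
    return filled

# Made-up sample data: two untimed words between 1.0 s and 2.0 s.
sample = [
    {"word": "entry", "start": 0.5, "end": 1.0},
    {"word": "20"},
    {"word": "15"},
    {"word": "begins", "start": 2.0, "end": 2.5},
]
filled = fill_missing_timings(sample)
```

The interpolated times are only estimates, so a real fix would probably still be better made upstream, where the alignment information is available.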

ProducerMatt commented 1 year ago

@abdeladim-s Sorry for not responding.

Implementing the fix would be nice, but it's maintainer's choice since they may put one in any day. 🙂

Also, I've tried the feature, and I'm not clamoring to use it right now, haha. It's literally one word per subtitle, which makes it hard to follow. If it filled in the subtitle as it was spoken, like YouTube, or highlighted each word as it was spoken, I would have found it useful. I think another backend has that.
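The "filling out the subtitle as it is spoken" behaviour described above could be built from word-level timings roughly as sketched below. This is not a subsai feature: `progressive_events` and `max_words` are hypothetical names, and the sketch assumes every word already has `start` and `end` values.

```python
def progressive_events(words, max_words=7):
    """Build (start, end, text) events where each line grows one word
    at a time, instead of showing a single word per subtitle.
    (Hypothetical helper for illustration.)
    """
    events = []
    for line_start in range(0, len(words), max_words):
        line = words[line_start:line_start + max_words]
        for i, word in enumerate(line):
            # Show the partial line until the next word in the line starts;
            # the last word of the line is shown until its own end time.
            end = line[i + 1]["start"] if i + 1 < len(line) else word["end"]
            text = " ".join(w["word"] for w in line[:i + 1])
            events.append((word["start"], end, text))
    return events

# Made-up sample data for illustration.
sample = [
    {"word": "Entry", "start": 0.0, "end": 0.4},
    {"word": "number", "start": 0.5, "end": 0.9},
    {"word": "one", "start": 1.0, "end": 1.3},
]
events = progressive_events(sample, max_words=2)
```

Each event replaces the previous one on screen, so the viewer sees the line build up word by word and then reset when a new line begins.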

abdeladim-s commented 1 year ago

No problem @ProducerMatt,

Yes, you are right :)
In that case I will leave it as is and wait for the maintainers to fix it in their next update.