whisperX with diarization - KeyError: 'speaker'

foolishgrunt commented 8 months ago

I'm not 100% confident this is a bug rather than a user error, but I've dug through all the relevant documentation I can find and can't find any clues. I've accepted the user terms at https://huggingface.co/pyannote/segmentation-3.0 and https://huggingface.co/pyannote/speaker-diarization-3.1, and I'm passing my access token, so I don't know why it returns this error.

subsai audio.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en", "speaker_labels": "True", "HF_TOKEN": "[token]"}' --format srt

If I run the same command without the "speaker_labels": "True" argument, I get a nicely formatted .srt file. But whenever I get greedy and try to label the speakers, this is the output I get:

>>WARNING  torchvision is not available - cannot   train_logger.py:264
                    save figures                                               
[Subs AI ASCII-art banner]
Subs AI: Subtitles generation tool powered by OpenAI's Whisper and its variants.
Version: 1.2.5
===================================
[-] Model name: m-bain/whisperX
[-] Model configs: {'model_type': 'base.en', 'speaker_labels': 'True', 'HF_TOKEN': '[token]'}
---
[+] Initializing the model
[2024-03-02 21:50:49.987] [ctranslate2] [thread 97942] [warning] The compute type inferred from the saved model is float16, but the target device or backend do not support efficient float16 computation. The model weights have been automatically converted to use the float32 compute type instead.
[21:50:50] INFO     Lightning automatically upgraded your loaded   utils.py:154
                    checkpoint from v1.5.4 to v2.2.0.post0. To                 
                    apply the upgrade to your files permanently,               
                    run `python -m                                             
                    pytorch_lightning.utilities.upgrade_checkpoint             
                    .cache/torch/whisperx-vad-segmentation.bin`                
           INFO     Created a temporary directory at         instantiator.py:21
                    /tmp/tmp0025icqf                                           
           INFO     Writing                                  instantiator.py:76
                    /tmp/tmp0025icqf/_remote_module_non_scri                   
                    ptable.py                                                  
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.0.1+cu117. Bad things might happen unless you revert torch to 1.x.
b'[+] Processing file: /home/andy/audio.m4a'
Traceback (most recent call last):
  File "/home/andy/.local/bin/subsai", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/andy/.local/pipx/venvs/subsai/lib/python3.11/site-packages/subsai/cli.py", line 149, in main
    run(media_file_arg=args.media_file,
  File "/home/andy/.local/pipx/venvs/subsai/lib/python3.11/site-packages/subsai/cli.py", line 88, in run
    subs = subs_ai.transcribe(file, model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andy/.local/pipx/venvs/subsai/lib/python3.11/site-packages/subsai/main.py", line 115, in transcribe
    return stt_model.transcribe(media_file)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/andy/.local/pipx/venvs/subsai/lib/python3.11/site-packages/subsai/models/whisperX_model.py", line 161, in transcribe
    name=segment["speaker"] if self.speaker_labels else "")
         ~~~~~~~^^^^^^^^^^^
KeyError: 'speaker'
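
The traceback points at an unconditional `segment["speaker"]` lookup. As a minimal sketch of how such a lookup could be made tolerant, assuming segments are plain dicts (as the traceback suggests) and that a segment the diarizer could not attribute simply lacks the key: `speaker_name` below is a hypothetical helper for illustration, not code from subsai or whisperX.

```python
def speaker_name(segment: dict, speaker_labels: bool, default: str = "") -> str:
    """Return the speaker label of a segment, or `default` when it is absent."""
    if not speaker_labels:
        return default
    # dict.get avoids a KeyError when diarization produced no 'speaker' entry
    return segment.get("speaker", default)


# Made-up segments for illustration: the second one has no speaker assigned.
segments = [
    {"text": "Call to order.", "speaker": "SPEAKER_00"},
    {"text": "(inaudible)"},
]
names = [speaker_name(s, speaker_labels=True) for s in segments]
```

This only hides the symptom: a segment with no speaker would get an empty label instead of crashing, which may or may not be what you want.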

abdeladim-s commented 8 months ago

@foolishgrunt, the command works without any issue on my end. So if you are sure your HF token is correct and you've accepted the pyannote user terms, I would suggest reinstalling the project, preferably in a new venv to avoid any cached dependencies.

foolishgrunt commented 8 months ago

Wiped the old venv, reinstalled, and got the same error. I've double- and triple-checked that my token is correct, and that I've accepted the terms for segmentation-3.0 and speaker-diarization-3.1 (as instructed in the whisperX readme).

At this point I have to believe there's some other required pyannote model that I haven't yet accepted the terms for, but which one I couldn't say. Am I missing something beyond the two mentioned above?

abdeladim-s commented 8 months ago

In that case, let's debug it together. Let's start with the file: can you send it so I can test it on my end? Or can you try another file and see if you get the same problem?

foolishgrunt commented 8 months ago

Thanks for taking a look at this. The audio file in question is pretty long - it's a recording of a city council meeting, and it's 2.5 hours long. Just now it occurred to me that the length might be part of the issue, so I got a second, much shorter file and ran it with the same parameters as before:

subsai audio2.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en", "speaker_labels": "True", "HF_TOKEN": "[token]"}' --format srt

The process finishes, and I get my .srt file. So... might the diarization model have a length limit that the first file violates?

However, I don't see any special formatting in the second file: nothing like what I would expect after having set "speaker_labels" to "True." In fact, the formatting seems to be exactly the same as in the .srt file I got from the first audio file when I ran with no speaker_labels:

subsai audio.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en"}' --format srt

Am I missing something?

EDIT: I answered the second question on my own: I re-exported as .ass instead of .srt, and lo and behold, the file has speaker labels.

abdeladim-s commented 8 months ago

Yes, you need .ass format to see the speaker labels.

The shorter file worked without any issues on my end as well. However, the longer file didn't even finish processing and got stuck in a loop, so maybe, as you said, there is some issue in the diarization pipeline when processing long files.

My suggestion now is to try with the whisperX package and see if that gives any results.

Otherwise, I would suggest cutting your long file into smaller pieces (though you'll need to find the maximum length that can be processed without issues) and running the command on the folder containing all the files to batch process them.

Another approach is to use something similar to what I did in this example:

VAD example: process long audio files using silero-vad.

You can take a look at the colab notebook to see how it works.

I hope this helps!

foolishgrunt commented 8 months ago

Thanks for the tips. As a quick test, I ran the longer file in one of the demo instances linked to from the whisperX main page, and it also failed with speaker labels selected. So since it now seems pretty clear that this is not a subsai bug, I'll close this now. Many thanks @abdeladim-s!

But on a final note, since I learned that the speaker labels option requires selecting .ass output, the diarization feature suddenly is less appealing to me; my goal is output that is easily human-readable in its raw form, and for this I think .srt is far superior. So since I probably was going to proofread the output anyway, just to have confidence in the results, I think I'll just have to live with adding speaker labels myself at the same time.

abdeladim-s commented 8 months ago

Thanks @foolishgrunt for the quick test. At least we are sure now it's not a bug in the project. I hope the WhisperX authors will fix that soon.

Regarding the speaker labels, the problem lies in the .srt format itself: it's very basic and, unlike the .ass format, does not support labeling subtitles. However, I think I can make a workaround for this: we can just add the speaker label in front of the subtitle, for example:

1
start -> end
Speaker_01: Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Or

1
start -> end
(Speaker_01)
Lorem ipsum dolor sit amet, consectetur adipiscing elit,

I think this way you'll see the speaker labels in the srt file as well. What do you think? Which format do you think is better?
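
The two layouts proposed above can be sketched as a small formatter. This is only an illustration under stated assumptions, not the code that ended up in subsai: `srt_timestamp` and `format_cue` are hypothetical names, and the timestamps use the usual SRT `HH:MM:SS,mmm --> HH:MM:SS,mmm` form rather than the shorthand `start -> end` written above.

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def format_cue(index: int, start: float, end: float, text: str,
               speaker: str = "", inline: bool = True) -> str:
    """Build one SRT cue, optionally carrying a speaker label."""
    if speaker and inline:
        body = f"{speaker}: {text}"      # first proposal: label on the same line
    elif speaker:
        body = f"({speaker})\n{text}"    # second proposal: label on its own line
    else:
        body = text
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{body}\n"


cue = format_cue(1, 0.0, 3.5, "Lorem ipsum dolor sit amet,", speaker="Speaker_01")
print(cue)
```

Because the label is just part of the cue text, any player or editor that reads plain .srt will display it without special support.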

foolishgrunt commented 8 months ago

Excellent idea! Your first proposal is exactly the format I was planning to do by hand.

abdeladim-s commented 8 months ago

Perfect. You can now see the speaker labels along the subtitles in srt files. I've tested with your provided short audio and it looks like this: audio2.srt

Please update the package to the latest version and give it a try.

foolishgrunt commented 8 months ago

Done, tested with the same file - looks good here!

As for my long file, I've decided to split it into progressively smaller pieces until I find something short enough that it doesn't error out. Hopefully I find that the limit is still long enough that splitting it into batches doesn't become unworkable.

EDIT: The process just completed for a 53-minute segment, so it looks like ~1 hour is the limit for diarization.
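
For anyone who wants to script the splitting, here is a rough sketch that computes fixed-length chunk boundaries and builds an ffmpeg command for each one. The 50-minute chunk length is an assumption chosen to stay under the 53-minute segment that worked, not a documented limit; the function names and the `part_NNN.m4a` output names are made up for illustration, and the commands are only printed, not executed.

```python
def chunk_boundaries(total_seconds: float, max_chunk_seconds: float) -> list:
    """Split [0, total_seconds] into consecutive (start, end) pairs."""
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def ffmpeg_cut_command(src: str, dst: str, start: float, end: float) -> list:
    """Build (but do not run) an ffmpeg call that copies one time range."""
    # -ss/-to select the range; -c copy avoids re-encoding the audio
    return ["ffmpeg", "-i", src, "-ss", str(start), "-to", str(end), "-c", "copy", dst]


# A 2.5-hour recording (9000 s) cut into 50-minute (3000 s) pieces.
chunks = chunk_boundaries(9000, 3000)
commands = [
    ffmpeg_cut_command("audio.m4a", f"part_{i:03d}.m4a", start, end)
    for i, (start, end) in enumerate(chunks)
]
for cmd in commands:
    print(" ".join(cmd))
```

Two caveats: fixed cuts can land mid-sentence, and speaker labels are likely assigned independently per chunk, so the same label in two pieces does not necessarily refer to the same person.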

abdeladim-s commented 8 months ago

Sounds good! I know, cutting the files is not practical, but at least 1 hour is not that bad either! I hope they will fix that soon.

Anyways, let me know if you find any other issues.

abdeladim-s / subsai

whisperX with diarization - KeyError: 'speaker' #117