@foolishgrunt, the command works without any issue on my end. So if you are sure your HF token is correct and you've accepted the user terms of `pyannote`, I would suggest reinstalling the project, preferably in a new venv, to avoid any cached dependencies.
Wiped the old venv, reinstalled, and I receive the same error. I've double- and triple-checked that my token is correct, and that I've accepted the terms for `segmentation-3.0` and `speaker-diarization-3.1` (as instructed in the whisperX readme).
At this point I have to believe there's some other required `pyannote` library that I haven't yet accepted the terms for, but which one I couldn't say. Am I missing something beyond the two mentioned above?
In that case, let's debug it together. Let's start with the file: can you send it so I can test it on my end? Or can you try with another file and see if you get the same problem?
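Also, just to rule out the token side: a quick check like the following should succeed without a gated-repo error if your token really has access (a sketch using `huggingface_hub`; substitute your real token):

```python
from huggingface_hub import hf_hub_download

# Both downloads should succeed if the token has accepted the gated terms;
# a 401/403 error here would point to a token/terms problem rather than subsai.
for repo in ("pyannote/segmentation-3.0", "pyannote/speaker-diarization-3.1"):
    hf_hub_download(repo, "config.yaml", token="hf_...")  # use your real token
```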
Thanks for taking a look at this. The audio file in question is pretty long: it's a 2.5-hour recording of a city council meeting. Just now it occurred to me that the length might be part of the issue, so I grabbed a second, much shorter file and ran it with the same parameters as before:
```bash
subsai audio2.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en", "speaker_labels": "True", "HF_TOKEN": "[token]"}' --format srt
```
The process finishes, and I get my `.srt` file. So... might the diarization model have a length limit that the first file exceeds?
However, I don't see any special formatting in the second file: nothing like what I would expect after having set `speaker_labels` to `True`. In fact, the formatting seems to be exactly the same as in the `.srt` file I got from the first audio file when I ran with no `speaker_labels`:

```bash
subsai audio.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en"}' --format srt
```
Am I missing something?
EDIT: I answered the second question on my own: I re-exported as `.ass` instead of `.srt`, and lo and behold, the file has speaker labels.
Yes, you need the `.ass` format to see the speaker labels.
The shorter file worked without any issues on my end as well. However, the longer file didn't even finish processing and got stuck in a loop, so maybe, as you said, there is some issue in the diarization pipeline when processing long files.
My suggestion now is to try with the whisperX package directly and see if that gives any results.
Otherwise, I would suggest cutting your long file into smaller pieces (you'll need, though, to find the maximum length that can be processed without issues) and running the command on the folder containing all the files to batch process them; see the sketch below.
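For the splitting itself, something like ffmpeg's segment muxer should do (an untested sketch; tune `-segment_time` to whatever turns out to be the safe maximum):

```bash
# Split audio.m4a into 30-minute chunks without re-encoding.
# For audio streams every frame is a keyframe, so cuts land
# very close to the requested time.
mkdir -p chunks
ffmpeg -i audio.m4a -f segment -segment_time 1800 -c copy chunks/audio_%03d.m4a
```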
Another approach is to use something similar to what I did in this example (VAD example: process long audio files using silero-vad). You can take a look at the colab notebook to see how it works.
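The core of that approach is just detecting speech regions first and processing them piece by piece. A minimal sketch of the silero-vad part, assuming a 16 kHz wav (the notebook shows the full wiring):

```python
import torch

# Load Silero VAD and its helper utilities from torch.hub
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio('audio.wav', sampling_rate=16000)

# Each entry is {'start': sample_index, 'end': sample_index}; these regions
# can then be cut out and transcribed one by one instead of the whole file.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps[:3])
```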
I hope this helps!
Thanks for the tips. As a quick test, I ran the longer file in one of the demo instances linked from the whisperX main page, and it also failed with speaker labels selected. So since it now seems pretty clear that this is not a subsai bug, I'll close this now. Many thanks @abdeladim-s!
But on a final note: since I learned that the speaker labels option requires selecting `.ass` output, the diarization feature is suddenly less appealing to me; my goal is output that is easily human-readable in its raw form, and for that I think `.srt` is far superior. Since I was probably going to proofread the output anyway, just to have confidence in the results, I think I'll just have to live with adding the speaker labels myself at the same time.
Thanks @foolishgrunt for the quick test. At least we are sure now it's not a bug in the project. I hope the whisperX authors will fix that soon.
Regarding the speaker labels, the problem lies in the `.srt` format itself: it's very basic and, unlike the `.ass` format, does not support labeling subtitles. However, I think I can make a workaround for this: we can just add the speaker label in front of the subtitle text, for example
```
1
start --> end
Speaker_01: Lorem ipsum dolor sit amet, consectetur adipiscing elit,
```
Or
```
1
start --> end
(Speaker_01)
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
```
I think this way you'll see the speaker labels in the `.srt` file as well. What do you think? Which format do you think is better?
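For reference, the workaround could be as simple as something like this (a sketch using `pysubs2`, assuming the speaker name lands in each event's Actor/`name` field, which is how `.ass` usually carries it):

```python
import pysubs2

# Load the .ass output (which keeps the speaker names) and prepend each
# event's Actor field to its text, then save as plain .srt.
subs = pysubs2.load('audio2.ass')
for event in subs:
    if event.name:
        event.text = f"{event.name}: {event.text}"
subs.save('audio2.srt')
```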
Excellent idea! Your first proposal is exactly the format I was planning to do by hand.
Perfect. You can now see the speaker labels alongside the subtitles in `.srt` files. I've tested with the short audio you provided and it looks like this: audio2.srt
Could you update the package to the latest version and give it a try?
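If you installed straight from the repo, the upgrade should be something like this (assuming the usual pip-from-git workflow and that the project lives at github.com/abdeladim-s/subsai):

```bash
# Reinstall the latest version from the repository
pip install --upgrade git+https://github.com/abdeladim-s/subsai.git
```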
Done, tested with the same file - looks good here!
As for my long file, I've decided to split it into progressively smaller pieces until I find something short enough that it doesn't error out. Hopefully I find that the limit is still long enough that splitting it into batches doesn't become unworkable.
EDIT: The process just completed for a 53-minute segment, so it looks like ~1 hour is the limit for diarization.
Sounds good! I know, cutting the files is not practical, but at least 1 hour is not that bad either! I hope they fix that soon.
Anyway, let me know if you find any other issues.
I'm not 100% confident this is a bug rather than user error, but I've dug through all the relevant documentation I could find and haven't turned up any clues. I've accepted the user terms at https://huggingface.co/pyannote/segmentation-3.0 and https://huggingface.co/pyannote/speaker-diarization-3.1, and I'm passing my access token, so I don't know why it returns this error.
```bash
subsai audio.m4a --model m-bain/whisperX --model-configs '{"model_type": "base.en", "speaker_labels": "True", "HF_TOKEN": "[token]"}' --format srt
```
If I run the same command without the `"speaker_labels": "True"` argument, then I get a nicely formatted `.srt` file. But whenever I get greedy and try to label the speakers, this is the output I get: