bilo1967 opened this issue 1 year ago
@bilo1967 maybe swaying the topic a little bit, but how did you encode your audio file to make the diarization work? I tried to encode my audio into 16 kHz stereo .wav with ffmpeg, but I got only ? everywhere and no speaker number.
I didn't actually do anything special. I had some MP3 files from a BBC radio programme. I converted them with FFmpeg and fed them to 'main', which showed an attempt at diarisation on stdout, with question marks but also several nice 'speaker 0' and 'speaker 1' labels. I also tried it on a file with three speakers, one speaking Italian, another Spanish and one Portuguese. Although there were many mistakes, there were not so many "?". I don't know whether it could depend on the model you're using; I used "medium" for both.
@bilo1967 thanks for the reply. That's strange, because I get many question marks in most cases. What I did was download a YT video and extract the audio from it with yt-dlp:

```
yt-dlp -xv --audio-format wav -o $SAMPLES/$audio_name.wav $YT_URL
```

then transform it to 16 kHz with ffmpeg:

```
ffmpeg -i $SAMPLES/$audio_name.wav -acodec pcm_s16le -ar 16000 $SAMPLES/$audio16khz_name.wav
```

Did you do the same?
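For what it's worth, a minimal sketch of that conversion with the channel count made explicit (`input.wav`/`output.wav` are placeholder names, not from the thread). Note that this only helps if the source really has two distinct channels; see the next comment:

```
# Resample to 16 kHz, 16-bit PCM, keeping two channels explicitly.
# Caveat: on a mono source, -ac 2 just duplicates the single channel
# into two identical ones, which gives diarization nothing to work with.
ffmpeg -i input.wav -ar 16000 -ac 2 -c:a pcm_s16le output.wav
```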
The current diarization approach only kind-of works with stereo audio. It does not work if you convert mono audio to stereo audio.
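A quick way to check what you are actually feeding it, if (as it appears) the heuristic compares the energy of the left and right channels per segment. These are standard ffprobe/ffmpeg invocations, not anything from whisper.cpp:

```
# Does the file really have two channels?
ffprobe -v error -select_streams a:0 -show_entries stream=channels -of csv=p=0 sf.wav

# Do the two channels actually differ? astats prints per-channel RMS levels;
# near-identical values mean the "stereo" is effectively mono.
ffmpeg -i sf.wav -af astats -f null - 2>&1 | grep "RMS level"
```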
Getting back to the original question: I do have the same issue where I can see the speaker labels (when supplying a proper stereo file), but they don't appear in the output files. I've tried vtt, txt, srt and csv, but that doesn't seem to make a difference.
@ggerganov please help, I did exactly the same thing as @Dmitriuso did:

```
yt-dlp -xv --audio-format wav -o skillsfuture.wav "https://www.youtube.com/watch?v=girQacfWjMw&list=PLH2CR4s1lqyjFm4vQPKT0-hE8sh2T10I1"
ffmpeg -i skillsfuture.wav -acodec pcm_s16le -ar 16000 sf.wav
./main -m ../whisper-models/ggml-base.en.bin -di sf.wav
```

(Note the quotes around the URL; unquoted, the `&` would background the command.) But I get all question marks for speakers.
Would it be possible to integrate the ECAPA-TDNN model from SpeechBrain into this project, like the following project has done? https://huggingface.co/spaces/vumichien/Whisper_speaker_diarization

I tested it with this video, https://www.youtube.com/watch?v=girQacfWjMw&list=PLH2CR4s1lqyjFm4vQPKT0-hE8sh2T10I1, and it works pretty well. But it is Python code.
As a workaround I tried slightly delaying one channel to create a fake pseudo-stereo track, and of course it didn't work: I got all the speaker ? tags again. What kind of audio alteration would do the trick and convert mono to stereo? I know it's very unconventional, but maybe it works.
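For reference, a pseudo-stereo track like that can be built with ffmpeg's adelay filter, as in the sketch below (`mono.wav` is a placeholder). This also suggests why it cannot work: if the diarization heuristic compares the energy of the two channels, a delayed copy has virtually the same energy in every window as the original, so the channels never disagree about who is louder.

```
# Sketch: duplicate the mono stream, delay one copy by 50 ms, and join
# the two copies into a stereo file. The result has two channels with
# (almost) identical per-window energy, which is why an energy-comparison
# heuristic would still output '?' for every segment.
ffmpeg -i mono.wav -filter_complex \
  "[0:a]asplit[l][r];[r]adelay=50[rd];[l][rd]join=inputs=2:channel_layout=stereo[a]" \
  -map "[a]" -ar 16000 -c:a pcm_s16le pseudo_stereo.wav
```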
I created a diarization library in Rust based on a C++ library. It could easily be rewritten in C++. It works pretty well: it uses a small VAD model plus a speaker verification model, and it's fast, processing about 1 hour of audio in a minute. The only question is how to use the VAD-detected segments efficiently with whisper.cpp, as I've heard it has a minimum window of 30 s.

You can see how simple the diarization is in that repo :) https://github.com/thewh1teagle/sherpa-rs
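On the 30 s point: as far as I understand, Whisper operates on 30-second windows, so whisper.cpp pads a shorter chunk up to 30 s and it costs nearly as much as a full window; merging adjacent VAD segments up to roughly 30 s should amortize that. A minimal sketch of the per-segment route, with made-up timestamps and file names:

```
# Hypothetical VAD segment at 12.40-18.90 s: cut it out, then transcribe it.
# whisper.cpp pads the chunk to a full 30 s window internally, so feeding
# many tiny segments one by one wastes compute; batch neighbours first.
ffmpeg -i input.wav -ss 12.40 -to 18.90 -ar 16000 -ac 1 -c:a pcm_s16le seg_000.wav
./main -m ../whisper-models/ggml-base.en.bin seg_000.wav
```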
Maybe this is not an issue but a design decision, but when choosing --diarize it seems that only the output on the screen is diarized, while the output files do not contain the "(Speaker X)" prefix.
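A minimal way to reproduce the report, assuming a genuinely stereo 16 kHz sample and a base.en model (both paths are placeholders): the speaker labels show up on stdout but not in the generated .vtt file.

```
# Transcribe with diarization and also write a VTT file. The speaker
# labels appear in the terminal output, but stereo-sample.wav.vtt
# reportedly contains only the plain text.
./main -m models/ggml-base.en.bin --diarize --output-vtt stereo-sample.wav
```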