ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Diarization #525

Open bilo1967 opened 1 year ago

bilo1967 commented 1 year ago

Maybe this is not an issue but a design decision, but when choosing --diarize it seems that only the output on the screen is diarized, while the output files do not contain the "(Speaker X)" prefix.
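(As a possible workaround, assuming the diarized text printed to the screen is all that is needed: stdout can simply be redirected to a file. The model and file names below are placeholders.)

# capture the diarized screen output in a text file
./main -m models/ggml-medium.bin --diarize interview.wav > interview-diarized.txt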

Dmitriuso commented 1 year ago

@bilo1967 maybe straying from the topic a little bit, but how did you encode your audio file to make the diarization work? I tried to encode my audio as a 16 kHz stereo .wav with ffmpeg, but I only got ? everywhere and no speaker numbers.

bilo1967 commented 1 year ago

I tried to encode my audio into 16kHz stereo .wav with ffmpeg, but I got only ? everywhere and no speaker number.

I didn't actually do anything. I had some MP3 files from a BBC radio programme. I converted them with FFmpeg and fed them to 'main', which showed on stdout an attempt at diarisation, with question marks but also several nice 'speaker 0' and 'speaker 1' labels. I also tried it on a file with three speakers in discussion, one of them speaking Italian, another Spanish and one Portuguese. Although there were many mistakes, there were not so many "?". I don't know if it depends on the model you're using. I used "medium" for both.

Dmitriuso commented 1 year ago

@bilo1967 thanks for the reply. That's strange, because I get many question marks in most cases. What I did was download a YT video and extract the audio from it with yt-dlp -xv --audio-format wav -o $SAMPLES/$audio_name.wav $YT_URL, then I converted it to 16 kHz with ffmpeg -i $SAMPLES/$audio_name.wav -acodec pcm_s16le -ar 16000 $SAMPLES/$audio16khz_name.wav. Did you do the same?
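(A quick check worth doing, on the assumption that the extracted audio may not actually be stereo: inspect the channel count of the converted file with ffprobe. The file name follows the variables above; the output should read channels=2 for stereo diarization to have a chance.)

# print the number of audio channels in the converted WAV
ffprobe -v error -select_streams a:0 -show_entries stream=channels -of default=noprint_wrappers=1 $SAMPLES/$audio16khz_name.wav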

ggerganov commented 1 year ago

The current diarization approach only kind-of works with stereo audio. It does not work if you convert mono audio to stereo audio.
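(For reference, a minimal sketch of an ffmpeg invocation that keeps both channels of a genuinely stereo source while resampling to 16 kHz, 16-bit PCM; the file names are placeholders. Note that -ac 2 only preserves or duplicates channels, it cannot create real left/right separation from a mono recording.)

# resample a truly stereo source to 16 kHz PCM, keeping both channels
ffmpeg -i interview_stereo.mp3 -ar 16000 -ac 2 -c:a pcm_s16le interview_stereo.wav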

hay commented 1 year ago

Getting back to the original question: I do have the same issue where I can see the speaker roles (when supplying a proper stereo file), but they don't appear in the output files. I've tried vtt, txt, srt and csv, but that doesn't seem to make a difference.

leohuang2013 commented 1 year ago

@ggerganov please help, I did exactly the same thing as @Dmitriuso did:

yt-dlp -xv --audio-format wav -o skillsfuture.wav "https://www.youtube.com/watch?v=girQacfWjMw&list=PLH2CR4s1lqyjFm4vQPKT0-hE8sh2T10I1"
ffmpeg -i skillsfuture.wav -acodec pcm_s16le -ar 16000 sf.wav
./main -m ../whisper-models/ggml-base.en.bin -di sf.wav

But I get all question marks for the speakers.
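(Two things that might be worth trying, based on the earlier comments rather than anything confirmed: verify that sf.wav really has two channels, since yt-dlp may deliver mono audio, and retry with a larger model, since bilo1967 got reasonable speaker labels with medium while base.en is used here. The model path is a placeholder.)

# check that the converted file is actually stereo
ffprobe -v error -select_streams a:0 -show_entries stream=channels -of default=noprint_wrappers=1 sf.wav
# retry diarization with the medium model
./main -m ../whisper-models/ggml-medium.bin -di sf.wav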

leohuang2013 commented 1 year ago

Would it be possible to integrate the ECAPA-TDNN model from SpeechBrain into this project, like the following project has done? https://huggingface.co/spaces/vumichien/Whisper_speaker_diarization

I tested it with this video, https://www.youtube.com/watch?v=girQacfWjMw&list=PLH2CR4s1lqyjFm4vQPKT0-hE8sh2T10I1, and it works pretty well. But it is Python code.

nahuel89p commented 10 months ago

As a workaround I tried slightly delaying one channel to create a fake pseudo-stereo track, and of course it didn't work; I got all the speaker ? tags again. What kind of audio alteration would do the trick and turn mono into usable stereo? I know it's very unconventional, but maybe something works.

thewh1teagle commented 1 week ago

I created a diarization library in Rust based on a C++ library. It can easily be rewritten in C++. It works pretty well. It uses a small VAD model + a speaker verification model. And it's fast: 1 hour of audio in about a minute... The only question is how to use the VAD-detected segments efficiently with whisper.cpp, as I've heard it has a minimum segment limit of 30 s.

You can see how simple the diarization is in that repo :) https://github.com/thewh1teagle/sherpa-rs