MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License
2.53k stars, 243 forks

about whisperx and diarization.py #140

Closed vladgrand2 closed 7 months ago

vladgrand2 commented 8 months ago

First of all, I added a --language option to your script, with the choices taken from whisperx:

    import argparse

    from whisperx.utils import LANGUAGES, TO_LANGUAGE_CODE

    parser = argparse.ArgumentParser()
    # Accept either a language code ("ru") or a title-cased language name ("Russian").
    parser.add_argument(
        "--language",
        type=str,
        default=None,
        choices=sorted(LANGUAGES.keys()) + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]),
        help="Language spoken in the audio; omit this option to perform language detection",
    )
    args = parser.parse_args()

    # None means no language was forced and detection should run.
    language = args.language
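Since the choices list accepts both language codes and title-cased language names, the value may need to be mapped to a code before it is passed on. Here is a minimal sketch of that mapping, assuming the downstream call expects a code such as "ru". The two dictionaries are small stand-ins for the tables imported from whisperx.utils, and `normalize_language` is a hypothetical helper name, not something from either project.

```python
# Stand-in tables for illustration only; the real script would import
# LANGUAGES and TO_LANGUAGE_CODE from whisperx.utils instead.
LANGUAGES = {"en": "english", "ru": "russian", "de": "german"}
TO_LANGUAGE_CODE = {name: code for code, name in LANGUAGES.items()}


def normalize_language(value):
    """Map a --language value (code or name, any case) to a language code.

    Returns None when no language was given, which leaves detection on.
    """
    if value is None:
        return None
    key = value.strip().lower()
    if key in LANGUAGES:  # already a code such as "ru"
        return key
    if key in TO_LANGUAGE_CODE:  # a name such as "Russian"
        return TO_LANGUAGE_CODE[key]
    raise ValueError(f"Unsupported language: {value!r}")


print(normalize_language("Russian"))
print(normalize_language("ru"))
```

With this in place, `language = normalize_language(args.language)` would give the same code whether the user typed the name or the code.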

But I wanted to talk about alignment. WhisperX and your script produce different alignments even though both use the same alignment method from whisperx. Perhaps the integration into the script is not entirely complete, but I still can't figure out how to fix this.

WhisperX always gives the same result for the same file. diarize.py gives a different result on every run for the same file, and the difference always comes from the alignment.
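One way to confirm that two runs differ only in alignment is to compare their word lists and separate text differences from timing differences. This is a rough sketch, assuming each run has been exported as a list of dictionaries with "word", "start" and "end" keys. That layout, the `timing_diffs` name and the 20 ms tolerance are all assumptions made for illustration.

```python
def timing_diffs(run_a, run_b, tol=0.02):
    """Compare two word lists position by position.

    Returns a list of (kind, word_a, word_b) tuples, where kind is "text"
    when the words differ and "timing" when the words match but a start or
    end time differs by more than `tol` seconds. Words beyond the length of
    the shorter list are not compared.
    """
    diffs = []
    for a, b in zip(run_a, run_b):
        if a["word"] != b["word"]:
            diffs.append(("text", a, b))
        elif abs(a["start"] - b["start"]) > tol or abs(a["end"] - b["end"]) > tol:
            diffs.append(("timing", a, b))
    return diffs


# Made-up data standing in for two runs over the same file.
first = [{"word": "привет", "start": 0.0, "end": 0.4},
         {"word": "мир", "start": 0.5, "end": 0.9}]
second = [{"word": "привет", "start": 0.0, "end": 0.4},
          {"word": "мир", "start": 0.62, "end": 0.9}]

for kind, a, b in timing_diffs(first, second):
    print(kind, a["word"], a["start"], b["start"])
```

If every reported difference is of kind "timing", the transcription itself is stable and only the alignment step varies between runs.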

I also noticed that diarize.py always cuts off the beginning of the file. Transcription of speakers starts from 8-11 seconds, and the first seconds are not transcribed.

I am attaching the results for one file from WhisperX and from diarize.py: file1_pyannonote.docx, file1_nemo.docx

vladgrand2 commented 7 months ago

I'm still trying to make it work better for Russian, and I noticed that changing the VAD parameters to:

    config.diarizer.vad.parameters.onset = 0.3
    config.diarizer.vad.parameters.offset = 0.1
    config.diarizer.vad.parameters.pad_offset = -0.1

makes the transcription better, but it is still not ideal.
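To show what these three values control, here is a simplified sketch of onset/offset thresholding over per-frame speech probabilities. It is not NeMo's implementation: the `speech_segments` function, the frame length and the probabilities are made up for the example. The idea is that a segment opens when the probability reaches `onset`, closes when it falls below `offset`, and `pad_offset` is then added to the segment end, so a negative value trims it.

```python
def speech_segments(probs, frame_sec, onset, offset, pad_offset=0.0):
    """Turn per-frame speech probabilities into (start, end) segments in seconds."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        t = i * frame_sec
        if start is None and p >= onset:
            start = t  # probability rose to the onset threshold
        elif start is not None and p < offset:
            segments.append((start, t))  # probability fell below the offset threshold
            start = None
    if start is not None:
        # speech was still active at the end of the input
        segments.append((start, len(probs) * frame_sec))
    # shift each segment end by pad_offset and drop segments that become empty
    padded = [(s, e + pad_offset) for s, e in segments]
    return [(round(s, 3), round(e, 3)) for s, e in padded if e > s]


probs = [0.1, 0.35, 0.6, 0.4, 0.2, 0.05, 0.1]

# Low thresholds, as in the settings above: quiet speech is kept.
print(speech_segments(probs, 0.1, onset=0.3, offset=0.1, pad_offset=-0.1))
# Higher thresholds: only the loudest frame survives.
print(speech_segments(probs, 0.1, onset=0.5, offset=0.5))
```

On this toy input, the lower thresholds keep a longer stretch of audio as speech, which is consistent with quieter passages being transcribed instead of dropped.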

vladgrand2 commented 7 months ago

Even though I pass an option from the script to whisperx to force the language, I still get words from other languages in the text. The original WhisperX doesn't do this. How can this be overcome?
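To measure how often this happens, one option is to flag words in the transcript that are written in Latin script. This is only a diagnostic sketch for Russian output and does not fix the cause; `foreign_words` is a hypothetical helper, and it assumes that any word with Latin letters and no Cyrillic letters is unwanted.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")  # basic Cyrillic Unicode block
LATIN = re.compile(r"[A-Za-z]")
WORD = re.compile(r"[^\W\d_]+")  # runs of letters in any script


def foreign_words(text):
    """Return words that contain Latin letters and no Cyrillic letters."""
    return [w for w in WORD.findall(text)
            if LATIN.search(w) and not CYRILLIC.search(w)]


print(foreign_words("Привет, это meeting по проекту okay"))
```

Running this over the output of both tools on the same file would show whether the stray words really appear only in one of them.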

MahmoudAshraf97 commented 7 months ago

Different performance is expected when using different diarization methods, and it depends on the language, so you should use the one that gives you the best results. We use the exact same transcription and alignment as WhisperX; the only step that differs is the diarization.

vladgrand2 commented 7 months ago

Do you think NeMo is what causes the English words and words from other languages to appear when transcribing Russian?

vladgrand2 commented 7 months ago

Your latest version is simply great. It solved all the problems I encountered. It works just like magic. And the diarization from NeMo simply outperformed pyannote. Thank you very much.