Majdoddin / nlp


(Small) parts/sentences of transcript get lost when matching transcriptions and diarization #1

Closed vdsfwreft closed 1 year ago

vdsfwreft commented 1 year ago

First off, fantastic work. Thanks a lot for creating this. It worked much better than what I had previously tried, and I have already used it quite a lot for transcribing podcasts.

It seems small parts of the Whisper-generated transcript get lost when matching transcription and diarization; I noticed this when trying it with other YouTube videos. Whisper correctly transcribes the full video, but in the matching phase small chunks aren't included in the final transcript.

E.g., the Whisper transcript reads:

"should just rein in their lawyers because they're I'm sure racking up a fortune in legal fees. And they're trying to. It sounds like. Sachs, the case is that they're trying to make"

The final output for the same spot reads:

link | 00:07:3.61 [Lex] should just rein in their lawyers because they're I'm
link | 00:07:9.88 [Lex] It sounds like.
link | 00:07:9.99 [Lex] Sachs, the case is that they're trying to make

-> "sure racking up a fortune in legal fees. And they're trying to." gets lost/is missing.

(It isn't limited to this YouTube video or this specific spot: I tried it with several podcasts and it happened multiple times in all of them. This episode was just chosen randomly, as the first one of the podcast I found that is under one hour.)

The only code changes I made were swapping the YouTube link and changing the audio-file cut length from 20 to 50 minutes ("#!yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav -o lecun.wav -- https://www.youtube.com/watch?v=GojTj91eLho" was changed to "!yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav -o lecun.wav -- https://www.youtube.com/watch?v=jlK5tsUuEP0", and "t2 = 20 * 60 * 1000" to "t2 = 50 * 60 * 1000"), so basically no important code changes. I have also uploaded the full code to this Google Colab: https://colab.research.google.com/drive/1BwST1H7sfvgAZ53KNL5QufoEMnPHelLk?usp=sharing, but as I said it's barely changed.
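(For reference, a minimal sketch of the cut step as I understand it, assuming the notebook slices the audio with pydub and that t1/t2 are in milliseconds, hence the * 60 * 1000; t1 and the output file name are assumptions on my part:)

from pydub import AudioSegment  # assumption: the notebook cuts the audio with pydub

t1 = 0 * 60 * 1000   # assumed start of the cut, in milliseconds
t2 = 50 * 60 * 1000  # end of the cut: 50 minutes instead of the original 20

audio = AudioSegment.from_wav("lecun.wav")      # the file downloaded by yt-dlp
audio[t1:t2].export("audio.wav", format="wav")  # pydub slices by milliseconds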

Do you know why this issue arises and/or how it might be fixed?

Thanks in advance for any help, and in any case thanks for creating this.

rowbot1 commented 1 year ago

Do you remember how long this part took to run?

DEMO_FILE = {'uri': 'blabal', 'audio': 'audio.wav'}
dz = pipeline(DEMO_FILE)  # run the pyannote speaker-diarization pipeline on the audio

with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))  # save the diarization result as plain text

Mine's still running and it's been 12 minutes.
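(For context, `pipeline` here is presumably the pyannote.audio speaker-diarization pipeline loaded earlier in the notebook, along these lines; the exact model name is an assumption:)

from pyannote.audio import Pipeline

# newer pyannote versions may also require use_auth_token=... here
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")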

vdsfwreft commented 1 year ago

Do you remember how long this part took to run?

DEMO_FILE = {'uri': 'blabal', 'audio': 'audio.wav'}
dz = pipeline(DEMO_FILE)  # run the pyannote speaker-diarization pipeline on the audio

with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))  # save the diarization result as plain text

Mine's still running and it's been 12 minutes.

I don't remember exactly, but longer than most of the other steps. (EDIT: I just ran it again and it took "3m" ("238.1s") according to Google Colab. The Whisper part ("!whisper dz.wav --language en --model large") later took 31m.)

(Are you using the "GPU" hardware accelerator (under runtime type)? Otherwise the machine-learning steps, like the diarization and later Whisper, will take forever (instead of maybe 20 minutes), I think. Link/picture for a tutorial on enabling the GPU: https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm https://www.tutorialspoint.com/google_colab/images/enabling_gpu.jpg )
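(A quick way to check whether the GPU runtime is actually active, assuming PyTorch is available in the Colab environment:)

import torch

print(torch.cuda.is_available())  # True once the "GPU" hardware accelerator is enabled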

(However, the issue itself (that "(small) parts/sentences of the transcript get lost when matching transcriptions and diarization") isn't dependent on runtime duration. When I just ran the code again, transcript parts got lost as expected (and as I had seen previously with other files), including at the example spot from the opening post.)

Majdoddin commented 1 year ago

@vdsfwreft The problem is that Whisper does not reliably start a new timestamp after a silent part. This was unexpected. I am working on it.
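(A toy illustration of this failure mode with made-up numbers, not the notebook's actual matching code: if Whisper emits one long caption that runs through a silence without starting a new timestamp, no single diarization turn covers the whole caption, and a matcher that relies on timestamp containment drops part of its text:)

# one hypothetical Whisper caption (start, end, text) spanning a silent gap
caption = (423.6, 429.9, "should just rein in their lawyers ... they're trying to.")
# two diarization turns around the silence from 426.0 to 429.9
turns = [(423.6, 426.0), (429.9, 432.0)]

start, end, text = caption
fits_one_turn = any(t0 <= start and end <= t1 for t0, t1 in turns)
print(fits_one_turn)  # False: the caption fits in no single turn, so text is lost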

Majdoddin commented 1 year ago

@vdsfwreft So I rewrote it to cut the audio into single-speaker segments according to the pyannote diarization, and to run Whisper on each segment. It seems stable now. Check the updated Colab notebook and the result. And please consider supporting my work with a donation.
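(A minimal sketch of the rewritten approach described above, assuming pydub for the cutting and the openai-whisper Python package for transcription; `dz` would be the pyannote diarization from the earlier step, and the file names are assumptions — the notebook's actual code may differ:)

import whisper
from pydub import AudioSegment

audio = AudioSegment.from_wav("audio.wav")
model = whisper.load_model("large")

# itertracks yields (segment, track, speaker) for each diarization turn
for i, (turn, _, speaker) in enumerate(dz.itertracks(yield_label=True)):
    start_ms, end_ms = int(turn.start * 1000), int(turn.end * 1000)
    chunk = f"segment_{i}.wav"                          # hypothetical file name
    audio[start_ms:end_ms].export(chunk, format="wav")  # single-speaker cut
    result = model.transcribe(chunk, language="en")     # run Whisper per segment
    print(f"[{speaker}] {result['text'].strip()}")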