MahmoudAshraf97 / whisper-diarization

Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
BSD 2-Clause "Simplified" License

TypeError in _get_next_start_timestamp #103

Closed Toby1091 closed 9 months ago

Toby1091 commented 9 months ago

The most recent commits resolved many of the problems I had, thanks a lot for this - your repository is one of the most important elements in my research and saved me weeks of work. After successfully diarizing some files, a new error was thrown:

python diarize.py -a audio_file.mp3 --whisper-model large-v2

Failed to align segment (" Das haben wir nicht einverstanden."): backtrack failed, resorting to original...
Traceback (most recent call last):
  File "/Users/tobiasarbogast/git/audio_transcription/whisper-diarization/diarize.py", line 110, in <module>
    word_timestamps = filter_missing_timestamps(result_aligned["word_segments"])
  File "/Users/tobiasarbogast/git/audio_transcription/whisper-diarization/helpers.py", line 361, in filter_missing_timestamps
    ws["end"] = _get_next_start_timestamp(word_timestamps, i)
  File "/Users/tobiasarbogast/git/audio_transcription/whisper-diarization/helpers.py", line 336, in _get_next_start_timestamp
    " " + word_timestamps[next_word_index]["word"]
TypeError: can only concatenate str (not "NoneType") to str

MahmoudAshraf97 commented 9 months ago

Hi, please provide the audio file so I can reproduce the issue. Thanks.

Toby1091 commented 9 months ago

The problem is that the file is somewhat confidential audio from my research, but I can share it with you personally if you get in touch (I sent you an invite on LinkedIn).

barbogast commented 9 months ago

Just debugged the issue together with @Toby1091:

The crash occurs when filter_missing_timestamps() encounters multiple consecutive word_timestamps entries without a start key. For the first such entry, _get_next_start_timestamp() merges the next entry's word into the current one and then sets the word key of that next entry to None to mark it as "deleted" (see https://github.com/MahmoudAshraf97/whisper-diarization/blob/main/helpers.py#L326).

When the for loop in filter_missing_timestamps() then processes the second entry, which still has no start key, it calls _get_next_start_timestamp() again and crashes when trying to concatenate word_timestamps[next_word_index]["word"] (which was set to None) to a string.
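To make the failure mode concrete, here is a minimal sketch that reproduces the same kind of TypeError. The functions next_start() and fill_missing() are hypothetical, simplified stand-ins written for illustration, not the actual code in helpers.py, and the word list is made-up sample data. In this simplified model the crash needs three consecutive entries without a start key.

```python
def next_start(words, i):
    """Hypothetical, simplified stand-in for _get_next_start_timestamp()."""
    j = i + 1
    while j < len(words):
        if words[j].get("start") is not None:
            return words[j]["start"]
        # No start time: merge the word into entry i and mark it as deleted.
        words[i]["word"] += " " + words[j]["word"]
        words[j]["word"] = None
        j += 1
    return None


def fill_missing(words):
    """Hypothetical stand-in for the loop in filter_missing_timestamps()."""
    for i, ws in enumerate(words):
        # Unguarded check: entries already marked as deleted still match.
        if i > 0 and ws.get("start") is None:
            ws["start"] = words[i - 1]["end"]
            ws["end"] = next_start(words, i)
    return words


def make_words():
    """Made-up sample data with three consecutive entries lacking 'start'."""
    return [
        {"word": "Das", "start": 0.0, "end": 0.4},
        {"word": "haben"},
        {"word": "wir"},
        {"word": "nicht"},
        {"word": "einverstanden.", "start": 2.0, "end": 2.8},
    ]


def reproduce():
    """Return the error message if the sketch crashes, else None."""
    try:
        fill_missing(make_words())
    except TypeError as exc:
        return str(exc)
    return None


print(reproduce())
```

Processing "haben" merges "wir" and "nicht" into it and sets both of their word values to None. The loop then reaches the former "wir" entry, which still has no start key, and the concatenation with the already deleted "nicht" entry fails.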

Replacing https://github.com/MahmoudAshraf97/whisper-diarization/blob/main/helpers.py#L346 with

ws.get("start") is None and ws.get("word") is not None:

or something along those lines should fix the issue.
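A minimal sketch of how such a guard could behave, assuming a simplified model of the two functions. The names next_start() and fill_missing_guarded() and the sample words are hypothetical and only illustrate the idea; they are not the code in helpers.py.

```python
def next_start(words, i):
    """Hypothetical, simplified stand-in for _get_next_start_timestamp()."""
    j = i + 1
    while j < len(words):
        if words[j].get("start") is not None:
            return words[j]["start"]
        # No start time: merge the word into entry i and mark it as deleted.
        words[i]["word"] += " " + words[j]["word"]
        words[j]["word"] = None
        j += 1
    return None


def fill_missing_guarded(words):
    """Hypothetical loop using the guard suggested above."""
    for i, ws in enumerate(words):
        # Skip entries that were already merged into an earlier word.
        if i > 0 and ws.get("start") is None and ws.get("word") is not None:
            ws["start"] = words[i - 1]["end"]
            ws["end"] = next_start(words, i)
    # Drop the entries that were marked as deleted.
    return [ws for ws in words if ws["word"] is not None]


# Made-up sample data with three consecutive entries lacking 'start'.
words = [
    {"word": "Das", "start": 0.0, "end": 0.4},
    {"word": "haben"},
    {"word": "wir"},
    {"word": "nicht"},
    {"word": "einverstanden.", "start": 2.0, "end": 2.8},
]

result = fill_missing_guarded(words)
for ws in result:
    print(ws["word"], ws["start"], ws["end"])
```

With the extra check, the entries whose word was set to None are never passed to the helper again, so the merged word keeps the previous end as its start and the next available start as its end.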

We are just now re-running the transcription to confirm. Afterwards we could create a PR?

MahmoudAshraf97 commented 9 months ago

Hi @barbogast, I couldn't replicate the issue with @Toby1091's files, so sending a file that reproduces it would be great. The function is supposed to handle arbitrary chunks of missing timestamps, and I tested it before pushing, but maybe I missed something.

MahmoudAshraf97 commented 9 months ago

@barbogast and @Toby1091, I successfully reproduced the error and will get back to you with my findings.

MahmoudAshraf97 commented 9 months ago

I reached the same conclusion as @barbogast. Feel free to open a PR, or I can just commit the fix directly, whichever suits you. Thanks.

barbogast commented 9 months ago

I wouldn't mind you just committing the fix. Otherwise I'll create a PR on Tuesday.

MahmoudAshraf97 commented 9 months ago

Fixed in 570807ef438ec0e6cad5e5575e8cb208fb183da6

barbogast commented 9 months ago

Great, thanks 👍