SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

IndexError: list index out of range in add_word_timestamps function #1118

Open formater opened 1 day ago

formater commented 1 day ago

Hi, I found a rare condition: with a specific WAV file, language, and prompt, transcribing with word_timestamps=True raises a list index out of range error in the add_word_timestamps function:

  File "/usr/local/src/transcriber/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 1574, in add_word_timestamps
    median_duration, max_duration = median_max_durations[segment_idx]
                                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
IndexError: list index out of range

It seems the median_max_durations list has fewer elements than the segments list.

I'm using the large-v3-turbo model with these transcribe settings:

segments, _ = asr_model.transcribe(audio_to_analize, language="fr", condition_on_previous_text=False, initial_prompt="Free", task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5) 
segments = list(segments)  # The transcription will actually run here.

As far as I can see, median_max_durations is populated from alignments, so maybe something is wrong there? If I change the language or the prompt, or use another sound file, there is no issue.

Thank you

MahmoudAshraf97 commented 1 day ago

I'm aware that this error exists, but I had no luck reproducing it. Can you write the exact steps to reproduce and upload the audio file?

formater commented 1 day ago

Yes. The sample Python code that triggers the issue:

import torch
from faster_whisper import WhisperModel

asr_model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8", download_root="./models")
segments, _ = asr_model.transcribe('test.wav',  language='fr', condition_on_previous_text=False, initial_prompt='Free', task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5)
segments = list(segments)  # The transcription will actually run here.

The audio sample is attached: test.zip

MahmoudAshraf97 commented 1 day ago

I was not able to reproduce it on my machine or on Colab.

formater commented 1 day ago

Maybe the Python version, Debian, PyTorch, or something else is slightly different between our setups. Is there anything I can do on my side to get more debug logs and see what the issue is?
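
One way to compare setups is to record the installed package versions and turn on the library's debug logging before transcribing. This is a minimal sketch, not an official diagnostic: collect_versions is a hypothetical helper, the package list is a guess at what is relevant, and the "faster_whisper" logger name is taken from the project README.

```python
import logging
import platform
import sys
from importlib import metadata


def collect_versions(packages=("faster-whisper", "ctranslate2", "torch")):
    """Return interpreter, OS, and package versions as a dict (hypothetical helper)."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = "not installed"
    return report


# Enable debug output from the library's logger before calling transcribe().
logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

for name, version in collect_versions().items():
    print(f"{name}: {version}")
```

Posting this output from both machines would show whether the setups actually differ.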

MahmoudAshraf97 commented 1 day ago

Are you using the master branch? median_max_durations is initialized as an empty list, and since you are using sequential transcription, it will have a single value. The only way to get this error is for the list to still be empty, which means the for loop at line 1565 was never executed. That happens when alignments is an empty list, so you need to figure out why that is happening.

https://github.com/SYSTRAN/faster-whisper/blob/203dddb047fd2c3ed2a520fe1416467a527e0f37/faster_whisper/transcribe.py#L1561-L1595
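
The failure mode described above can be sketched in isolation. This is a simplified illustration, not the actual faster-whisper code: durations_per_alignment and durations_for_segments are hypothetical names, and the (median, max) computation is an assumption standing in for whatever the linked lines compute.

```python
from statistics import median


def durations_per_alignment(alignments):
    """Return one (median, max) word-duration pair per alignment entry."""
    median_max_durations = []  # starts empty, as described above
    for word_spans in alignments:
        # Each entry is assumed to be a list of (start, end) word times.
        durations = [end - start for start, end in word_spans]
        if durations:
            median_max_durations.append((median(durations), max(durations)))
        else:
            median_max_durations.append((0.0, 0.0))
    return median_max_durations


def durations_for_segments(segments, alignments):
    """Look up durations by segment index, mirroring the failing access."""
    median_max_durations = durations_per_alignment(alignments)
    # Raises IndexError when alignments has fewer entries than segments.
    return [median_max_durations[idx] for idx, _ in enumerate(segments)]


# One segment with one alignment entry: the lookup succeeds.
print(durations_for_segments(["bonjour"], [[(0.0, 0.5), (0.5, 1.5)]]))

# One segment but no alignments: the loop never runs and the lookup fails.
try:
    durations_for_segments(["bonjour"], [])
except IndexError as exc:
    print(f"IndexError: {exc}")
```

If this matches what is happening, checking whether alignments is empty just before the loop would confirm it.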