SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

IndexError: list index out of range in add_word_timestamps function #1118

Closed: formater closed this issue 5 days ago

formater commented 2 weeks ago

Hi, I found a rare condition: with a specific wav file, a specific language, and a specific prompt, when I try to transcribe with word_timestamps=True, there is a list index out of range error in the add_word_timestamps function:

  File "/usr/local/src/transcriber/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 1574, in add_word_timestamps
    median_duration, max_duration = median_max_durations[segment_idx]
                                    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
IndexError: list index out of range

It seems the median_max_durations list has fewer elements than the segments list.

I'm using the large-v3-turbo model with these transcribe settings:

segments, _ = asr_model.transcribe(audio_to_analize, language="fr", condition_on_previous_text=False, initial_prompt="Free", task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5) 
segments = list(segments)  # The transcription will actually run here.

As I see it, median_max_durations is populated from alignments, so maybe something is wrong there? If I change the language or prompt, or use another sound file, there is no issue.

Thank you

MahmoudAshraf97 commented 2 weeks ago

I'm aware that this error exists, but I had no luck reproducing it. Can you write the exact steps to reproduce it and upload the audio file?

formater commented 2 weeks ago

Yes. Here is sample Python code that generates the issue:

import torch
from faster_whisper import WhisperModel

asr_model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8", download_root="./models")
segments, _ = asr_model.transcribe('test.wav',  language='fr', condition_on_previous_text=False, initial_prompt='Free', task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5)
segments = list(segments)  # The transcription will actually run here.

The audio sample is attached: test.zip

MahmoudAshraf97 commented 2 weeks ago

I was not able to reproduce it on my machine or on Colab.

formater commented 2 weeks ago

Maybe the Python version, Debian, PyTorch... or something else is slightly different between our setups. Is there anything I can do on my side to get more debug logs to see what the issue is?

MahmoudAshraf97 commented 2 weeks ago

Are you using the master branch? median_max_durations is initialized as an empty list, and since you are using sequential transcription it should end up with a single value. The only thing that can cause this error is that it is still an empty list, which means the for loop at line 1565 was never executed; that happens when alignments is an empty list. You need to figure out why that is happening.

https://github.com/SYSTRAN/faster-whisper/blob/203dddb047fd2c3ed2a520fe1416467a527e0f37/faster_whisper/transcribe.py#L1561-L1595
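
For illustration, here is a minimal sketch of that failure mode. It is not the library source; the names just mirror the linked snippet, and the per-word duration math is simplified:

import statistics

# Sketch only: median_max_durations is filled solely by iterating over
# `alignments`, so an empty `alignments` leaves it empty even though
# `segments` still has entries.
def build_median_max_durations(alignments):
    median_max_durations = []
    for alignment in alignments:  # never executes when alignments == []
        durations = [word["end"] - word["start"] for word in alignment] or [0.0]
        median_duration = min(0.7, statistics.median(durations))
        median_max_durations.append((median_duration, median_duration * 2))
    return median_max_durations

segments = [{"text": "..."}]  # one decoded segment...
median_max_durations = build_median_max_durations([])  # ...but alignments came back empty

for segment_idx, segment in enumerate(segments):
    # Raises the reported IndexError: list index out of range
    median_duration, max_duration = median_max_durations[segment_idx]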

krmao commented 1 week ago

Same here, while testing whisper_streaming:

Traceback (most recent call last):
  File "C:\Users\kr.mao\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "C:\Users\kr.mao\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online_server.py", line 183, in <module>
    proc.process()
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online_server.py", line 162, in process
    o = online.process_iter()
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online.py", line 378, in process_iter
    res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
  File "F:\Workspace\skills\python3\whisper_streaming\whisper_online.py", line 138, in transcribe
    return list(segments)
  File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 2016, in restore_speech_timestamps
    for segment in segments:
  File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 1256, in generate_segments
    self.add_word_timestamps(
  File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 1595, in add_word_timestamps
    median_duration, max_duration = median_max_durations[segment_idx]
IndexError: list index out of range

faster_whisper version.py

"""Version information."""

__version__ = "1.1.0rc0"

MahmoudAshraf97 commented 1 week ago

This problem is still non-reproducible despite all the methods provided, and it will not be solved without a reproduction. Someone who has the problem needs to create a Colab notebook that reproduces it; if they cannot reproduce it on Colab, they need to isolate what in their environment is causing it. Without that, there is nothing that can be done.

OliveSerg commented 6 days ago

> This problem is still non-reproducible despite all the methods provided, and it will not be solved without a reproduction. Someone who has the problem needs to create a Colab notebook that reproduces it; if they cannot reproduce it on Colab, they need to isolate what in their environment is causing it. Without that, there is nothing that can be done.

https://gist.github.com/OliveSerg/cc6c409126567a40c94eb94339a13bae

I was able to reproduce it on Colab with the following files: test.zip. I was not able to reproduce it with @formater's test file, though. The files are just a French Bible verse from LibriVox and a YouTube short.

I used ctranslate2==4.4.0 because of 1806.

The error occurs only when compute_type="int8" or "int8_float16", task="translate", and word_timestamps=True. No further debugging of the parameters was done aside from varying these three.
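
For concreteness, a minimal repro sketch assembled from those parameters (not the exact gist code; the model name follows the original report, and "audio.wav" is a placeholder for the files in test.zip):

from faster_whisper import WhisperModel

# Parameter combination that triggered the error above; the audio path is a
# placeholder for the LibriVox / YouTube-short files from test.zip.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")
segments, _ = model.transcribe(
    "audio.wav",
    task="translate",
    word_timestamps=True,
    beam_size=5,
)
segments = list(segments)  # the generator runs here; the IndexError is raised during this step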

Purfview commented 5 days ago

@MahmoudAshraf97

Maybe it's related to weird output like this (that's from the prebug 193 revision of faster-whisper):

    {
        "id": 279,
        "seek": 132430,
        "start": 1542.84,
        "end": 1545.14,
        "text": " Nuðarr你可以 það hverðesskj af april",
        "tokens": [51225, 13612, 23436, 289, 81, 42766, 43219, 64, 23436, 276, 331, 23436, 442, 74, 73, 3238, 10992, 388, 51350],
        "temperature": 1.0,
        "avg_logprob": -4.741359252929687,
        "compression_ratio": 1.335164835164835,
        "no_speech_prob": 0.12347412109375,
        "words": [
            {"start": 1542.84, "end": 1542.84, "word": "af", "probability": 0.002758026123046875},
            {"start": 1542.84, "end": 1542.84, "word": "aprilð", "probability": 0.057145535945892334},
            {"start": 1542.84, "end": 1542.84, "word": "jævîr", "probability": 0.1567896842956543},
            {"start": 1542.84, "end": 1542.84, "word": "til", "probability": 0.0018939971923828125},
            {"start": 1542.84, "end": 1542.84, "word": "det", "probability": 0.0033779144287109375},
            {"start": 1542.84, "end": 1543.44, "word": "bældat", "probability": 0.11750292778015137},
            {"start": 1543.44, "end": 1544.36, "word": "brilliant", "probability": 7.152557373046875e-07},
            {"start": 1544.36, "end": 1545.14, "word": "með", "probability": 0.2783784866333008}
        ]
    },
    {
        "id": 280,
        "seek": 132430,
        "start": 1541.32,
        "end": 1543.04,
        "text": "ð jævîr til det bældat brilliant með",
        "tokens": [51350, 23436, 361, 7303, 85, 7517, 81, 8440, 1141, 272, 7303, 348, 267, 10248, 385, 23436, 51436],
        "temperature": 1.0,
        "avg_logprob": -4.741359252929687,
        "compression_ratio": 1.335164835164835,
        "no_speech_prob": 0.12347412109375,
        "words": []
    },
    {
        "id": 281,
        "seek": 135430,
        "start": 1545.14,
        "end": 1546.3,
        "text": " Duð ena porgna prákankenin.",
        "tokens": [50364, 5153, 23436, 465, 64, 1515, 70, 629, 582, 842, 5225, 2653, 259, 13, 50431],
        "temperature": 1.0,
        "avg_logprob": -4.655551255031784,
        "compression_ratio": 1.3051771117166213,
        "no_speech_prob": 0.036651611328125,
        "words": [
            {"start": 1545.14, "end": 1545.36, "word": "Duð", "probability": 0.051422119140625},
            {"start": 1545.36, "end": 1545.36, "word": "ena", "probability": 0.010187149047851562},
            {"start": 1545.36, "end": 1545.44, "word": "porgna", "probability": 0.004482746124267578},
            {"start": 1545.44, "end": 1546.3, "word": "prákankenin.", "probability": 0.04590331315994263}
        ]
    }
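
A small debugging sketch (a hypothetical helper, assuming segment dictionaries shaped like the JSON above) that flags segments with text but an empty words list, which is exactly what segment id 280 shows:

# Hypothetical helper: find segments that decoded text but produced no word
# timestamps, like segment id 280 above.
def find_wordless_segments(segments):
    return [
        (seg["id"], seg["text"])
        for seg in segments
        if seg.get("text", "").strip() and not seg.get("words")
    ]

# With the output above loaded into `segments`, this returns
# [(280, "ð jævîr til det bældat brilliant með")].
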
MahmoudAshraf97 commented 5 days ago

> https://gist.github.com/OliveSerg/cc6c409126567a40c94eb94339a13bae
>
> I was able to reproduce it on Colab with the following files: test.zip. I was not able to reproduce it with @formater's test file, though. The files are just a French Bible verse from LibriVox and a YouTube short.
>
> I used ctranslate2==4.4.0 because of 1806.
>
> The error occurs only when compute_type="int8" or "int8_float16", task="translate", and word_timestamps=True. No further debugging of the parameters was done aside from varying these three.

I managed to reproduce it consistently on Colab, and I also reproduced it on my machine, although not consistently. The reason for the inconsistency is that reproducing it requires the exact same encoder input and generated tokens, and using int8 does not guarantee that, at least on my hardware (RTX 3070 Ti), so I have to try transcribing several times to reproduce it.

What causes the issue is that some segments produce a single timestamp token with no text tokens at all. The find_alignment function used to return an empty list when no words were found, which was fine before #856, but after it we expect find_alignment to return a list of lists. That holds as long as there are text tokens, but in this edge case it returned a single list and skipped the rest of the loop over the other segments in the batch, hence returning fewer alignments than segments and causing the list index out of range error.

I'll open a PR to solve the problem soon.
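
Not the actual fix, but a sketch of the kind of guard that explanation suggests: make sure there is one alignment entry (possibly empty) per segment so the two lists never go out of sync in add_word_timestamps. pad_alignments is a hypothetical helper name:

# Sketch of one possible guard, not the PR itself: pad the alignments so there
# is exactly one (possibly empty) word list per decoded segment.
def pad_alignments(alignments, num_segments):
    alignments = list(alignments)
    while len(alignments) < num_segments:
        alignments.append([])  # segment produced only a timestamp token, no words
    return alignments

# Usage idea: alignments = pad_alignments(find_alignment(...), len(segments))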