I'm aware that this error exists but I had no luck in reproducing it, can you write the exact steps to reproduce and upload the audio file?
Yes. Here is the sample Python code that triggers the issue:
import torch
from faster_whisper import WhisperModel
asr_model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8", download_root="./models")
segments, _ = asr_model.transcribe('test.wav', language='fr', condition_on_previous_text=False, initial_prompt='Free', task='transcribe', word_timestamps=True, suppress_tokens=[-1, 12], beam_size=5)
segments = list(segments) # The transcription will actually run here.
The audio sample is attached: test.zip
I was not able to reproduce it on my machine or on Colab.
Maybe the Python version, Debian, PyTorch... or something else is slightly different between our setups. Is there anything I can do on my side to get more debug logs and see what the issue is?
Are you using the master branch?
median_max_durations is initialized as an empty list, and since you are using sequential transcription it should end up with a single value. The only way this error can happen is if it is still an empty list, which means the for loop at line 1565 was never executed. That happens when alignments is an empty list; you need to figure out why that is the case.
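To make that failure mode concrete, here is a minimal sketch of the logic (simplified, not the actual transcribe.py code; the segment contents are made up):

# Minimal sketch of the failure mode: when `alignments` is empty, the loop that
# fills `median_max_durations` never runs, so the later per-segment lookup fails.
segments = [{"start": 0.0, "end": 1.0, "text": "..."}]   # one segment reached this stage
alignments = []                                          # edge case: no alignment produced

median_max_durations = []
for alignment in alignments:                             # body never executes
    durations = sorted(word["end"] - word["start"] for word in alignment)
    median_max_durations.append((durations[len(durations) // 2], durations[-1]))

for segment_idx, segment in enumerate(segments):
    # IndexError: list index out of range, because median_max_durations is still empty
    median_duration, max_duration = median_max_durations[segment_idx]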
Same here, while testing whisper_streaming:
Traceback (most recent call last):
File "C:\Users\kr.mao\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "C:\Users\kr.mao\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 110, in _get_module_details
__import__(pkg_name)
File "F:\Workspace\skills\python3\whisper_streaming\whisper_online_server.py", line 183, in <module>
proc.process()
File "F:\Workspace\skills\python3\whisper_streaming\whisper_online_server.py", line 162, in process
o = online.process_iter()
File "F:\Workspace\skills\python3\whisper_streaming\whisper_online.py", line 378, in process_iter
res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
File "F:\Workspace\skills\python3\whisper_streaming\whisper_online.py", line 138, in transcribe
return list(segments)
File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 2016, in restore_speech_timestamps
for segment in segments:
File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 1256, in generate_segments
self.add_word_timestamps(
File "F:\Workspace\skills\python3\whisper_streaming\venv\lib\site-packages\faster_whisper\transcribe.py", line 1595, in add_word_timestamps
median_duration, max_duration = median_max_durations[segment_idx]
IndexError: list index out of range
faster_whisper version.py
"""Version information."""
__version__ = "1.1.0rc0"
This problem is still not reproducible with any of the methods provided, and it will not be solved without a reproduction. Someone who has the problem needs to create a Colab notebook that reproduces it, and if they cannot reproduce it on Colab, they need to isolate what in their environment causes it. Without that, there is nothing that can be done.
https://gist.github.com/OliveSerg/cc6c409126567a40c94eb94339a13bae
I was able to reproduce it on Colab with the following files: test.zip. I was not able to reproduce it with @formater's test file, though. The files are just a French Bible verse from LibriVox and a YouTube short.
I used ctranslate2==4.4.0 because of 1806. The error occurs only when compute_type="int8" or "int8_float16", task="translate", and word_timestamps=True. No further debugging with the parameters was done aside from varying these three.
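For reference, a minimal sketch of a reproduction under those settings (the audio filename is a placeholder for one of the files in test.zip; the model size is taken from the original report; only compute_type, task, and word_timestamps come from the observation above):

from faster_whisper import WhisperModel

# Sketch of the reported reproduction conditions.
model = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")  # or "int8_float16"
segments, _ = model.transcribe(
    "test.wav",              # placeholder for a file from test.zip
    task="translate",        # error only reported with translate
    word_timestamps=True,    # and with word timestamps enabled
)
segments = list(segments)    # transcription runs lazily; the IndexError surfaces here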
@MahmoudAshraf97 Maybe it is related to weird output like this (that's from a pre-bug 193 revision of faster-whisper):
{
"id": 279,
"seek": 132430,
"start": 1542.84,
"end": 1545.14,
"text": " Nuðarr你可以 það hverðesskj af april",
"tokens": [51225, 13612, 23436, 289, 81, 42766, 43219, 64, 23436, 276, 331, 23436, 442, 74, 73, 3238, 10992, 388, 51350],
"temperature": 1.0,
"avg_logprob": -4.741359252929687,
"compression_ratio": 1.335164835164835,
"no_speech_prob": 0.12347412109375,
"words": [
{"start": 1542.84, "end": 1542.84, "word": "af", "probability": 0.002758026123046875},
{"start": 1542.84, "end": 1542.84, "word": "aprilð", "probability": 0.057145535945892334},
{"start": 1542.84, "end": 1542.84, "word": "jævîr", "probability": 0.1567896842956543},
{"start": 1542.84, "end": 1542.84, "word": "til", "probability": 0.0018939971923828125},
{"start": 1542.84, "end": 1542.84, "word": "det", "probability": 0.0033779144287109375},
{"start": 1542.84, "end": 1543.44, "word": "bældat", "probability": 0.11750292778015137},
{"start": 1543.44, "end": 1544.36, "word": "brilliant", "probability": 7.152557373046875e-07},
{"start": 1544.36, "end": 1545.14, "word": "með", "probability": 0.2783784866333008}
]
},
{
"id": 280,
"seek": 132430,
"start": 1541.32,
"end": 1543.04,
"text": "ð jævîr til det bældat brilliant með",
"tokens": [51350, 23436, 361, 7303, 85, 7517, 81, 8440, 1141, 272, 7303, 348, 267, 10248, 385, 23436, 51436],
"temperature": 1.0,
"avg_logprob": -4.741359252929687,
"compression_ratio": 1.335164835164835,
"no_speech_prob": 0.12347412109375,
"words": []
},
{
"id": 281,
"seek": 135430,
"start": 1545.14,
"end": 1546.3,
"text": " Duð ena porgna prákankenin.",
"tokens": [50364, 5153, 23436, 465, 64, 1515, 70, 629, 582, 842, 5225, 2653, 259, 13, 50431],
"temperature": 1.0,
"avg_logprob": -4.655551255031784,
"compression_ratio": 1.3051771117166213,
"no_speech_prob": 0.036651611328125,
"words": [
{"start": 1545.14, "end": 1545.36, "word": "Duð", "probability": 0.051422119140625},
{"start": 1545.36, "end": 1545.36, "word": "ena", "probability": 0.010187149047851562},
{"start": 1545.36, "end": 1545.44, "word": "porgna", "probability": 0.004482746124267578},
{"start": 1545.44, "end": 1546.3, "word": "prákankenin.", "probability": 0.04590331315994263}
]
}
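Note that segment 280 above ends up with an empty "words" list even though word timestamps were requested. A quick diagnostic sketch for spotting such segments (find_wordless_segments is a made-up helper, not part of faster-whisper):

# List the ids of segments whose word list is empty despite word_timestamps=True,
# like segment 280 above.
def find_wordless_segments(segments):
    return [segment.id for segment in segments if not segment.words]

# Usage sketch:
# segments = list(model.transcribe("audio.wav", word_timestamps=True)[0])
# print(find_wordless_segments(segments))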
I managed to reproduce it consistently on Colab. I also reproduced it on my machine, but not consistently; the reason for the inconsistency is that reproducing it requires the exact same encoder input and generated tokens, and using int8 does not guarantee that, at least on my hardware (RTX 3070 Ti), so I have to run the transcription several times to reproduce it.
What causes the issue is that some segments produce a single timestamp token with no text tokens at all. The find_alignment function used to return an empty list when no words were found, which was fine before #856. After that change we expect find_alignment to return a list of lists, which it does as long as there are text tokens; but in the edge case where there are none, it returns a single list and skips the rest of the loop over the other segments in the batch. It therefore returns fewer alignments than segments, causing the list index out of range error.
I'll open a PR to solve the problem soon
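For illustration, a hedged sketch of the kind of guard that addresses this (align_segments and align_one are made-up names, not the actual find_alignment code): the alignment step always appends one entry per segment, so the caller's per-segment indexing stays in range:

# Sketch only, not the actual PR: keep exactly one word list per segment so that
# len(alignments) == len(segments), even when a segment contains no text tokens.
# `align_one` stands in for the real per-segment alignment routine.
def align_segments(segment_token_lists, align_one):
    alignments = []
    for text_tokens in segment_token_lists:
        if not text_tokens:              # only a timestamp token, no text tokens
            alignments.append([])        # placeholder keeps segment indices in sync
        else:
            alignments.append(align_one(text_tokens))
    return alignments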
Hi, I found a rare condition: with a specific wav file, a specific language, and a specific prompt, transcribing with word_timestamps=True raises a list index out of range error in the add_word_timestamps function.
It seems the median_max_durations list has fewer elements than the segments list.
I'm using the large-v3-turbo model with the transcribe settings shown at the top of this thread.
As I see it, median_max_durations is populated from alignments, so maybe something is going wrong there? If I change the language or the prompt, or use another sound file, there is no issue.
Thank you