SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License
10.47k stars, 881 forks

Numbers, segments issues with most recent 1.0.3 update #888

Open vkras opened 3 weeks ago

vkras commented 3 weeks ago

1.0.2:

Sentence: [101.50s -> 103.62s] Levante, L-Y-V-O-N-T-E.
[101.50s -> 101.94s] Levante,
[101.94s -> 102.38s] L
[102.38s -> 102.66s] -Y
[102.66s -> 103.00s] -V
[103.00s -> 103.12s] -O
[103.12s -> 103.34s] -N
[103.34s -> 103.40s] -T
[103.40s -> 103.62s] -E.
Sentence: [109.20s -> 109.64s] Yep.
[109.20s -> 109.64s] Yep.
Sentence: [112.75s -> 113.19s] 773.
[112.75s -> 113.19s] 773.
Sentence: [114.45s -> 114.89s] redacted.
[114.45s -> 114.89s] redacted.
Sentence: [117.09s -> 117.53s] redacted.
[117.09s -> 117.53s] redacted.

1.0.3 (same settings):

Sentence: [0 101.40s -> 117.59s] Levante, L-Y-V-O-N-T-E-7-7-3-redacted-redacted-redacted-redacted-redacted.
[0 101.40s -> 101.88s] Levante,
[0 101.88s -> 102.36s] L
[0 102.36s -> 102.64s] -Y
[0 102.64s -> 102.98s] -V
[0 102.98s -> 103.06s] -O
[0 103.06s -> 103.26s] -N
[0 103.26s -> 103.40s] -T
[0 103.40s -> 103.66s] -E
[0 103.66s -> 104.42s] -7 <------ wrong timing
[0 112.71s -> 112.97s] -7
[0 112.97s -> 113.21s] -3
[0 113.21s -> 114.45s] -redacted
[0 114.45s -> 114.71s] -redacted
[0 114.71s -> 114.93s] -redacted
[0 114.93s -> 117.07s] -redacted
[0 117.07s -> 117.59s] -redacted

1) Numbers were presented better in 1.0.2.
2) Two segments were merged for some reason, and the timing of the first digit is about 9 seconds off.

trungkienbkhn commented 3 weeks ago

@vkras, hello. Could you share the full code and attach your example audio?

vkras commented 3 weeks ago

I was able to cut audio to a small snippet: [1CsvZippedTest-968154446-0c.zip](https://github.com/user-attachments/files/16070315/1CsvZippedTest-968154446-0c.zip)

1.0.2:

Sentence: [0 41.56s -> 54.86s] Levante, L-Y-V-O-N-T-E, yep, 773-216.
[0 41.56s -> 41.96s] Levante,
[0 41.96s -> 42.36s] L
[0 42.36s -> 42.66s] -Y
[0 42.66s -> 42.98s] -V
[0 42.98s -> 43.06s] -O
[0 43.06s -> 43.34s] -N
[0 43.34s -> 43.42s] -T
[0 43.42s -> 43.70s] -E,
[0 48.47s -> 49.75s] yep,
[0 51.84s -> 53.20s] 773 <- good timing
[0 53.20s -> 54.86s] -216.

1.0.3:

Sentence: [0 41.56s -> 54.87s] Levante, L-Y-V-O-N-T-E-7-7-3-2-1-6.
[0 41.56s -> 41.96s] Levante,
[0 41.96s -> 42.36s] L
[0 42.36s -> 42.62s] -Y
[0 42.62s -> 42.98s] -V
[0 42.98s -> 43.08s] -O
[0 43.08s -> 43.30s] -N
[0 43.30s -> 43.48s] -T
[0 43.48s -> 43.68s] -E
[0 43.68s -> 44.44s] -7 <- bad timing
[0 52.73s -> 52.99s] -7
[0 52.99s -> 53.23s] -3
[0 53.23s -> 54.47s] -2
[0 54.47s -> 54.71s] -1
[0 54.71s -> 54.87s] -6.
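
For anyone who wants to spot this kind of jump automatically, here is a minimal sketch that flags long gaps between consecutive word timestamps. It is not part of faster-whisper: the `find_gaps` helper and the 2-second `max_gap` threshold are assumptions made for illustration, and the word list is copied from the 1.0.3 output above.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (start, end, text), as printed in the output above


def find_gaps(words: List[Word], max_gap: float = 2.0) -> List[Tuple[str, str, float]]:
    """Return (previous_word, next_word, gap_seconds) for gaps longer than max_gap.

    Hypothetical helper; the threshold is an arbitrary choice.
    """
    gaps = []
    for (_, prev_end, prev_text), (next_start, _, next_text) in zip(words, words[1:]):
        gap = next_start - prev_end
        if gap > max_gap:
            gaps.append((prev_text, next_text, round(gap, 2)))
    return gaps


# Word timings from the 1.0.3 run on the attached snippet.
words_103 = [
    (41.56, 41.96, "Levante,"), (41.96, 42.36, "L"), (42.36, 42.62, "-Y"),
    (42.62, 42.98, "-V"), (42.98, 43.08, "-O"), (43.08, 43.30, "-N"),
    (43.30, 43.48, "-T"), (43.48, 43.68, "-E"), (43.68, 44.44, "-7"),
    (52.73, 52.99, "-7"), (52.99, 53.23, "-3"), (53.23, 54.47, "-2"),
    (54.47, 54.71, "-1"), (54.71, 54.87, "-6."),
]

print(find_gaps(words_103))  # [('-7', '-7', 8.29)]
```

The single flagged gap sits between the first "-7" and the second one, which matches the "bad timing" line marked above.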

The code is pretty straightforward, I'll post relevant pieces:

import json
from faster_whisper import WhisperModel

asr_options_json = '{ "beam_size": 5, "best_of": 5, "patience": 1, "length_penalty": 1, "repetition_penalty": 1, "no_repeat_ngram_size": 0, "temperature": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], "compression_ratio_threshold": 2.4, "log_prob_threshold": -1.0, "no_speech_threshold": 0.6, "condition_on_previous_text": true, "prompt_reset_on_temperature": 0.5, "initial_prompt": null, "prefix": null, "suppress_blank": true, "suppress_tokens": [-1], "max_initial_timestamp": 1.0, "vad_filter": true, "vad_parameters": { "threshold": 0.5, "min_speech_duration_ms": 250, "min_silence_duration_ms": 2000, "speech_pad_ms": 400 }, "max_new_tokens": null, "chunk_length": null, "clip_timestamps": "0", "hallucination_silence_threshold": 1, "hotwords": null}'

asr_options_template = json.loads(asr_options_json)

model = WhisperModel('large-v3', device="cpu", device_index=0, compute_type="int8", download_root='/.cache', cpu_threads=4, local_files_only=False)

segments, info = model.transcribe(effective_audio_path, language="en", word_timestamps=True, **asr_options_template)

for segment in segments:
    if debug_mode:
        print("Sentence: [%s %.2fs -> %.2fs] %s" % (channel, segment.start, segment.end, segment.text))
    for word in segment.words:
        print("[%s %.2fs -> %.2fs] %s" % (channel, word.start, word.end, word.word))

trungkienbkhn commented 3 weeks ago

I ran the above audio example in 1.0.2 and 1.0.3. Below are results with debug log level:

FW 1.0.2

2024-07-03 03:33:01,474 - faster_whisper - INFO - Processing audio with duration 00:55.000
2024-07-03 03:33:01,645 - faster_whisper - INFO - VAD filter removed 00:32.320 of audio
2024-07-03 03:33:01,645 - faster_whisper - DEBUG - VAD filter kept the following audio segments: [00:00.000 -> 00:02.768], [00:07.984 -> 00:09.616], [00:11.568 -> 00:16.848], [00:18.288 -> 00:20.944], [00:23.152 -> 00:24.912], [00:26.608 -> 00:27.920], [00:41.072 -> 00:44.176], [00:48.944 -> 00:50.384], [00:52.272 -> 00:55.000]
2024-07-03 03:33:01,731 - faster_whisper - DEBUG - Processing segment at 00:00.000
Sentence: [0 0.00s -> 2.16s]  How much is a basket of fries?
[0 0.00s -> 0.38s]  How
[0 0.38s -> 0.58s]  much
[0 0.58s -> 1.48s]  is
[0 1.48s -> 1.56s]  a
[0 1.56s -> 1.82s]  basket
[0 1.82s -> 1.96s]  of
[0 1.96s -> 2.16s]  fries?
Sentence: [0 7.88s -> 9.04s]  How much is a half pan?
[0 7.88s -> 8.28s]  How
[0 8.28s -> 8.44s]  much
[0 8.44s -> 8.58s]  is
[0 8.58s -> 8.68s]  a
[0 8.68s -> 8.86s]  half
[0 8.86s -> 9.04s]  pan?
Sentence: [0 11.71s -> 12.11s]  Yeah.
[0 11.71s -> 12.11s]  Yeah.
Sentence: [0 14.87s -> 16.13s]  Okay, can I get a half pan?
[0 14.87s -> 15.27s]  Okay,
[0 15.35s -> 15.45s]  can
[0 15.45s -> 15.57s]  I
[0 15.57s -> 15.67s]  get
[0 15.67s -> 15.75s]  a
[0 15.75s -> 15.93s]  half
[0 15.93s -> 16.13s]  pan?
Sentence: [0 18.29s -> 24.18s]  And then can I get a chicken honey barbecue?
[0 18.29s -> 18.69s]  And
[0 18.69s -> 18.83s]  then
[0 18.83s -> 19.01s]  can
[0 19.01s -> 19.19s]  I
[0 19.19s -> 19.41s]  get
[0 19.41s -> 19.67s]  a
[0 19.67s -> 21.43s]  chicken
[0 23.64s -> 23.86s]  honey
[0 23.86s -> 24.18s]  barbecue?
Sentence: [0 26.63s -> 27.35s]  No, that's it.
[0 26.63s -> 27.03s]  No,
[0 27.07s -> 27.31s]  that's
[0 27.31s -> 27.35s]  it.
Sentence: [0 41.56s -> 54.86s]  Levante, L-Y-V-O-N-T-E, yep, 773-216.
[0 41.56s -> 41.96s]  Levante,
[0 41.96s -> 42.36s]  L
[0 42.36s -> 42.66s] -Y
[0 42.66s -> 43.00s] -V
[0 43.00s -> 43.06s] -O
[0 43.06s -> 43.32s] -N
[0 43.32s -> 43.42s] -T
[0 43.42s -> 43.70s] -E,
[0 48.47s -> 49.75s]  yep,
[0 51.82s -> 53.20s]  773
[0 53.20s -> 54.86s] -216.

FW 1.0.3

2024-07-03 03:33:46,042 - faster_whisper - INFO - Processing audio with duration 00:55.000
2024-07-03 03:33:46,279 - faster_whisper - INFO - VAD filter removed 00:35.392 of audio
2024-07-03 03:33:46,279 - faster_whisper - DEBUG - VAD filter kept the following audio segments: [00:00.000 -> 00:02.832], [00:08.016 -> 00:09.648], [00:11.632 -> 00:13.616], [00:14.928 -> 00:16.816], [00:18.256 -> 00:20.976], [00:23.184 -> 00:24.880], [00:26.608 -> 00:27.920], [00:41.168 -> 00:44.240], [00:52.528 -> 00:55.000]
2024-07-03 03:33:46,336 - faster_whisper - DEBUG - Processing segment at 00:00.000
Sentence: [0 0.00s -> 2.16s]  How much is a basket of fries?
[0 0.00s -> 0.38s]  How
[0 0.38s -> 0.58s]  much
[0 0.58s -> 1.48s]  is
[0 1.48s -> 1.56s]  a
[0 1.56s -> 1.82s]  basket
[0 1.82s -> 1.96s]  of
[0 1.96s -> 2.16s]  fries?
Sentence: [0 7.86s -> 9.04s]  How much is a half pan?
[0 7.86s -> 8.26s]  How
[0 8.26s -> 8.44s]  much
[0 8.44s -> 8.56s]  is
[0 8.56s -> 8.68s]  a
[0 8.68s -> 8.84s]  half
[0 8.84s -> 9.04s]  pan?
Sentence: [0 11.69s -> 12.09s]  Yeah.
[0 11.69s -> 12.09s]  Yeah.
Sentence: [0 14.88s -> 16.14s]  Okay, can I get a half pan?
[0 14.88s -> 15.28s]  Okay,
[0 15.34s -> 15.46s]  can
[0 15.46s -> 15.54s]  I
[0 15.54s -> 15.66s]  get
[0 15.66s -> 15.74s]  a
[0 15.74s -> 15.92s]  half
[0 15.92s -> 16.14s]  pan?
Sentence: [0 18.28s -> 24.21s]  And then can I get a Chick's Honey Barbecue?
[0 18.28s -> 18.68s]  And
[0 18.68s -> 18.82s]  then
[0 18.82s -> 19.02s]  can
[0 19.02s -> 19.18s]  I
[0 19.18s -> 19.42s]  get
[0 19.42s -> 19.68s]  a
[0 19.68s -> 21.48s]  Chick's
[0 23.69s -> 23.79s]  Honey
[0 23.79s -> 24.21s]  Barbecue?
Sentence: [0 26.64s -> 27.36s]  No, that's it.
[0 26.64s -> 27.04s]  No,
[0 27.08s -> 27.32s]  that's
[0 27.32s -> 27.36s]  it.
Sentence: [0 41.56s -> 54.85s]  Levante, L-Y-V-O-N-T-E-7-7-3-2-1-6.
[0 41.56s -> 41.96s]  Levante,
[0 41.96s -> 42.36s]  L
[0 42.36s -> 42.64s] -Y
[0 42.64s -> 42.98s] -V
[0 42.98s -> 43.06s] -O
[0 43.06s -> 43.30s] -N
[0 43.30s -> 43.48s] -T
[0 43.48s -> 43.66s] -E
[0 43.66s -> 44.44s] -7
[0 52.73s -> 52.97s] -7
[0 52.97s -> 53.23s] -3
[0 53.23s -> 54.47s] -2
[0 54.47s -> 54.71s] -1
[0 54.71s -> 54.85s] -6.
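
The "VAD filter kept the following audio segments" lines in both logs are emitted at DEBUG level by the `faster_whisper` logger. A minimal sketch for enabling them with the standard `logging` module follows; the format string is an assumption inferred from the log lines above, not necessarily the one used for these runs.

```python
import logging

# Format is a guess that mirrors "timestamp - logger name - level - message".
logging.basicConfig(format="%(asctime)s - %(name)s - %(levelname)s - %(message)s")

# The logger name matches the "faster_whisper" field shown in the logs.
logger = logging.getLogger("faster_whisper")
logger.setLevel(logging.DEBUG)
```

This has to run before `model.transcribe(...)` is called for the VAD details to show up.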

It seems that Silero VAD V5 drops the audio segment [00:48.944 -> 00:50.384], which 1.0.2 kept and which is roughly where "yep" is spoken. @hoonlight, do you have any ideas about this problem?
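
The comparison can be reproduced from the two debug lines alone. Below is a minimal sketch that parses the kept segments from each log and lists the ones from 1.0.2 that overlap nothing in 1.0.3. The helper names `parse_kept_segments` and `dropped_segments` are made up for this example and are not part of faster-whisper.

```python
import re
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def parse_kept_segments(log_text: str) -> List[Interval]:
    """Parse '[MM:SS.mmm -> MM:SS.mmm]' pairs from a VAD debug line into seconds."""
    pattern = r"\[(\d+):(\d+\.\d+) -> (\d+):(\d+\.\d+)\]"
    return [
        (int(m1) * 60 + float(s1), int(m2) * 60 + float(s2))
        for m1, s1, m2, s2 in re.findall(pattern, log_text)
    ]


def dropped_segments(old: List[Interval], new: List[Interval]) -> List[Interval]:
    """Return intervals from `old` that do not overlap any interval in `new`."""
    return [
        (a_start, a_end)
        for a_start, a_end in old
        if not any(b_start < a_end and a_start < b_end for b_start, b_end in new)
    ]


# Segment lists copied from the two debug logs above.
kept_102 = parse_kept_segments(
    "[00:00.000 -> 00:02.768], [00:07.984 -> 00:09.616], [00:11.568 -> 00:16.848], "
    "[00:18.288 -> 00:20.944], [00:23.152 -> 00:24.912], [00:26.608 -> 00:27.920], "
    "[00:41.072 -> 00:44.176], [00:48.944 -> 00:50.384], [00:52.272 -> 00:55.000]"
)
kept_103 = parse_kept_segments(
    "[00:00.000 -> 00:02.832], [00:08.016 -> 00:09.648], [00:11.632 -> 00:13.616], "
    "[00:14.928 -> 00:16.816], [00:18.256 -> 00:20.976], [00:23.184 -> 00:24.880], "
    "[00:26.608 -> 00:27.920], [00:41.168 -> 00:44.240], [00:52.528 -> 00:55.000]"
)

print(dropped_segments(kept_102, kept_103))  # [(48.944, 50.384)]
```

Only one segment from the 1.0.2 run has no counterpart in the 1.0.3 run, and it is the one mentioned above.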

hoonlight commented 2 weeks ago

Yes, I will take a look at this issue. I'm not sure, but it seems like it could be related to a change in the internal behavior logic of the VAD, or it could be an issue with the V5 model itself.