SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Segment timestamps are buggy in BatchedInferencePipeline #919

Closed kalradivyanshu closed 1 month ago

kalradivyanshu commented 1 month ago

With the code recently added in #856, the segment timestamps returned by BatchedInferencePipeline are sometimes wrong.

I recorded an audio clip: https://drive.google.com/file/d/1cbDbiXi12SIsd0hIDfs61VdtgI78Fg_p/view?usp=sharing

For this clip, BatchedInferencePipeline gives Segment(id=1, seek=2307, start=18.71, end=23.07... even though the clip is only 3 seconds long.

Here is a Colab notebook that reproduces the issue; just upload batch.wav from the link above: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing

segments, info = batched_model.transcribe(arr, word_timestamps=True, batch_size = 1)
print(list(segments))

output:

[Segment(id=1, seek=2307, start=18.71, end=23.07, text=' Hey Michael, how are you?', tokens=[14690, 3899, 11, 703, 389, 345, 30], avg_logprob=-0.21215820871293545, compression_ratio=0.7647058823529411, no_speech_prob=0.08087158203125, words=[Word(start=18.71, end=20.11, word=' Hey', probability=0.71240234375), Word(start=20.11, end=20.11, word=' Michael,', probability=0.732421875), Word(start=22.71, end=22.71, word=' how', probability=0.98046875), Word(start=22.71, end=22.71, word=' are', probability=0.9990234375), Word(start=22.71, end=23.07, word=' you?', probability=0.99853515625)], temperature=1.0)]
segments, info = model.transcribe(arr, word_timestamps=True)
print(list(segments))

correct output:

[Segment(id=1, seek=330, start=0.9999999999999996, end=2.5, text=' Hey Michael, how are you?', tokens=[50413, 14690, 3899, 11, 703, 389, 345, 30, 50513], avg_logprob=-0.568749976158142, compression_ratio=0.7575757575757576, no_speech_prob=0.078125, words=[Word(start=0.9999999999999996, end=1.4, word=' Hey', probability=0.7177734375), Word(start=1.4, end=1.6, word=' Michael,', probability=0.68896484375), Word(start=2.04, end=2.14, word=' how', probability=0.98046875), Word(start=2.14, end=2.32, word=' are', probability=0.998046875), Word(start=2.32, end=2.5, word=' you?', probability=0.99755859375)], temperature=0.0)]
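A mismatch like this can be caught automatically by comparing each timestamp with the clip duration. The following is only a minimal sketch: it assumes the clip is exactly 3.0 seconds of 16 kHz audio, it works on plain (start, end) tuples rather than faster-whisper's Segment objects, and find_out_of_range is a hypothetical helper, not part of the library.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def find_out_of_range(spans: List[Span], duration: float,
                      tolerance: float = 0.05) -> List[Span]:
    """Return the spans that fall outside [0, duration] or run backwards."""
    return [
        (start, end)
        for start, end in spans
        if start < -tolerance or end > duration + tolerance or end < start
    ]


# Assumed clip length: 3 seconds of audio sampled at 16 kHz.
sample_rate = 16000
num_samples = 3 * sample_rate
duration = num_samples / sample_rate

# Segment bounds copied from the two outputs above.
batched_spans = [(18.71, 23.07)]
sequential_spans = [(0.9999999999999996, 2.5)]

bad_batched = find_out_of_range(batched_spans, duration)
bad_sequential = find_out_of_range(sequential_spans, duration)
print(bad_batched, bad_sequential)
```

With real output, the tuples could be built from each segment's start and end attributes.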

Thank you for all your work!

kalradivyanshu commented 1 month ago

I did some digging, and I think the word timestamps are the problem.

In these lines: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py#L1794-L1814

The last word's end timestamp is used as the segment's end timestamp, which points to the error being in the find_alignment function: https://github.com/SYSTRAN/faster-whisper/blob/eb8390233c160a8232abf88f9b949eb5cbc48df8/faster_whisper/transcribe.py#L1820

This is confirmed by the fact that the segment timestamps are correct if I don't pass word_timestamps=True.
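To illustrate why bad word timestamps propagate to the segment, here is a rough sketch of that dependency. It is not the library's code: Word is a stand-in tuple, segment_bounds_from_words is a hypothetical helper, and taking the segment start from the first word is an assumption made for symmetry with the end.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """Stand-in for a word-level alignment result (times in seconds)."""
    start: float
    end: float
    word: str


def segment_bounds_from_words(words: List[Word]) -> Tuple[float, float]:
    """Take the segment start from the first word and the end from the last."""
    if not words:
        raise ValueError("cannot derive bounds from an empty word list")
    return words[0].start, words[-1].end


# Word timestamps copied from the faulty batched output in the first comment.
faulty_words = [
    Word(18.71, 20.11, " Hey"),
    Word(20.11, 20.11, " Michael,"),
    Word(22.71, 22.71, " how"),
    Word(22.71, 22.71, " are"),
    Word(22.71, 23.07, " you?"),
]

start, end = segment_bounds_from_words(faulty_words)
print(start, end)
```

Under these assumptions, whatever offset the alignment step introduces in the word times shows up unchanged in the segment's start and end.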

Jiltseb commented 1 month ago

@MahmoudAshraf97

MahmoudAshraf97 commented 1 month ago

This should be fixed in #920.

kalradivyanshu commented 1 month ago

Hey @MahmoudAshraf97, thank you for your quick response and fix!

I have been able to reproduce the issue with your fix, but only in one specific case.

audio: https://drive.google.com/file/d/1oK0x7OF_JrfWa7ot1-bkTSP1m_tP_vEk/view?usp=sharing

colab: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing

In this case it works if I use the built-in VAD of batched_model.transcribe, but it breaks if I pass in the VAD segments explicitly as [{'start': 0.0, 'end': 30.0, 'segments': [(0.832, 7.024000000000001)]}]:

[Segment(id=1, seek=2998, start=0.0, end=29.98, text=" Good morning, Hank, it's Tuesday. You know how they're those videos that are like so-and-so answers the web's most searched questions about them, and", tokens=[4599, 3329, 11, 24386, 11, 340, 338, 3431, 13, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 11, 290], avg_logprob=-0.22146267195542654, compression_ratio=1.2295081967213115, no_speech_prob=0.03997802734375, words=[Word(start=0.0, end=0.1, word=' Good', probability=0.87646484375), Word(start=0.1, end=0.3, word=' morning,', probability=0.91845703125), Word(start=0.34, end=0.46, word=' Hank,', probability=0.409423828125), Word(start=0.46, end=0.62, word=" it's", probability=0.983154296875), Word(start=0.62, end=0.74, word=' Tuesday.', probability=0.99755859375), Word(start=1.02, end=1.02, word=' You', probability=0.97119140625), Word(start=1.02, end=1.1, word=' know', probability=0.99658203125), Word(start=1.1, end=1.22, word=' how', probability=0.87548828125), Word(start=1.22, end=1.4, word=" they're", probability=0.70458984375), Word(start=1.4, end=1.54, word=' those', probability=0.943359375), Word(start=1.54, end=1.86, word=' videos', probability=0.99560546875), Word(start=1.86, end=2.1, word=' that', probability=0.974609375), Word(start=2.1, end=2.22, word=' are', probability=0.9951171875), Word(start=2.22, end=2.38, word=' like', probability=0.66796875), Word(start=2.38, end=2.68, word=' so', probability=0.7060546875), Word(start=2.68, end=2.82, word='-and', probability=0.852294921875), Word(start=2.82, end=29.98, word='-so', probability=0.996337890625), Word(start=29.98, end=29.98, word=' answers', probability=0.98046875), Word(start=29.98, end=29.98, word=' the', probability=0.962890625), Word(start=29.98, end=29.98, word=" web's", probability=0.869140625), Word(start=29.98, end=29.98, word=' most', probability=0.9443359375), Word(start=29.98, end=29.98, word=' searched', probability=0.9453125), Word(start=29.98, end=29.98, word=' questions', probability=0.99365234375), Word(start=29.98, end=29.98, word=' about', probability=0.9990234375), Word(start=29.98, end=29.98, word=' them,', probability=0.9970703125), Word(start=29.98, end=29.98, word=' and', probability=0.9873046875)], temperature=1.0)]

However, if I pass the same segment into the regular transcribe using clip_timestamps:

segments, info = model.transcribe(arr, word_timestamps=True, clip_timestamps = [0.832, 7.024000000000001], vad_filter=False)

That works correctly:

[Segment(id=1, seek=702, start=0.83, end=5.93, text=" You know how they're those videos that are like so-and-so answers the web's most searched questions about them and", tokens=[50363, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 290, 50619], avg_logprob=-0.22739954824958528, compression_ratio=1.1875, no_speech_prob=0.02056884765625, words=[Word(start=0.83, end=0.99, word=' You', probability=0.37646484375), Word(start=0.99, end=1.11, word=' know', probability=0.99267578125), Word(start=1.11, end=1.23, word=' how', probability=0.81787109375), Word(start=1.23, end=1.37, word=" they're", probability=0.66845703125), Word(start=1.37, end=1.55, word=' those', probability=0.8408203125), Word(start=1.55, end=1.87, word=' videos', probability=0.990234375), Word(start=1.87, end=2.11, word=' that', probability=0.96337890625), Word(start=2.11, end=2.21, word=' are', probability=0.9931640625), Word(start=2.21, end=2.37, word=' like', probability=0.8662109375), Word(start=2.37, end=2.69, word=' so', probability=0.7451171875), Word(start=2.69, end=2.83, word='-and', probability=0.7646484375), Word(start=2.83, end=2.99, word='-so', probability=0.99755859375), Word(start=2.99, end=3.35, word=' answers', probability=0.96826171875), Word(start=3.35, end=3.67, word=' the', probability=0.95703125), Word(start=3.67, end=4.07, word=" web's", probability=0.839111328125), Word(start=4.07, end=4.25, word=' most', probability=0.970703125), Word(start=4.25, end=4.51, word=' searched', probability=0.943359375), Word(start=4.51, end=4.87, word=' questions', probability=0.99462890625), Word(start=4.87, end=5.47, word=' about', probability=0.9970703125), Word(start=5.47, end=5.77, word=' them', probability=0.99755859375), Word(start=5.77, end=5.93, word=' and', probability=0.55078125)], temperature=0.0)]

I think it's some edge condition :/

MahmoudAshraf97 commented 1 month ago

That is because the inputs are not the same. The end of a VAD segment should never be greater than the end of its last subsegment. In your case the segment end is 30.0 while the subsegment end is 7.024, which is equivalent to clip_timestamps = [0.0, 30.0].
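A small pre-check along these lines could enforce that rule before the segments are handed to the batched pipeline. This is a sketch only: it assumes the dictionary layout shown in the comment above ('start', 'end', and a 'segments' list of (start, end) tuples in seconds), and clamp_chunk_ends is a hypothetical helper, not a faster-whisper function.

```python
from typing import Any, Dict, List


def clamp_chunk_ends(chunks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the chunks whose 'end' never exceeds the end of
    their last subsegment. The input list is left unmodified."""
    fixed = []
    for chunk in chunks:
        subsegments = chunk["segments"]
        if not subsegments:
            raise ValueError("a VAD chunk needs at least one subsegment")
        last_end = subsegments[-1][1]  # end time of the last (start, end) pair
        fixed.append({**chunk, "end": min(chunk["end"], last_end)})
    return fixed


# The input reported above: the chunk end (30.0) is far past the speech end.
chunks = [{"start": 0.0, "end": 30.0, "segments": [(0.832, 7.024000000000001)]}]

fixed_chunks = clamp_chunk_ends(chunks)
print(fixed_chunks)
```

Clamping silently is one design choice; raising an error on a mismatch may be preferable when the segments come from a custom VAD whose output should be inspected.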

kalradivyanshu commented 1 month ago

Oh ok, my bad, I fixed that, and it is working now!