Closed kalradivyanshu closed 1 month ago
Did some digging around; I think the word timestamps are the problem.
In these lines: https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py#L1794-L1814
The last word's end timestamp is used as the segment's end timestamp, which points to the error being in the find_alignment function: https://github.com/SYSTRAN/faster-whisper/blob/eb8390233c160a8232abf88f9b949eb5cbc48df8/faster_whisper/transcribe.py#L1820
This can be confirmed by the fact that if I don't send word_timestamps=True, the segment timestamps are correct.
@MahmoudAshraf97
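To illustrate the behavior described above, here is a minimal sketch of how a segment's end can follow its last word's end. This is not the faster-whisper implementation: the `Word` dataclass and the helper name `segment_bounds_from_words` are hypothetical and only mirror the idea that, once word timestamps exist, the segment bounds are taken from them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    # Simplified stand-in for a word-level result (assumption, not the library class).
    start: float
    end: float
    word: str


def segment_bounds_from_words(
    words: List[Word], fallback_start: float, fallback_end: float
) -> Tuple[float, float]:
    """Return (start, end) for a segment, preferring word timestamps when present."""
    if not words:
        # Without word timestamps, the decoder's own segment bounds are kept.
        return fallback_start, fallback_end
    # With word timestamps, the first/last word define the segment bounds,
    # so one bad word end propagates to the whole segment.
    return words[0].start, words[-1].end


# Values taken from the broken output reported later in this thread.
words = [Word(2.68, 2.82, "-and"), Word(2.82, 29.98, "-so"), Word(29.98, 29.98, " and")]
print(segment_bounds_from_words(words, 0.0, 7.02))
```

Under this assumption, a single misaligned word is enough to push the segment end far past the real speech, which matches the symptom of correct segment timestamps when word_timestamps is off.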
should be fixed in #920
Hey @MahmoudAshraf97 Thank you for your quick response and fix!
I was able to reproduce it with your code, but only in a specific case. Audio: https://drive.google.com/file/d/1oK0x7OF_JrfWa7ot1-bkTSP1m_tP_vEk/view?usp=sharing
colab: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing
In this, it works if I use batched_model.transcribe's VAD, but if I pass in VAD segments specifically: [{'start': 0.0, 'end': 30.0, 'segments': [(0.832, 7.024000000000001)]}]
it breaks:
[Segment(id=1, seek=2998, start=0.0, end=29.98, text=" Good morning, Hank, it's Tuesday. You know how they're those videos that are like so-and-so answers the web's most searched questions about them, and", tokens=[4599, 3329, 11, 24386, 11, 340, 338, 3431, 13, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 11, 290], avg_logprob=-0.22146267195542654, compression_ratio=1.2295081967213115, no_speech_prob=0.03997802734375, words=[Word(start=0.0, end=0.1, word=' Good', probability=0.87646484375), Word(start=0.1, end=0.3, word=' morning,', probability=0.91845703125), Word(start=0.34, end=0.46, word=' Hank,', probability=0.409423828125), Word(start=0.46, end=0.62, word=" it's", probability=0.983154296875), Word(start=0.62, end=0.74, word=' Tuesday.', probability=0.99755859375), Word(start=1.02, end=1.02, word=' You', probability=0.97119140625), Word(start=1.02, end=1.1, word=' know', probability=0.99658203125), Word(start=1.1, end=1.22, word=' how', probability=0.87548828125), Word(start=1.22, end=1.4, word=" they're", probability=0.70458984375), Word(start=1.4, end=1.54, word=' those', probability=0.943359375), Word(start=1.54, end=1.86, word=' videos', probability=0.99560546875), Word(start=1.86, end=2.1, word=' that', probability=0.974609375), Word(start=2.1, end=2.22, word=' are', probability=0.9951171875), Word(start=2.22, end=2.38, word=' like', probability=0.66796875), Word(start=2.38, end=2.68, word=' so', probability=0.7060546875), Word(start=2.68, end=2.82, word='-and', probability=0.852294921875), Word(start=2.82, end=29.98, word='-so', probability=0.996337890625), Word(start=29.98, end=29.98, word=' answers', probability=0.98046875), Word(start=29.98, end=29.98, word=' the', probability=0.962890625), Word(start=29.98, end=29.98, word=" web's", probability=0.869140625), Word(start=29.98, end=29.98, word=' most', probability=0.9443359375), Word(start=29.98, end=29.98, word=' searched', probability=0.9453125), Word(start=29.98, end=29.98, word=' questions', probability=0.99365234375), Word(start=29.98, end=29.98, word=' about', probability=0.9990234375), Word(start=29.98, end=29.98, word=' them,', probability=0.9970703125), Word(start=29.98, end=29.98, word=' and', probability=0.9873046875)], temperature=1.0)]
However, if I pass the same segments into the normal transcribe, using clip_timestamps:
segments, info = model.transcribe(arr, word_timestamps=True, clip_timestamps = [0.832, 7.024000000000001], vad_filter=False)
That works correctly:
[Segment(id=1, seek=702, start=0.83, end=5.93, text=" You know how they're those videos that are like so-and-so answers the web's most searched questions about them and", tokens=[50363, 921, 760, 703, 484, 821, 883, 5861, 326, 389, 588, 523, 12, 392, 12, 568, 7429, 262, 3992, 338, 749, 16499, 2683, 546, 606, 290, 50619], avg_logprob=-0.22739954824958528, compression_ratio=1.1875, no_speech_prob=0.02056884765625, words=[Word(start=0.83, end=0.99, word=' You', probability=0.37646484375), Word(start=0.99, end=1.11, word=' know', probability=0.99267578125), Word(start=1.11, end=1.23, word=' how', probability=0.81787109375), Word(start=1.23, end=1.37, word=" they're", probability=0.66845703125), Word(start=1.37, end=1.55, word=' those', probability=0.8408203125), Word(start=1.55, end=1.87, word=' videos', probability=0.990234375), Word(start=1.87, end=2.11, word=' that', probability=0.96337890625), Word(start=2.11, end=2.21, word=' are', probability=0.9931640625), Word(start=2.21, end=2.37, word=' like', probability=0.8662109375), Word(start=2.37, end=2.69, word=' so', probability=0.7451171875), Word(start=2.69, end=2.83, word='-and', probability=0.7646484375), Word(start=2.83, end=2.99, word='-so', probability=0.99755859375), Word(start=2.99, end=3.35, word=' answers', probability=0.96826171875), Word(start=3.35, end=3.67, word=' the', probability=0.95703125), Word(start=3.67, end=4.07, word=" web's", probability=0.839111328125), Word(start=4.07, end=4.25, word=' most', probability=0.970703125), Word(start=4.25, end=4.51, word=' searched', probability=0.943359375), Word(start=4.51, end=4.87, word=' questions', probability=0.99462890625), Word(start=4.87, end=5.47, word=' about', probability=0.9970703125), Word(start=5.47, end=5.77, word=' them', probability=0.99755859375), Word(start=5.77, end=5.93, word=' and', probability=0.55078125)], temperature=0.0)]
I think it's some edge case :/
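For anyone trying to spot this symptom automatically, here is a small diagnostic sketch. It flags a trailing run of zero-duration words that all sit on the same timestamp, as in the broken output above (every word after "-so" has start == end == 29.98). The helper name `trailing_collapsed_run`, the `W` namedtuple, and the `min_run` threshold are assumptions made for illustration; the function only needs objects with `start` and `end` attributes.

```python
from collections import namedtuple

# Minimal stand-in for a word result; any object with .start and .end works.
W = namedtuple("W", "start end word")


def trailing_collapsed_run(words, min_run=3):
    """Return the index where a trailing run of collapsed words begins, or None.

    A word counts as collapsed when its start and end both equal the end
    timestamp of the last word in the list.
    """
    if not words:
        return None
    t = words[-1].end
    i = len(words)
    # Walk backwards while words are zero-length and pinned to timestamp t.
    # Exact float comparison is intentional: collapsed words carry the same value.
    while i > 0 and words[i - 1].start == t and words[i - 1].end == t:
        i -= 1
    run_length = len(words) - i
    return i if run_length >= min_run else None


broken = [
    W(2.68, 2.82, "-and"),
    W(2.82, 29.98, "-so"),
    W(29.98, 29.98, " answers"),
    W(29.98, 29.98, " the"),
    W(29.98, 29.98, " web's"),
]
healthy = [W(5.47, 5.77, " them"), W(5.77, 5.93, " and")]

print(trailing_collapsed_run(broken))
print(trailing_collapsed_run(healthy))
```

A non-None result suggests the alignment ran out of audio frames before it ran out of words, which is worth checking against the chunk boundaries that were passed in.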
Because the input is not the same: the end of a VAD segment should never be more than the end of its last subsegment. In your case the end is 30.0 while the subsegment end is 7.024.
This is equivalent to
clip_timestamps = [0.0, 30.0]
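The rule stated above can be checked before calling the batched pipeline. The following is a minimal sketch that assumes the chunk layout shown earlier in this thread (a dict with 'start', 'end', and a 'segments' list of (start, end) tuples); the helper name `check_vad_chunks` is hypothetical and not part of faster-whisper.

```python
def check_vad_chunks(chunks):
    """Return a list of problems found in manually built VAD chunks.

    Checks one rule from this thread: a chunk's 'end' must not exceed
    the end of its last subsegment.
    """
    problems = []
    for i, chunk in enumerate(chunks):
        subsegments = chunk["segments"]
        if not subsegments:
            problems.append(f"chunk {i}: no subsegments")
            continue
        last_end = subsegments[-1][1]
        if chunk["end"] > last_end:
            problems.append(
                f"chunk {i}: end {chunk['end']} exceeds last subsegment end {last_end}"
            )
    return problems


# The input reported in this thread violates the rule ...
bad = [{"start": 0.0, "end": 30.0, "segments": [(0.832, 7.024000000000001)]}]
# ... while a chunk whose end matches its last subsegment passes.
good = [{"start": 0.0, "end": 7.024000000000001, "segments": [(0.832, 7.024000000000001)]}]

print(check_vad_chunks(bad))
print(check_vad_chunks(good))
```

Running a check like this on hand-built chunks would catch the 30.0 versus 7.024 mismatch before transcription.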
Oh ok, my bad, I fixed that, and it is working now!
In the recent code from #856, BatchedInferencePipeline sometimes returns wrong segments for some reason. I recorded an audio clip: https://drive.google.com/file/d/1cbDbiXi12SIsd0hIDfs61VdtgI78Fg_p/view?usp=sharing
For this clip, BatchedInferencePipeline gives Segment(id=1, seek=2307, start=18.71, end=23.07... even though the clip is only 3 seconds long.
Here is a colab recreating this issue; just upload batch.wav from the link above: https://colab.research.google.com/drive/1ie7uMFW_LJUvxGHW3KkT5iG8uUZiTVwU?usp=sharing
output:
correct output:
Thank you for all your work!