Running the same code twice giving two different results

bchinnari commented 2 weeks ago

Hi, I am running faster-whisper on an audio file as follows:

segments, info = model.transcribe(wav, task="transcribe", language="hi", beam_size=1, word_timestamps=True, max_new_tokens=50)

The same code sometimes gives two segments and sometimes one segment on the same audio file. I find this weird. Is this expected? Whenever it gives two segments, the second one is always an "insertion": there is no speech at that point, but the model outputs some words anyway.

However, if I slightly modify the above statement so that it does not output word timestamps, as follows:

segments, info = model.transcribe(wav, task="transcribe", language="hi", beam_size=1, max_new_tokens=50)

I always get only one segment in the output, with good accuracy. Is the presence of word_timestamps=True messing this up?
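To check whether the segment count really changes from run to run, something like the sketch below might help. It calls a transcription function several times and tallies how many segments each run produces. The names `tally_segment_counts`, `transcribe_once` and `fake_transcribe` are hypothetical and used only for illustration; with a real model, the callable would wrap the `model.transcribe(...)` call shown above. faster-whisper returns the segments as a generator, so they are consumed with `list()` before counting.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def tally_segment_counts(
    transcribe_once: Callable[[], Tuple[Iterable, object]],
    runs: int = 5,
) -> Counter:
    """Call the transcription function `runs` times and count segments per run."""
    counts = Counter()
    for _ in range(runs):
        segments, _info = transcribe_once()
        # Consume the iterable fully; a lazy generator does no work until iterated.
        counts[len(list(segments))] += 1
    return counts


# Hypothetical stand-in so the sketch runs without a model or an audio file.
def fake_transcribe():
    return iter(["segment one"]), None


if __name__ == "__main__":
    # Maps "number of segments in a run" to "how many runs produced that number".
    print(tally_segment_counts(fake_transcribe, runs=3))
```

With the real model, `transcribe_once` could be a lambda around the `model.transcribe(...)` call with `word_timestamps=True`. More than one key in the resulting counter would confirm that the number of segments varies between identical runs.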

bchinnari commented 2 weeks ago

Is this possible? Has anyone else observed this?

bchinnari commented 2 weeks ago

Ok. Here is what I did. I took a pretrained HF model (https://huggingface.co/vasista22/whisper-hindi-small) and fine-tuned it using my data. Then I converted the checkpoint to faster-whisper format.

If I pass word_timestamps=True to the transcribe function, I get extra (useless) segments in the output. I don't know why.

This does not happen if I use the whisper model directly for transcription. It happens only with my fine-tuned model.

bchinnari commented 2 weeks ago

When word_timestamps=False, the output is as follows:

Segment(id=1, seek=600, start=0.0, end=6.0, text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.17912933772260492, compression_ratio=0.6857142857142857, no_speech_prob=1.3633834695708693e-14, words=None)

When it is True, the output is like this:

Segment(id=1, seek=252, start=np.float64(0.0), end=np.float64(2.52), text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.17907507040283896, compression_ratio=0.6857142857142857, no_speech_prob=1.3633834695708693e-14, words=[Word(start=np.float64(0.0), end=np.float64(2.16), word='सितम्बर', probability=np.float64(0.999978095293045)), Word(start=np.float64(2.16), end=np.float64(2.52), word=' 19', probability=np.float64(0.9993481040000916))])

Segment(id=2, seek=496, start=np.float64(2.52), end=np.float64(4.96), text='सितम्बर', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411], temperature=0.0, avg_logprob=-0.4705956637859344, compression_ratio=0.65625, no_speech_prob=0.02291429601609707, words=[Word(start=np.float64(2.52), end=np.float64(4.96), word='सितम्बर', probability=np.float64(0.741854028776288))])

Segment(id=3, seek=598, start=np.float64(4.96), end=np.float64(5.98), text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.4931728406385942, compression_ratio=0.6857142857142857, no_speech_prob=0.2716793119907379, words=[Word(start=np.float64(4.96), end=np.float64(5.98), word='सितम्बर', probability=np.float64(0.7634602943435311)), Word(start=np.float64(5.98), end=np.float64(5.98), word=' 19', probability=np.float64(4.705471383203985e-06))])

When the flag is False, the text and the number of segments are both correct, but the end of the segment is reported as 6.0, which is incorrect: 6 seconds is the duration of the whole wave file.
When the flag is True, the text and the end time of the first segment are correct, but the output contains two more segments, which is incorrect.
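In the output above, the two extra segments differ from the first one in a few fields: their avg_logprob is noticeably lower (about -0.47 and -0.49 versus -0.18), and segment 3 contains a zero-length word with a probability close to zero. A minimal sketch of a post-filter built on those fields is shown below. `WordLite`, `SegmentLite`, `is_suspicious` and both thresholds are assumptions made up for illustration; they are not part of faster-whisper, and the thresholds are chosen only to fit this one example. The numbers are rounded copies of the values in the output above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordLite:
    """Hypothetical stand-in holding only the word fields used by the check."""
    start: float
    end: float
    probability: float


@dataclass
class SegmentLite:
    """Hypothetical stand-in holding only the segment fields used by the check."""
    id: int
    avg_logprob: float
    no_speech_prob: float
    words: List[WordLite] = field(default_factory=list)


def is_suspicious(seg: SegmentLite,
                  min_avg_logprob: float = -0.4,   # assumed threshold
                  min_word_prob: float = 0.01) -> bool:  # assumed threshold
    """Flag segments with a low average log-probability or a zero-length,
    very-low-probability word."""
    if seg.avg_logprob < min_avg_logprob:
        return True
    for w in seg.words:
        if w.end <= w.start and w.probability < min_word_prob:
            return True
    return False


# Rounded values taken from the three segments printed with word_timestamps=True.
example_segments = [
    SegmentLite(1, -0.179, 1.36e-14,
                [WordLite(0.0, 2.16, 0.99998), WordLite(2.16, 2.52, 0.99935)]),
    SegmentLite(2, -0.471, 0.0229,
                [WordLite(2.52, 4.96, 0.74185)]),
    SegmentLite(3, -0.493, 0.2717,
                [WordLite(4.96, 5.98, 0.76346), WordLite(5.98, 5.98, 4.7e-06)]),
]

flagged_ids = [s.id for s in example_segments if is_suspicious(s)]

if __name__ == "__main__":
    print(flagged_ids)
```

On these three segments the check flags segments 2 and 3 and keeps segment 1. This only hides the symptom; it does not explain why the fine-tuned model produces the extra segments in the first place.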

Is there something obviously wrong here?

SYSTRAN / faster-whisper

Running the same code twice giving two different results #1085