Open bchinnari opened 2 weeks ago
Is this possible? Has anyone else observed this?
OK, here is what I did. I took a pretrained HF model (https://huggingface.co/vasista22/whisper-hindi-small) and fine-tuned it on my own data. Then I converted the checkpoint to faster-whisper format.
If I pass "word_timestamps=True" to the transcribe function, I get extra (useless) segments in the output, and I don't know why.
This does not happen if I use the Whisper model directly for transcription; it happens only with my fine-tuned model.
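For reference, the comparison described above can be sketched roughly as follows. This is a minimal sketch, not the exact script used: the model directory and audio path are placeholders supplied by the caller, `describe` and `transcribe_both` are hypothetical helper names, and it assumes the fine-tuned checkpoint has already been converted to CTranslate2 format (for example with `ct2-transformers-converter`).

```python
def describe(segment):
    """Return the fields that differ between the two runs as a compact tuple."""
    return (
        segment.id,
        segment.seek,
        round(float(segment.start), 2),
        round(float(segment.end), 2),
        segment.text,
    )


def transcribe_both(model_dir, audio_path):
    """Transcribe once without and once with word timestamps."""
    # Imported lazily so describe() can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions; adjust to the actual hardware.
    model = WhisperModel(model_dir, device="cpu", compute_type="int8")
    results = {}
    for flag in (False, True):
        # transcribe() returns a lazy generator; iterating it runs the decoding.
        segments, _info = model.transcribe(audio_path, word_timestamps=flag)
        results[flag] = [describe(s) for s in segments]
    return results
```

Calling `transcribe_both(...)` with the converted model directory and the audio file, then printing `results[False]` and `results[True]` side by side, makes differences in `seek`, `start`, and `end` easy to compare.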
When "word_timestamps=False", the output is as follows:
Segment(id=1, seek=600, start=0.0, end=6.0, text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.17912933772260492, compression_ratio=0.6857142857142857, no_speech_prob=1.3633834695708693e-14, words=None)
When it is True, the output looks like this:
Segment(id=1, seek=252, start=np.float64(0.0), end=np.float64(2.52), text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.17907507040283896, compression_ratio=0.6857142857142857, no_speech_prob=1.3633834695708693e-14, words=[Word(start=np.float64(0.0), end=np.float64(2.16), word='सितम्बर', probability=np.float64(0.999978095293045)), Word(start=np.float64(2.16), end=np.float64(2.52), word=' 19', probability=np.float64(0.9993481040000916))])
Segment(id=2, seek=496, start=np.float64(2.52), end=np.float64(4.96), text='सितम्बर', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411], temperature=0.0, avg_logprob=-0.4705956637859344, compression_ratio=0.65625, no_speech_prob=0.02291429601609707, words=[Word(start=np.float64(2.52), end=np.float64(4.96), word='सितम्बर', probability=np.float64(0.741854028776288))])
Segment(id=3, seek=598, start=np.float64(4.96), end=np.float64(5.98), text='सितम्बर 19', tokens=[50364, 45938, 33279, 36158, 48521, 27099, 3941, 105, 25411, 1294], temperature=0.0, avg_logprob=-0.4931728406385942, compression_ratio=0.6857142857142857, no_speech_prob=0.2716793119907379, words=[Word(start=np.float64(4.96), end=np.float64(5.98), word='सितम्बर', probability=np.float64(0.7634602943435311)), Word(start=np.float64(5.98), end=np.float64(5.98), word=' 19', probability=np.float64(4.705471383203985e-06))])
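One way to spot such extra segments programmatically is to look at the fields shown above: in this output, the last word of segment 3 has zero duration and a probability near 0, and `no_speech_prob` rises from about 1e-14 in segment 1 to about 0.27 in segment 3. The following is a minimal sketch of such a check. The `Word` and `Seg` classes are stand-ins for the faster-whisper result objects, and the thresholds are illustrative assumptions, not values recommended by the library.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    start: float
    end: float
    word: str
    probability: float


@dataclass
class Seg:
    id: int
    no_speech_prob: float
    words: list = field(default_factory=list)


def suspicious_reasons(seg, min_word_prob=0.01, max_no_speech=0.2):
    """List the reasons a segment looks like an insertion (empty if none)."""
    reasons = []
    if seg.no_speech_prob > max_no_speech:
        reasons.append("high no_speech_prob")
    for w in seg.words:
        if w.probability < min_word_prob:
            reasons.append(f"low-probability word {w.word!r}")
        if w.end <= w.start:
            reasons.append(f"zero-duration word {w.word!r}")
    return reasons


# Values copied from the three segments above (rounded).
segs = [
    Seg(1, 1.36e-14, [Word(0.0, 2.16, "सितम्बर", 0.99998),
                      Word(2.16, 2.52, " 19", 0.99935)]),
    Seg(2, 0.0229, [Word(2.52, 4.96, "सितम्बर", 0.74185)]),
    Seg(3, 0.2717, [Word(4.96, 5.98, "सितम्बर", 0.76346),
                    Word(5.98, 5.98, " 19", 4.7e-06)]),
]
flagged = {s.id: suspicious_reasons(s) for s in segs if suspicious_reasons(s)}
print(flagged)
```

With these thresholds only segment 3 is flagged. Segment 2 passes the check, so this helps with inspection but is not a complete fix for the extra segments.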
Is there something obviously wrong here?
Hi, I am running faster-whisper on an audio file as follows:
The same code sometimes gives two segments and sometimes one for the same audio file, which I find odd. Is this expected? Whenever it gives two segments, the second is always an insertion: there is no speech there, but the model outputs some words anyway.
However, if I slightly modify the above statement so that it does not output word timestamps, as follows:
I always get only one segment in the output, with good accuracy. Is the presence of "word_timestamps=True" causing this?
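To check whether the segment count really varies between runs, and only when word timestamps are enabled, the transcription can be repeated and the counts tallied. This is a rough sketch: `tally` and `count_segments` are hypothetical helper names, `model` is assumed to be an already loaded `faster_whisper.WhisperModel`, and the audio path and number of runs are placeholders.

```python
from collections import Counter


def tally(counts):
    """Map 'number of segments' to 'how many runs produced that number'."""
    return dict(Counter(counts))


def count_segments(model, audio_path, word_timestamps, runs=10):
    """Transcribe the same file several times and tally the segment counts."""
    counts = []
    for _ in range(runs):
        segments, _info = model.transcribe(
            audio_path, word_timestamps=word_timestamps
        )
        # The result is a generator, so it must be consumed to be counted.
        counts.append(sum(1 for _ in segments))
    return tally(counts)
```

If the tally for `word_timestamps=False` contains only the key 1 while the tally for `word_timestamps=True` contains both 1 and 2, that would isolate the option as the trigger.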