SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

How to obtain timestamps for known text that corresponds to the audio #914

Closed. RichardQin1 closed this issue 2 months ago

RichardQin1 commented 3 months ago

The text is known in advance and matches a segment of the audio.

Example text:

特朗普右耳纏紗布現身
並將在大會上發表全國講話
特朗普表示槍擊事件之後

Given the input (text, test.mp3), the desired output is:

特朗普右耳纏紗布現身    start_time:10000 end_time:12000
並將在大會上發表全國講話    start_time:12000 end_time:15000
特朗普表示槍擊事件之後    start_time:15000 end_time:18000

How can I obtain the start and end timestamps of each sentence?

RichardQin1 commented 3 months ago

Please help! Thanks.

EtienneAb3d commented 3 months ago

See: https://github.com/EtienneAb3d/WhisperTimeSync https://github.com/jianfch/stable-ts?tab=readme-ov-file#alignment

RichardQin1 commented 2 months ago

> See: https://github.com/EtienneAb3d/WhisperTimeSync https://github.com/jianfch/stable-ts?tab=readme-ov-file#alignment

First of all, thank you very much. After trying these tools, I found that I cannot get accurate timestamps for short sentences. Is there a more accurate method?

EtienneAb3d commented 2 months ago

> First of all, thank you very much. After trying, I found that I cannot obtain accurate time for short sentence recognition. Is there a more accurate method

The accuracy depends mainly on Whisper itself. You may try different versions and sizes of Whisper; each may give you different results. In my own experiments, playing with the parameters never really improved the precision.

You may also gain precision by preprocessing the audio, for example with noise filtering or voice compression. See: https://github.com/EtienneAb3d/WhisperHallu

trungkienbkhn commented 2 months ago

@RichardQin1, hello. You can pass word_timestamps=True to transcribe() to get a timestamp for each word of the output transcription. The accuracy still depends on the Whisper model you are using.

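from faster_whisper import WhisperModel

model_path = "large-v3"  # placeholder: a model size name or a local model directory
audio_path = "test.mp3"  # the audio file from the question
model = WhisperModel(model_path)
# word_timestamps=True attaches per-word start/end times (in seconds) to each segment
segments, info = model.transcribe(audio_path, word_timestamps=True)
for segment in segments:
    print("Sentence: [%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))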
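To get from these word timestamps back to the known sentences in the original question, one possible approach is to align the known text against the recognized text character by character and take the first and last matched character of each sentence. The sketch below is only an illustration of that idea, not part of faster-whisper: the helper name align_sentences and the (text, start, end) tuple layout are assumptions, and the matching uses Python's standard difflib rather than any forced-alignment model, so it can only be as accurate as the word timestamps it is given.

```python
from difflib import SequenceMatcher


def align_sentences(words, sentences):
    """Map known sentences onto recognized word timestamps (hypothetical helper).

    words: list of (text, start_sec, end_sec), e.g. built from segment.words
    sentences: list of known sentence strings, in spoken order
    Returns a list of (sentence, start_sec, end_sec); both times are None
    when no character of the sentence could be matched.
    """
    # Flatten the recognized words into characters; each character
    # inherits the start/end time of the word it came from.
    rec_chars, rec_times = [], []
    for text, start, end in words:
        for ch in text:
            if not ch.isspace():
                rec_chars.append(ch)
                rec_times.append((start, end))

    # Flatten the known text the same way, remembering which sentence
    # each character belongs to.
    ref_chars, ref_owner = [], []
    for idx, sentence in enumerate(sentences):
        for ch in sentence:
            if not ch.isspace():
                ref_chars.append(ch)
                ref_owner.append(idx)

    # Matching blocks come back in increasing order, so the first match
    # seen for a sentence gives its start and the last one gives its end.
    spans = [[None, None] for _ in sentences]
    matcher = SequenceMatcher(None, ref_chars, rec_chars, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            owner = ref_owner[block.a + k]
            start, end = rec_times[block.b + k]
            if spans[owner][0] is None:
                spans[owner][0] = start
            spans[owner][1] = end
    return [(s, span[0], span[1]) for s, span in zip(sentences, spans)]


if __name__ == "__main__":
    # Made-up word timings for illustration only; real values would come
    # from (w.word, w.start, w.end) for each w in segment.words.
    demo_words = [
        ("特朗普", 10.0, 10.6), ("右耳", 10.6, 11.0),
        ("纏紗布", 11.0, 11.6), ("現身", 11.6, 12.0),
        ("並將", 12.0, 12.5), ("在大會上", 12.5, 13.4),
        ("發表", 13.4, 14.0), ("全國講話", 14.0, 15.0),
    ]
    demo_sentences = ["特朗普右耳纏紗布現身", "並將在大會上發表全國講話"]
    for sentence, start, end in align_sentences(demo_words, demo_sentences):
        print(sentence, start, end)
```

The times are in seconds, as returned by faster-whisper; multiply by 1000 for the millisecond format used in the question. If the recognized text differs from the known text (for example simplified versus traditional characters), fewer characters will match and the sentence boundaries become less reliable.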