SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

Last Word Removal Issue in Persian ASR Chunks and "segment.words" Being "None" in "faster_whisper/transcribe.py" #903

Closed mlhdeep-ai closed 3 months ago

mlhdeep-ai commented 4 months ago

Hi everyone...

In our ASR project for converting Persian audio to Persian text, we need to divide the audio into fixed-length chunks (e.g., 10 seconds). However, the audio is sometimes split in the middle of the last word, which causes errors such as the last word being deleted or repeated.

To address this, we decided to capture the timestamp of each word in a chunk, cut the last word from the current chunk, and add it to the beginning of the next chunk. However, in the segment returned by the transcribe function from Faster Whisper, the value of the "words" field is None. As a result, we cannot access the timestamps of the words in each chunk.

We have two questions:

  1. How can we resolve the issue of the "words" field being None to access the timestamps of the words in each chunk?
  2. Is there a better solution to prevent the last word in each chunk from being deleted or repeated?

Thank you for your attention.

trungkienbkhn commented 4 months ago

@mlhdeep-ai, hello. You can add the option word_timestamps=True to the transcribe() function:

from faster_whisper import WhisperModel

# jfk_path is the path to the audio file to transcribe.
model = WhisperModel('large-v3', device='cuda')
segments, info = model.transcribe(jfk_path, language="en", word_timestamps=True)
for segment in segments:
    print("Sentence: [%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
    # segment.words is populated only when word_timestamps=True.
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
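
Once word timestamps are available, the carry-over idea from the question could be sketched roughly as below. This is only an illustration under stated assumptions: the Word dataclass, the split_at_last_word helper, and the 0.2 second margin are hypothetical and not part of faster-whisper, and word times are assumed to be relative to the start of the chunk that was transcribed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    # Hypothetical stand-in with the same start/end/word fields printed above.
    start: float
    end: float
    word: str


def split_at_last_word(words: List[Word], chunk_len: float,
                       margin: float = 0.2) -> Tuple[List[Word], float]:
    """Return the words to keep and the time at which the next chunk should start.

    If the last word ends within `margin` seconds of the chunk end, it may have
    been cut off, so it is dropped and the next chunk starts where that word began.
    Otherwise all words are kept and the next chunk starts at the chunk end.
    """
    if words and chunk_len - words[-1].end < margin:
        return words[:-1], words[-1].start
    return words, chunk_len


# Made-up timestamps for a 10-second chunk whose last word touches the boundary.
words = [Word(0.0, 0.4, " And"), Word(0.5, 1.1, " so"), Word(9.7, 10.0, " fel")]
kept, next_start = split_at_last_word(words, chunk_len=10.0)
print("".join(w.word for w in kept), next_start)
```

The returned start time would then be used as the beginning of the next chunk, so the possibly truncated word is transcribed once, in full, instead of being dropped or repeated. The margin is a tuning choice and would need checking against real Persian audio.
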
mlhdeep-ai commented 4 months ago

Thanks for your reply. I have reviewed most of the documentation related to your point, but it didn't resolve the issue in my code.

When I omit the "word_timestamps" argument in the "transcribe" function, the Persian audio file is transcribed into Persian text correctly. However, when I pass word_timestamps=True to "transcribe", "segments" comes back empty, so the transcription is not completed.

Here is my code:

model = WhisperModel(model_path, device, compute_type="float16", local_files_only=True)

segments, _ = model.transcribe(str(audio_path), language=language,
                               word_timestamps=True)

segments = list(segments)  # The transcription will actually run here.
print("\nsegments:")
print(segments)
print("\n")

for segment in segments:
    text_start = segment.start
    text_end = segment.end
    transcription = segment.text
trungkienbkhn commented 4 months ago

@mlhdeep-ai, you need to remove this line: segments = list(segments)  # The transcription will actually run here. Here, segments is a generator object. A generator can only be iterated once because it yields items one at a time and doesn't store them in memory. When you convert a generator to a list, you exhaust it because the conversion iterates over all items.
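
The generator behavior described above can be seen in a minimal sketch that does not need a model. fake_segments() is a hypothetical stand-in for the lazy generator returned by transcribe(), not part of faster-whisper. The sketch also shows that a list built from the generator can be looped over more than once; only the generator object itself is exhausted.

```python
def fake_segments():
    # Hypothetical stand-in for the lazy generator returned by transcribe().
    for text in ["first", "second"]:
        yield text


gen = fake_segments()
first_pass = list(gen)   # iterates over every item, consuming the generator
second_pass = list(gen)  # the same generator has nothing left to yield
print(first_pass, second_pass)

# A list stores its items, so it can be iterated repeatedly.
stored = list(fake_segments())
loop_one = [s for s in stored]
loop_two = [s for s in stored]
print(loop_one == loop_two)
```

If a later loop sees no segments, it is worth checking whether it iterates the original generator object a second time or the list that was built from it.
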