SYSTRAN / faster-whisper

Faster Whisper transcription with CTranslate2
MIT License

word-level timestamps #12

Closed eschmidbauer closed 1 year ago

eschmidbauer commented 1 year ago

Hi, I really appreciate you sharing this implementation. I found it to be very fast with accurate results. However, I do not see word-level timestamps in the result. Are word-level timestamps possible?

guillaumekln commented 1 year ago

Hi,

Word-level timestamps are currently not possible. They usually require extensions to the model that are not implemented at this time.

tohe91 commented 1 year ago

Thank you for the amazing work on this! It would be amazing if word-level timestamps could be implemented in faster-whisper once the word-level timestamps branch is merged into main in whisper.

collynce commented 1 year ago

Just checked out the whisper repo, and the word-level timestamps PR has been merged. It would be great indeed to have the same in faster-whisper.

Great work!

guillaumekln commented 1 year ago

I just pushed an experimental branch implementing word-level timestamps! It would be great if you could test this early.

Note that I implemented exactly the same logic as openai/whisper. So if there is a strange result and openai/whisper has the same result, you should report the issue to openai/whisper and not here.

Here's how you can test this today:

Install the development branch of faster-whisper

pip install --force-reinstall "faster-whisper[conversion] @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/word-level-timestamps.tar.gz"

Install the development build of CTranslate2

  1. Go to this build page
  2. Download the artifact "python-wheels"
  3. Extract the archive
  4. Install the wheel matching your system and Python version, for example:
pip install --force-reinstall ctranslate2-3.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Reconvert the model

The model should be converted again with the latest version of CTranslate2 as the configuration needs to be updated with additional information:

ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2 --copy_files tokenizer.json --quantization float16

Transcribe with word-level timestamps

from faster_whisper import WhisperModel

# Load the reconverted model from the output directory above
model = WhisperModel("whisper-large-v2-ct2")

segments, _ = model.transcribe(audio_path, word_timestamps=True)

for segment in segments:
    print(segment.words)

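For reference, each item in segment.words carries the word text, its start and end times, and a probability. The helper below is a hypothetical sketch showing one way to pretty-print those fields; it uses a plain namedtuple and fabricated sample values so it runs without a model, but the field names mirror faster-whisper's Word tuples.

```python
from collections import namedtuple

# Stand-in for faster-whisper's Word result type (same field names)
Word = namedtuple("Word", ["word", "start", "end", "probability"])

def format_words(words):
    """Render each word as '[start -> end] word (p=probability)'."""
    return [
        f"[{w.start:.2f}s -> {w.end:.2f}s] {w.word} (p={w.probability:.2f})"
        for w in words
    ]

# Fabricated sample data, not real model output
sample = [
    Word(" Hello", 0.00, 0.42, 0.98),
    Word(" world", 0.42, 0.81, 0.95),
]
for line in format_words(sample):
    print(line)
```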
eschmidbauer commented 1 year ago

just tested this with the tiny model and it worked! going to do more tests but this is great, thanks so much for sharing!

eschmidbauer commented 1 year ago

large-v2 seems to work too. Thanks again

Jeronymous commented 1 year ago

When I tested word timestamps on a bunch of files, I saw this error happening in some corner cases:

  File "/usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 531, in add_word_timestamps
    alignment = self.find_alignment(tokenizer, text_tokens, mel, num_frames)
  File "/usr/local/lib/python3.10/site-packages/faster_whisper/transcribe.py", line 598, in find_alignment
    start_times = jump_times[word_boundaries[:-1]]
IndexError: index 1 is out of bounds for axis 0 with size 1

guillaumekln commented 1 year ago

Thank you for testing!

Can you confirm that the same file works without issue in openai/whisper? If so, would it be possible for you to share this input file?

JulianKropp commented 2 months ago

@guillaumekln First of all, this is very nice!

I have a quick question about the probabilities. Do they indicate how likely it is that a word was spoken, or how likely it is that the word was spoken at that specific time in the segment?

I got to this point: https://github.com/SYSTRAN/faster-whisper/blob/d57c5b40b06e59ec44240d93485a95799548af50/faster_whisper/transcribe.py#L1733

Which calls align from CTranslate2

So I think the word_probabilities indicate how likely it is that the word was spoken at the specific time in the segment.
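A word's probability is usually built up from the probabilities of the tokens that make it up. As a rough illustration (not the exact formula faster-whisper uses, which you would need to confirm in transcribe.py), the hypothetical helper below combines per-token probabilities into a single word-level score via the geometric mean, i.e. averaging log-probabilities; the token probabilities here are fabricated.

```python
import math

def word_probability(token_probs):
    """Combine per-token probabilities into one word-level score
    using the geometric mean (equivalent to averaging log-probs)."""
    logs = [math.log(p) for p in token_probs]
    return math.exp(sum(logs) / len(logs))

# Fabricated probabilities for a word tokenized into three pieces
print(round(word_probability([0.9, 0.8, 0.95]), 3))
```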

Do you have any idea how to get the probability that a specific word was spoken, regardless of its timing?