Vaibhavs10 / insanely-fast-whisper

Apache License 2.0
7.61k stars · 533 forks

word timestamps crashes #40

Open eschmidbauer opened 11 months ago

eschmidbauer commented 11 months ago

when specifying word timestamps on a 3m 45s file, I am seeing a crash

insanely-fast-whisper --file-name test.wav --timestamp word
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "insanely-fast-whisper", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "insanely_fast_whisper/cli.py", line 101, in main
    outputs = pipe(
              ^^^^^
  File "transformers/pipelines/automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/base.py", line 1132, in __call__
    return next(
           ^^^^^
  File "transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "transformers/pipelines/automatic_speech_recognition.py", line 552, in _forward
    generate_kwargs["num_frames"] = stride[0] // self.feature_extractor.hop_length
                                    ~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for //: 'tuple' and 'int'
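For context, the crash happens on the batched path: with batching enabled, the pipeline's `stride` is a sequence of per-chunk `(total, left, right)` tuples rather than a single tuple, so `stride[0]` is itself a tuple and the integer division fails. A minimal sketch of the failing expression (the sample values are made up, not the library's actual data):

```python
hop_length = 160  # Whisper feature extractor hop length

# Unbatched: stride is one (total, left, right) tuple of sample counts,
# so stride[0] is an int and the division works.
stride = (480000, 0, 8000)
num_frames = stride[0] // hop_length  # 3000

# Batched: stride is a tuple of per-chunk tuples, so stride[0] is itself
# a tuple and `tuple // int` raises TypeError.
batched_stride = ((480000, 0, 8000), (480000, 8000, 0))
try:
    batched_stride[0] // hop_length
except TypeError as exc:
    print(exc)  # unsupported operand type(s) for //: 'tuple' and 'int'
```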
eschmidbauer commented 11 months ago

this could be related to https://github.com/huggingface/transformers/issues/27446

sanchit-gandhi commented 11 months ago

It is indeed related - will be fixed by https://github.com/huggingface/transformers/pull/26699.

Vaibhavs10 commented 11 months ago

In the meantime @eschmidbauer you should be able to do the following:

insanely-fast-whisper --file-name test.wav --timestamp word --batch-size 1

It'll be slower but would do the job!

Vaibhavs10 commented 11 months ago

(leaving this open till we patch this in transformers)

bluusun commented 10 months ago

Is batch-size 1 24x slower than running the default prompt?

Pranjalya commented 10 months ago

Testing on a T4 GPU: while https://github.com/huggingface/transformers/pull/26699 does fix the issue, GPU memory consumption ramps up significantly. Processing a 3 min 19 s audio file without word timestamps completes in under 30 seconds at the default batch size of 24, using ~9.5 GiB of GPU memory. With word timestamps, batch size 24 causes a GPU OOM; the best I could manage on a 16 GB T4 was batch size 2, which took around 2.5 minutes and still used ~10 GiB.

msj121 commented 10 months ago

@Vaibhavs10 were you able to use word-level timestamps, and what speeds are you getting? @Pranjalya's results seem to show not much speed improvement with word-level timestamps?

Vaibhavs10 commented 9 months ago

Hi @msj121 - This is still open: https://github.com/huggingface/transformers/pull/26699. The team at HF is relatively small, and we're handling quite a lot of maintenance. This is a priority, though. I will make sure to keep you posted.

filip-alexandrov commented 9 months ago

Just fixed by https://github.com/huggingface/transformers/pull/28114.

gnm3000 commented 9 months ago

I didn't find a good solution for this. I need a timestamp for every word.

> Just fixed by huggingface/transformers#28114.

I installed the latest transformers version, but I'm still getting the same error.

ArmykOliva commented 1 month ago

This is still not fixed. It returns an error: ValueError: WhisperFlashAttention2 attention does not support output_attentions
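This error comes from a different constraint: Whisper's word-level timestamps are derived from cross-attention weights, which requires `output_attentions=True`, and Flash Attention 2 does not return attention weights. A possible workaround is to load the model with a non-flash attention implementation; a hedged sketch (the checkpoint name and device are assumptions, not what the CLI necessarily uses):

```python
import torch
from transformers import pipeline

# Load with "eager" attention instead of Flash Attention 2 so the model
# can return the cross-attention weights needed for word timestamps.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # assumed checkpoint
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "eager"},
)

outputs = pipe("test.wav", return_timestamps="word")
```

This trades the Flash Attention speedup for working word timestamps; `"sdpa"` may also work depending on the transformers version.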

Galileon commented 1 month ago

Hello, maybe this will help you. I'm not a Python developer, so the code looks a little funny, but it works:


from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
results = []

for item in segments['chunks']:
    t1, t2 = item['timestamp']
    oritext = item['text']
    orilen = len(oritext)
    timelen = t2 - t1

    # estimate each word's start time from its character offset in the chunk
    for part in tokenizer.tokenize(oritext):
        partindex = oritext.index(part)
        parttime = t1 + partindex * (timelen / orilen)
        # build the dict directly instead of round-tripping through
        # json.loads, which breaks on words containing quote characters
        results.append({"Text": part, "Time": parttime})

# `end - start` (elapsed time) comes from surrounding code not shown here
return results, end - start, segments['text']
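To make the idea concrete, here is a self-contained sketch of the same character-proportional interpolation on a single sample chunk (the chunk data is invented for the example, and the stdlib `re` module stands in for nltk's `RegexpTokenizer`):

```python
import re

def estimate_word_times(chunk):
    """Estimate a start time per word by linear interpolation over characters."""
    t1, t2 = chunk["timestamp"]
    text = chunk["text"]
    duration = t2 - t1
    results = []
    for word in re.findall(r"\w+", text):
        # a word's start time is assumed proportional to its character offset
        start = t1 + text.index(word) * (duration / len(text))
        results.append({"Text": word, "Time": round(start, 3)})
    return results

# Example chunk in the shape the pipeline's "chunks" output uses
chunk = {"timestamp": (0.0, 2.0), "text": " hello world"}
print(estimate_word_times(chunk))
# → [{'Text': 'hello', 'Time': 0.167}, {'Text': 'world', 'Time': 1.167}]
```

Note that `text.index(word)` finds the first occurrence, so a repeated word in one chunk gets the timestamp of its first appearance.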
msj121 commented 1 month ago

@Galileon An interesting approach but keep in mind you are guessing at word level timestamps - a simple solution, but not necessarily good if you want the actual timestamp of the word. You could imagine a dynamic speaker emphasizing different words for example. But I and others appreciate the contribution.

Galileon commented 1 month ago

> @Galileon An interesting approach but keep in mind you are guessing at word level timestamps - a simple solution, but not necessarily good if you want the actual timestamp of the word. You could imagine a dynamic speaker emphasizing different words for example. But I and others appreciate the contribution.

Hi, sure, I know, but if someone doesn't need ultra accuracy it will be fine, as the blocks of text are not crazy long. I will maybe look deeper into the transformers code.