eschmidbauer opened 11 months ago
this could be related to https://github.com/huggingface/transformers/issues/27446
It is indeed related - will be fixed by https://github.com/huggingface/transformers/pull/26699.
In the meantime @eschmidbauer you should be able to do the following:
insanely-fast-whisper --file-name test.wav --timestamp word --batch-size 1
It'll be slower but would do the job!
(leaving this open till we patch this in transformers)
Is batch size 1 24x slower than running with the default settings?
Testing on a T4 GPU: while https://github.com/huggingface/transformers/pull/26699 does fix the issue, GPU memory consumption ramps up significantly. For a 3 min 19 s audio file without word timestamps, it completes with the default batch size of 24 in under 30 seconds, using ~9.5 GiB of GPU memory. With word timestamps, batch size 24 causes a GPU OOM; the best I could manage on a 16 GB T4 was batch size 2, which took around 2.5 minutes and still used ~10 GiB.
@Vaibhavs10 were you able to use word-level timestamps, and what are your speeds? @Pranjalya's results seem to show little speed improvement with word-level timestamps?
Hi @msj121 - This is still open: https://github.com/huggingface/transformers/pull/26699. The team at HF is relatively small, and we're handling quite a lot of maintenance. This is a priority, though. I will make sure to keep you posted.
Just fixed by https://github.com/huggingface/transformers/pull/28114.
I didn't find a good solution for this. I need a timestamp for every word.
I installed the latest transformers version, but I'm still getting the same error.
This is still not fixed. It returns an error: ValueError: WhisperFlashAttention2 attention does not support output_attentions
Hello, maybe it will help you. I'm not a Python developer, so the code looks a little funny, but it works:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

def word_times(segments):
    # Approximate per-word timestamps by linear interpolation
    # across each chunk's character span.
    results = []
    for item in segments['chunks']:
        t1, t2 = item['timestamp'][0], item['timestamp'][1]
        oritext = item['text']
        oriLen = len(oritext)
        timelen = t2 - t1
        parts = tokenizer.tokenize(oritext)
        for part in parts:
            partindex = oritext.index(part)
            perttime = t1 + partindex * (timelen / oriLen)
            results.append({"Text": part, "Time": perttime})
    # (the original snippet also returned an elapsed time, end - start,
    # measured elsewhere in the caller)
    return results, segments['text']
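For anyone who wants to try the idea without pulling in nltk, here is a dependency-free sketch of the same linear-interpolation approach; the function name and the sample chunk are made up for illustration, and `re.findall(r'\w+', ...)` stands in for `RegexpTokenizer(r'\w+')`:

```python
import re

def interpolate_word_times(chunks):
    # Estimate each word's start time by interpolating linearly
    # over the chunk's character span.
    results = []
    for item in chunks:
        t1, t2 = item["timestamp"]
        text = item["text"]
        per_char = (t2 - t1) / len(text)
        for word in re.findall(r"\w+", text):
            # index() finds the first occurrence, so repeated words
            # all map to the first position (same limitation as above)
            results.append({"Text": word, "Time": t1 + text.index(word) * per_char})
    return results

# Hypothetical chunk in the shape the pipeline returns
chunks = [{"text": " hello world", "timestamp": (0.0, 2.0)}]
print(interpolate_word_times(chunks))
```

As noted below, this only guesses timestamps from character positions; a slow or emphatic speaker will drift away from the estimate within a chunk.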
@Galileon An interesting approach but keep in mind you are guessing at word level timestamps - a simple solution, but not necessarily good if you want the actual timestamp of the word. You could imagine a dynamic speaker emphasizing different words for example. But I and others appreciate the contribution.
Hi, sure, I know, but if someone doesn't need ultra accuracy it will be fine, as the blocks of text are not crazy long. I will maybe look deeper into the transformers code.
When specifying word timestamps on a 3m 45s file, I am seeing a crash.