Vaibhavs10 / insanely-fast-whisper

Apache License 2.0

Rookie topic - docker image creation and hosting #102

Open MaximeDde opened 10 months ago

MaximeDde commented 10 months ago

Hi guys, I’ve been desperately trying to host this model in a Google Cloud container. I’m extremely new to all this and need your help… I’ve been trying to add a Flask server that would let me run the commands specified in the README.md file. I adjusted pdm.lock beforehand to include the Flask package, and since I’m working on a Mac, I made sure to build the image with --platform linux/amd64 to avoid the exec format error, but it never seems to work. Could I ask for help in setting this up? Please tell me what details I can provide to help solve this, and apologies in advance if this issue is too vague, but, well… I’m just starting here 😅 Thanks a lot for your awesome work anyhow!
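For anyone attempting the same setup, here is a minimal sketch of the helper a Flask route could call to shell out to the CLI. It assumes the `insanely-fast-whisper` command is installed in the image and that the `--file-name` and `--transcript-path` flags behave as described in the README; the function names `build_cli_command` and `transcribe` are hypothetical and only for illustration.

```python
import subprocess
from pathlib import Path


def build_cli_command(audio_path: str, transcript_path: str = "output.json") -> list[str]:
    """Return the argv list for one CLI run (flag names assumed from the README)."""
    return [
        "insanely-fast-whisper",
        "--file-name", audio_path,
        "--transcript-path", transcript_path,
    ]


def transcribe(audio_path: str, transcript_path: str = "output.json") -> str:
    """Run the CLI and return the transcript file's text.

    check=True raises CalledProcessError if the CLI exits with a non-zero status,
    so the caller (for example a Flask route) can turn that into an HTTP 500.
    """
    subprocess.run(build_cli_command(audio_path, transcript_path), check=True)
    return Path(transcript_path).read_text(encoding="utf-8")
```

A route handler would save the uploaded file to a temporary path and pass that path to `transcribe`. Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.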

Vaibhavs10 commented 10 months ago

Good question. You can actually use the awesome deployment on Replicate by @chenxwh here: https://replicate.com/vaibhavs10/incredibly-fast-whisper

The cog is here: https://github.com/Vaibhavs10/insanely-fast-whisper/pull/42

MaximeDde commented 10 months ago

I actually found the demo, but wasn't able to use it, which is why I was looking for a way to host it myself in the end...

The error I got in the demo was the following:

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/cog/server/worker.py", line 217, in _predict
    result = predict(**payload)
             ^^^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 85, in predict
    outputs = self.pipe(
              ^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 357, in __call__
    return super().__call__(inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1132, in __call__
    return next(
           ^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1046, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 552, in _forward
    generate_kwargs["num_frames"] = stride[0] // self.feature_extractor.hop_length
                                    ~~~~~~~~~~^^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for //: 'tuple' and 'int'

I tried several .mp3 files without any success... I filled in the form with task: transcribe, language: empty, batch_size: empty, timestamp: word, and my HF token...

So I tried the same demo with timestamp: chunk and everything else unchanged, but again I couldn't get anything working... Here's the error I always encounter:
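For readers hitting the same `TypeError`, the sketch below reproduces the failing expression outside the library. The traceback only shows that `stride[0]` was a tuple when it was divided by `hop_length`; the concrete values and the nested layout of `stride` used here are assumptions for illustration, not the pipeline's actual data.

```python
def num_frames(stride, hop_length):
    """Mirror the expression from the traceback: stride[0] // hop_length."""
    return stride[0] // hop_length


HOP_LENGTH = 160  # illustrative value, not read from the model config

# A flat stride whose first element is an int divides cleanly.
flat_stride = (480000, 0, 0)
print(num_frames(flat_stride, HOP_LENGTH))

# A nested stride (assumed shape): its first element is itself a tuple,
# so the floor division raises the TypeError seen in the traceback.
nested_stride = [(480000, 0, 0), (480000, 0, 0)]
try:
    num_frames(nested_stride, HOP_LENGTH)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

This suggests the failure sits in how the input is shaped before the division, which a user of the hosted demo cannot change through the form fields.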

Segmenting the audio clips.
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/cog/server/worker.py", line 217, in _predict
    result = predict(**payload)
             ^^^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 111, in predict
    segmented_transcript = post_process_segments_and_transcripts(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/predict.py", line 231, in post_process_segments_and_transcripts
    upto_idx = np.argmin(np.abs(end_timestamps - end_time))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 1325, in argmin
    return _wrapfunc(a, 'argmin', axis=axis, out=out, **kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.6/lib/python3.11/site-packages/numpy/core/fromnumeric.py", line 59, in _wrapfunc
    return bound(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^^
ValueError: attempt to get argmin of an empty sequence

Which left me with the idea of hosting the model myself to see if that works better... So I guess that if I can find a way to make the demo work, I can use Replicate...? But right now I'm not sure...!

EDIT: I tried several audio files; some seem to work and some don't... But I don't know why :(
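For context on the second error, `np.argmin` raises exactly this `ValueError` whenever it is handed an empty array. The sketch below shows that behaviour and one possible guard. The helper name `closest_index` and the idea that the timestamp array can be empty for some audio files are assumptions; this is not the fix applied in `predict.py`.

```python
import numpy as np


def closest_index(end_timestamps, end_time):
    """Return the index of the timestamp closest to end_time, or None if there are none."""
    end_timestamps = np.asarray(end_timestamps, dtype=float)
    if end_timestamps.size == 0:
        # np.argmin would raise ValueError here, so report "no match" instead.
        return None
    return int(np.argmin(np.abs(end_timestamps - end_time)))


# Unguarded call on an empty array reproduces the error from the traceback.
try:
    np.argmin(np.abs(np.array([]) - 1.0))
except ValueError as exc:
    print(f"ValueError: {exc}")

print(closest_index([], 1.0))
print(closest_index([0.5, 2.0, 3.1], 1.9))
```

If the array of end timestamps is only empty for certain inputs, that would be consistent with some audio files working and others failing.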

Vaibhavs10 commented 10 months ago

Hi @MaximeDde - I'll push out an update today/tomorrow to fix the speaker diarization issues. Sorry about that!

MaximeDde commented 10 months ago

I mean, @Vaibhavs10, you are doing an amazing job and providing it to us for free, so please do not apologize haha. I hope I can give something back if the system I'm building with your model works, that's what I want! :D

Keep me posted, and thanks a lot for your amazing work!

MaximeDde commented 9 months ago

Hey @Vaibhavs10, just wanted to ask if you have any news on this one, if I may...! Eager to use the Replicate deployment of the model asap! :D Take care!

Vaibhavs10 commented 9 months ago

Hi @MaximeDde - This should be fixed on Replicate! :)