JSchmie / ScrAIbe

Tool for automatic transcription and speaker diarization based on whisper and pyannote.
https://jschmie.github.io/ScrAIbe/
GNU General Public License v3.0

Docker image: Model is kept in VRAM after transcription #29

Closed Deathproof76 closed 8 months ago

Deathproof76 commented 8 months ago

Hello again, and by the way thanks for the cool project😊

Using the small Whisper model with "Auto Transcribe" needs almost 11 GB of VRAM. After transcription and diarization are done, the model seems to stay in VRAM at 11 GB. As a "GPU poor" person I ask: would it be possible to flush it automatically after use? And is there also a way to set the beam size?

Edit: this is weird: small uses about 11 GB of VRAM, medium about 9 GB, and large about 11 GB. Does it batch, or maybe use shared RAM?
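For what it's worth, in plain PyTorch flushing a model usually just means dropping all references to it and emptying the CUDA cache, and openai-whisper already accepts a `beam_size` decoding option. A minimal sketch (the file name and model size are only placeholders, I don't know how ScrAIbe holds the model internally):

```python
import gc

import torch
import whisper

# Load the model, transcribe with an explicit beam size, then free it.
model = whisper.load_model("small", device="cuda")
result = model.transcribe("audio.wav", beam_size=5)  # beam_size is a decoding option
print(result["text"])

# Drop all references to the model, then release PyTorch's cached VRAM.
del model
gc.collect()
torch.cuda.empty_cache()
```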

Btw: there are a few other projects, with much less user-friendly web UIs or just an API, that use faster-whisper or insanely-fast-whisper, need less VRAM, and are also faster, mostly by using CTranslate2, batching, BetterTransformer, FlashAttention-2, or distil-whisper models (also available for German).
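For example, the basic faster-whisper usage from its README boils down to something like this (the model size and compute type here are just examples):

```python
from faster_whisper import WhisperModel

# CTranslate2 backend; int8/float16 weights cut VRAM use considerably.
model = WhisperModel("small", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("audio.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```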

I've already used whisper-asr-webservice, which currently doesn't have diarization but is trying to implement it via WhisperX, and wordcab-transcribe, which uses NVIDIA NeMo for diarization. Maybe some of these resources are of use to you?

I have no serious programming knowledge, I just dabble a little bit, but I really like your concept for the web UI. I actually tried something similar with a simple Gradio interface half a year ago, which transcribed, diarized via the wordcab-transcribe API, and also formatted the .json and associated names. But it never worked as robustly as I hoped, and I stopped working on it due to missing time and programming knowledge. Forgive me for this wall of text, I'm just a little bit excited about the possibilities and really glad I found your project😄


Via the wordcab-transcribe API it used about 4 GB of VRAM, with a spike to 10 GB for the first 20 seconds (probably due to diarization), using the large-v2 model, and took about 2:30 min for a 22 min file. So maybe there's some room for improvement?

I also tried insanely-fast-whisper, which simply stacks multiple optimizations, and it took about 33 seconds (same file, transcription only; segmenting added about 1:20 min) with less than 8 GB of VRAM. That's 150 min of audio in less than 5 minutes for transcription + diarization (I have to recheck the exact time/usage).
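As far as I understand, insanely-fast-whisper is essentially the Hugging Face transformers ASR pipeline with fp16 weights, chunking, and batching turned on; roughly like this (the chunk length and batch size are just the values I tried, not recommendations):

```python
import torch
from transformers import pipeline

# fp16 weights plus chunked, batched decoding over long audio.
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    torch_dtype=torch.float16,
    device="cuda:0",
)

out = pipe(
    "audio.wav",
    chunk_length_s=30,    # split long audio into 30 s chunks
    batch_size=24,        # decode several chunks per forward pass
    return_timestamps=True,
)
print(out["text"])
```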

JSchmie commented 8 months ago

There will be a new release quite soon where you can remove the model after use, along with many other optimization options, so stay tuned. Regarding your edit: this will be fixed in the next version.

I know about the fast methods, but currently the plan is not to support every Whisper variant out there. The product aims to be robust, and therefore I am willing to accept the slower model. But I agree that there is room for improvement, and I want to try to implement other models if they are robust and maintained. First, I would like to try model quantization to allow good performance on CPU.
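As a rough illustration of the quantization idea: PyTorch's dynamic quantization can convert the Linear layers to int8 for CPU inference. A minimal sketch (whether this works out of the box with openai-whisper's model class is an assumption, not a tested path):

```python
import torch
import whisper

# Load on CPU, then swap the Linear layers for int8 dynamically quantized ones.
model = whisper.load_model("small", device="cpu")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

result = quantized.transcribe("audio.wav")  # assumes transcribe survives the module swap
print(result["text"])
```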