Dockerimage: Model is kept in vram after transcription

Hallo nochmal und vielen Dank übrigens fürs coole Projekt😊

Using the small whisper model and "Auto Transcribe" needs almost 11gb of Vram. After transcription and diarization is done the model seems to be kept in vram at 11gb Vram. As a "GPU Poor" person I ask: Would it be possible to flush it automatically after use? is there also a way to set beam size maybe?

edit: this is weird, small uses about 11gb of vram, medium about 9gb and large uses 11gb of vram. Does it batch, or maybe share ram?

Btw: There are a few other projects with way less user friendly webuis/or just api that use faster-whisper or insanely-fast-whisper that need less vram and are also faster. Mostly by using ctranslate, batching, bettertransformer, flashattention-2, or whisper-distil models (also german).

I've already used whisper-asr-webservice which currently doesn't have diarization but tries to implement it via whisperx and wordcab-transcribe which uses Nvidia NeMo for diarization. Maybe some of these ressources are of use to you?

I've no serious programming knowledge, I just dabble a little bit. I just really like your concept of the webui. I actually tried something similar with a simple gradio interface half a year ago which transcribes, diarizes via the wordcab-transcribe api and also formats the .json and associates names. It But it never worked as robustly as I hoped and I stopped working on it due to missing time and programming knowledge. Forgive me for this wall of text, I'm just a little bit excited about the possibilities and really glad I found your project😄

Screenshot 2024-01-02 214458

via the wordcab-api it used about 4gb of Vram with spike of 10gb for the first 20 seconds (probably due to diarization) with the largev2 model and took about 2:30 min for a 22min file Screenshot 2024-01-02 215730 so maybe there's some room for improvement?

I also tried via insanely-fast-whisper which simply uses multiple optimizations and it took about 33 seconds (same file only transcription task, segmenting added about 1:20min) with less than 8gb vram. 150min of audio in less than 5 minutes transcription + diarization (have to recheck the exact time/usage).

JSchmie / ScrAIbe

Dockerimage: Model is kept in vram after transcription #29