v3DJG6GL opened this issue 9 months ago
I'd like to point out that it implies energy savings as well.
Wouldn't it be this feature? https://github.com/mudler/LocalAI/pull/1341
Yes, that's the PR I also linked up there.
I have this same problem and would really like this implemented. Can I help at all?
I've found a slimmed-down version of subgen (which is specifically for generating subtitles for Plex, or through bazarr by connecting directly to them) called slim-bazarr-subgen, which pretty much does this: it only connects to bazarr, uses the latest faster-whisper, and takes about 20 seconds for a 22-minute audio file on an RTX 3090 with large distil v3 and int8_bfloat16.
Disclaimer: Not a coder so just guessing and interpreting from limited knowledge.
This slim version seems to use a task-queue approach that more or less "deletes" the model (purges it from VRAM) once it is done with its tasks and then reloads it into VRAM when a new task is queued. Reloading the model takes only a few seconds on my system (most likely depending on whether it sits on an SSD or in /dev/shm, for example), and while it is unloaded the main process only occupies about ~200 MB of VRAM. Maybe someone more knowledgeable could take a look at the main script; it doesn't seem overly complicated to implement for someone with more experience. In comparison, it would take me more than a week of fumbling about, and I sadly don't have the resources to take on that responsibility right now, so I'm counting on you kind strangers out there! 🙏 It would be fantastic to have this implemented in whisper-asr!
Some excerpts from the main script:

```python
def start_model():
    # Lazily re-create the model if a previous purge set it to None
    global model
    if model is None:
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(
            whisper_model,
            download_root=model_location,
            device=transcribe_device,
            cpu_threads=whisper_threads,
            num_workers=concurrent_transcriptions,
            compute_type=compute_type,
        )

# ....

def delete_model():
    # Drop the model reference once the queue is drained so the VRAM can be freed
    if task_queue.qsize() == 0:
        global model
        logging.debug("Queue is empty, clearing/releasing VRAM")
        model = None
        gc.collect()

# ....

    # end of the per-task try block in the worker loop
    finally:
        task_queue.task_done()
        delete_model()
```
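If the goal is an idle timeout rather than (or in addition to) "unload when the queue is empty", the same two helpers could probably be driven by a debounced timer. The following is only a sketch of that idea, not code from slim-bazarr-subgen; `IDLE_SECONDS`, `_idle_timer`, and `touch_model` are names made up for the example:

```python
import threading

IDLE_SECONDS = 300   # made-up knob: how long to keep the model loaded after the last task
_idle_timer = None   # pending timer that will purge the model

def touch_model():
    """Call this whenever a task starts: load the model and push the purge deadline back."""
    global _idle_timer
    start_model()                 # from the excerpt above: lazily (re)loads the model
    if _idle_timer is not None:
        _idle_timer.cancel()      # a newer task arrived, so postpone the purge
    _idle_timer = threading.Timer(IDLE_SECONDS, delete_model)
    _idle_timer.daemon = True
    _idle_timer.start()
```

Since delete_model() above already checks that the queue is empty, a timer that happens to fire while work is still pending would simply do nothing.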
Btw: if you're interested in running slim-bazarr-subgen yourself but are still on Ubuntu 22.04 (I was on 23.10, but the same might apply), here's a modified Dockerfile with an older CUDA version, as you might otherwise run into problems because the newer libs/drivers aren't available:
```dockerfile
FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y python3-pip libcudnn8
RUN apt-get clean

# Strip CUDA packages that are not needed at runtime to slim down the image
RUN apt remove -y --allow-remove-essential cuda-compat-12-3 cuda-cudart-12-3 cuda-cudart-dev-12-3 cuda-keyring cuda-libraries-12-3 cuda-libraries-dev-12-3 cuda-nsight-compute-12-3 cuda-nvml-dev-12-3 cuda-nvprof-12-3 cuda-nvtx-12-3 ncurses-base ncurses-bin e2fsprogs
RUN apt autoremove -y

COPY requirements.txt /requirements.txt
# Install the Python dependencies (add this if entrypoint.sh does not already do it)
RUN pip3 install -r /requirements.txt

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT [ "/entrypoint.sh" ]
```
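With this Dockerfile swapped in, building and running should work the same as with the upstream image, e.g. `docker build -t slim-bazarr-subgen .` and then `docker run --gpus all ...` with whatever flags the upstream project documents; only the base image and the removed CUDA 12.3 packages differ here.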
First of all, thanks for this great project!
Description
I would like to have an option to set an idle time after which the model is unloaded from RAM/VRAM.
Background:
I have several applications that use my GPU's VRAM, one of which is LocalAI. Since I don't have unlimited VRAM, these applications have to share the available memory among themselves. Luckily, LocalAI has for some time now had a watchdog feature that can unload a model after a specified idle timeout. I'd love to have similar functionality in whisper-asr-webservice. Right now, whisper-asr-webservice occupies about a third of my VRAM even though it is only used from time to time.
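To illustrate what such an option could look like from the service's side, here is a rough sketch of a lazy loader plus an idle watchdog thread. This is not whisper-asr-webservice code: the environment variable name MODEL_IDLE_TIMEOUT, the helper names, the 300-second default, and the model/compute settings are all made up for the example, and it assumes the faster-whisper backend:

```python
import gc
import os
import threading
import time

from faster_whisper import WhisperModel  # assumed backend for the sketch

# Hypothetical knob: seconds of inactivity before the model is dropped (0 = never unload)
IDLE_TIMEOUT = int(os.environ.get("MODEL_IDLE_TIMEOUT", "300"))

_model = None
_last_used = 0.0
_lock = threading.Lock()

def get_model():
    """Lazily (re)load the model and remember when it was last used."""
    global _model, _last_used
    with _lock:
        if _model is None:
            _model = WhisperModel("large-v3", device="cuda", compute_type="int8_bfloat16")
        _last_used = time.monotonic()
        return _model

def _idle_watchdog():
    """Background thread: drop the model reference once it has been idle long enough."""
    global _model
    while True:
        time.sleep(30)
        with _lock:
            if _model is not None and time.monotonic() - _last_used > IDLE_TIMEOUT:
                _model = None
                gc.collect()  # let the runtime release the VRAM held by the model

if IDLE_TIMEOUT > 0:
    threading.Thread(target=_idle_watchdog, daemon=True).start()
```

A request that is still transcribing keeps its own reference to the model object, so dropping the global reference only frees the VRAM once the in-flight work has finished.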