v3DJG6GL opened this issue 9 months ago
I'd like to point out that it implies energy savings as well.
Wouldn't it be this feature? https://github.com/mudler/LocalAI/pull/1341
Yes, that's the PR I also linked up there.
I have this same problem and would really like this implemented. Can I help at all?
I've found a slimmed-down version of subgen (which is specifically for generating subtitles for Plex, or through bazarr by connecting directly to them) called slim-bazarr-subgen, which pretty much does this: it only connects to bazarr, uses the latest faster-whisper, and takes about 20 seconds for a 22-minute audio file on an RTX 3090 with large distil v3 and int8_bfloat16.
Disclaimer: Not a coder so just guessing and interpreting from limited knowledge.
This slim version seems to use a task-queue approach that more or less "deletes" the model (purges it from VRAM) once it is done with its tasks and then reloads it into VRAM when a new task is queued. Reloading the model takes only a few seconds on my system (most likely depending on whether it sits on an SSD or in /dev/shm, for example), and while it is unloaded the main process only occupies about ~200 MB of VRAM. Maybe someone more knowledgeable could take a look at the main script; it doesn't seem overly complicated to implement for someone with more experience. In comparison, it would take me more than a week of fumbling about, and I sadly don't have the resources to take on that responsibility right now, so I'm counting on you kind strangers out there! 🙏 It would be fantastic to have this implemented in whisper-asr!
Some excerpts from the main script:

```python
def start_model():
    # Lazily re-create the model if a previous purge set it to None
    global model
    if model is None:
        logging.debug("Model was purged, need to re-create")
        model = stable_whisper.load_faster_whisper(
            whisper_model,
            download_root=model_location,
            device=transcribe_device,
            cpu_threads=whisper_threads,
            num_workers=concurrent_transcriptions,
            compute_type=compute_type,
        )

# ....

def delete_model():
    # Drop the model reference once the queue is drained so the VRAM can be freed
    if task_queue.qsize() == 0:
        global model
        logging.debug("Queue is empty, clearing/releasing VRAM")
        model = None
        gc.collect()

# ....

    # end of the per-task try block in the worker loop
    finally:
        task_queue.task_done()
        delete_model()
```
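If the goal is an idle timeout rather than (or in addition to) "unload when the queue is empty", the same two helpers could probably be driven by a debounced timer. The following is only a sketch of that idea, not code from slim-bazarr-subgen; `IDLE_SECONDS`, `_idle_timer`, and `touch_model` are names made up for the example:

```python
import threading

IDLE_SECONDS = 300   # made-up knob: how long to keep the model loaded after the last task
_idle_timer = None   # pending timer that will purge the model

def touch_model():
    """Call this whenever a task starts: load the model and push the purge deadline back."""
    global _idle_timer
    start_model()                 # from the excerpt above: lazily (re)loads the model
    if _idle_timer is not None:
        _idle_timer.cancel()      # a newer task arrived, so postpone the purge
    _idle_timer = threading.Timer(IDLE_SECONDS, delete_model)
    _idle_timer.daemon = True
    _idle_timer.start()
```

Since delete_model() above already checks that the queue is empty, a timer that happens to fire while work is still pending would simply do nothing.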
Btw: if you're interested in running slim-bazarr-subgen yourself but are still on Ubuntu 22.04 (I was on 23.10, but the same might apply), here's a modified Dockerfile with an older CUDA version, as you might otherwise run into problems because the newer libs/drivers aren't available:
```dockerfile
FROM nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get -y upgrade
RUN apt-get install -y python3-pip libcudnn8
RUN apt-get clean

# Strip CUDA packages that are not needed at runtime to slim down the image
RUN apt remove -y --allow-remove-essential cuda-compat-12-3 cuda-cudart-12-3 cuda-cudart-dev-12-3 cuda-keyring cuda-libraries-12-3 cuda-libraries-dev-12-3 cuda-nsight-compute-12-3 cuda-nvml-dev-12-3 cuda-nvprof-12-3 cuda-nvtx-12-3 ncurses-base ncurses-bin e2fsprogs
RUN apt autoremove -y

COPY requirements.txt /requirements.txt
# Install the Python dependencies (add this if entrypoint.sh does not already do it)
RUN pip3 install -r /requirements.txt

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT [ "/entrypoint.sh" ]
```
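With this Dockerfile swapped in, building and running should work the same as with the upstream image, e.g. `docker build -t slim-bazarr-subgen .` and then `docker run --gpus all ...` with whatever flags the upstream project documents; only the base image and the removed CUDA 12.3 packages differ here.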
First of all, thanks for this great project!
Description
I would like to have an option to set an idle time after which the model is unloaded from RAM/VRAM.
Background:
I have several applications that use my GPU's VRAM, one of which is LocalAI. Since I don't have unlimited VRAM, these applications have to share the available memory among themselves. Luckily, LocalAI has for some time now had a watchdog feature that can unload a model after a specified idle timeout. I'd love to have similar functionality in whisper-asr-webservice. Right now, whisper-asr-webservice occupies about a third of my VRAM even though it is only used from time to time.
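To illustrate what such an option could look like from the service's side, here is a rough sketch of a lazy loader plus an idle watchdog thread. This is not whisper-asr-webservice code: the environment variable name MODEL_IDLE_TIMEOUT, the helper names, the 300-second default, and the model/compute settings are all made up for the example, and it assumes the faster-whisper backend:

```python
import gc
import os
import threading
import time

from faster_whisper import WhisperModel  # assumed backend for the sketch

# Hypothetical knob: seconds of inactivity before the model is dropped (0 = never unload)
IDLE_TIMEOUT = int(os.environ.get("MODEL_IDLE_TIMEOUT", "300"))

_model = None
_last_used = 0.0
_lock = threading.Lock()

def get_model():
    """Lazily (re)load the model and remember when it was last used."""
    global _model, _last_used
    with _lock:
        if _model is None:
            _model = WhisperModel("large-v3", device="cuda", compute_type="int8_bfloat16")
        _last_used = time.monotonic()
        return _model

def _idle_watchdog():
    """Background thread: drop the model reference once it has been idle long enough."""
    global _model
    while True:
        time.sleep(30)
        with _lock:
            if _model is not None and time.monotonic() - _last_used > IDLE_TIMEOUT:
                _model = None
                gc.collect()  # let the runtime release the VRAM held by the model

if IDLE_TIMEOUT > 0:
    threading.Thread(target=_idle_watchdog, daemon=True).start()
```

A request that is still transcribing keeps its own reference to the model object, so dropping the global reference only frees the VRAM once the in-flight work has finished.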