Bug Description
As found in https://github.com/clamsproject/app-whisper-wrapper/issues/24#issuecomment-2372234824, when a CLAMS app runs in HTTP + production mode (app.py --production) with CUDA device support, it runs over the gunicorn WSGI server with multiple workers. It seems that some torch-based CLAMS apps running under this scenario spawn multiple Python processes and load multiple copies of the torch model into memory, eventually resulting in OOM errors.
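For illustration, here is a minimal sketch of the pattern that produces the problem; it is not the actual CLAMS SDK or whisper-wrapper code, and it assumes the openai-whisper and Flask APIs purely to make the failure mode concrete.

```python
# Hypothetical sketch: a torch-backed model loaded at module import time.
import whisper
from flask import Flask, request, jsonify

app = Flask(__name__)

# This line runs once per gunicorn worker process. With e.g.
# `gunicorn --workers 4 app:app`, four Python processes each hold
# their own full copy of the weights in VRAM.
model = whisper.load_model("large", device="cuda")

@app.route("/", methods=["POST"])
def annotate():
    body = request.get_json()
    # Each request uses this worker's copy; across several workers the
    # copies add up until the GPU runs out of memory.
    result = model.transcribe(body["audio_path"])
    return jsonify({"text": result["text"]})
```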
Reproduction steps
1. Send multiple POST (annotate) requests simultaneously or with short gaps between them (a reproduction sketch follows this list).
2. Watch VRAM saturation via e.g. nvidia-smi or a similar monitoring utility.
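The following is a hypothetical reproduction helper, not part of the repository; the port, route, and input file name are assumptions and should be adjusted to the app under test. Run it while watching nvidia-smi in another terminal.

```python
# Fire several annotate POSTs at a locally running CLAMS app at once.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:5000/"  # assumed host/port/route of the running app

with open("input.mmif") as f:   # an input MMIF document; file name is an assumption
    payload = f.read()

def post(i):
    r = requests.post(URL, data=payload,
                      headers={"Content-Type": "application/json"})
    return i, r.status_code

with ThreadPoolExecutor(max_workers=8) as pool:
    for i, status in pool.map(post, range(8)):
        print(f"request {i}: HTTP {status}")
```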
Expected behavior
The app should reuse the already-loaded checkpoint/model in memory. Instead, the app loads the model for each request and then doesn't release it after the request has been processed.
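A minimal sketch of the reuse pattern described above, within a single worker process, is a lazily initialized process-wide cache; the names here (get_model, _MODEL) are hypothetical and the openai-whisper API is assumed only for illustration.

```python
# Hypothetical per-process model cache: load once, reuse for every request.
import threading
import whisper

_MODEL = None
_LOCK = threading.Lock()

def get_model(name: str = "large", device: str = "cuda"):
    """Return a process-wide singleton model, loading it on first use."""
    global _MODEL
    with _LOCK:
        if _MODEL is None:
            _MODEL = whisper.load_model(name, device=device)
    return _MODEL
```

Note that this only deduplicates loads within one process; with multiple gunicorn workers each worker still keeps its own copy, so the worker count (or how GPU work is routed to processes) also matters.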
Additional context
Also, it's very likely that this issue shares the same root cause as https://github.com/clamsproject/app-doctr-wrapper/pull/6.