Bug Description
As found in https://github.com/clamsproject/app-whisper-wrapper/issues/24#issuecomment-2372234824, when a CLAMS app runs in HTTP + production mode (app.py --production) with CUDA device support, it runs over the gunicorn WSGI server with multiple workers. It seems that some torch-based CLAMS apps running under this scenario spawn multiple Python processes and load multiple copies of the torch model into memory, eventually resulting in OOM errors.
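For illustration, here is a minimal sketch of the pattern that produces the problem; it is not the actual CLAMS SDK or whisper-wrapper code, and it assumes the openai-whisper and Flask APIs purely to make the failure mode concrete.

```python
# Hypothetical sketch: a torch-backed model loaded at module import time.
import whisper
from flask import Flask, request, jsonify

app = Flask(__name__)

# This line runs once per gunicorn worker process. With e.g.
# `gunicorn --workers 4 app:app`, four Python processes each hold
# their own full copy of the weights in VRAM.
model = whisper.load_model("large", device="cuda")

@app.route("/", methods=["POST"])
def annotate():
    body = request.get_json()
    # Each request uses this worker's copy; across several workers the
    # copies add up until the GPU runs out of memory.
    result = model.transcribe(body["audio_path"])
    return jsonify({"text": result["text"]})
```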
Reproduction steps
1. Send multiple POST (annotate) requests simultaneously or with short gaps between them (a reproduction sketch follows this list).
2. Watch VRAM saturation via e.g. nvidia-smi or a similar monitoring utility.
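The following is a hypothetical reproduction helper, not part of the repository; the port, route, and input file name are assumptions and should be adjusted to the app under test. Run it while watching nvidia-smi in another terminal.

```python
# Fire several annotate POSTs at a locally running CLAMS app at once.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:5000/"  # assumed host/port/route of the running app

with open("input.mmif") as f:   # an input MMIF document; file name is an assumption
    payload = f.read()

def post(i):
    r = requests.post(URL, data=payload,
                      headers={"Content-Type": "application/json"})
    return i, r.status_code

with ThreadPoolExecutor(max_workers=8) as pool:
    for i, status in pool.map(post, range(8)):
        print(f"request {i}: HTTP {status}")
```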
Expected behavior
The app should reuse the already-loaded checkpoint/model in memory. Instead, the app loads the model for each request and then doesn't release it after the request has been processed.
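A minimal sketch of the reuse pattern described above, within a single worker process, is a lazily initialized process-wide cache; the names here (get_model, _MODEL) are hypothetical and the openai-whisper API is assumed only for illustration.

```python
# Hypothetical per-process model cache: load once, reuse for every request.
import threading
import whisper

_MODEL = None
_LOCK = threading.Lock()

def get_model(name: str = "large", device: str = "cuda"):
    """Return a process-wide singleton model, loading it on first use."""
    global _MODEL
    with _LOCK:
        if _MODEL is None:
            _MODEL = whisper.load_model(name, device=device)
    return _MODEL
```

Note that this only deduplicates loads within one process; with multiple gunicorn workers each worker still keeps its own copy, so the worker count (or how GPU work is routed to processes) also matters.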
Additional context
Also, it's very likely that this issue shares the same root cause as https://github.com/clamsproject/app-doctr-wrapper/pull/6.