SYSTRAN / faster-whisper


Separate GPU Assignment per Model #946


ibrahimdevs commented 3 months ago

I'm developing an API with FastAPI. There are two GPUs on my server, and I want to route each request to a specific GPU. (There will be a queue-and-lock mechanism so each GPU handles requests sequentially.)

```python
from faster_whisper import WhisperModel

# Load one Whisper model per GPU
model0 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path0, device_index=[0])
model1 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path1, device_index=[1])
```
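For context, the routing I have in mind looks roughly like this (a simplified sketch; `transcribe_on` and the round-robin picker are illustrative, not actual code from my service):

```python
import asyncio
import itertools

# One lock per GPU so each device processes one request at a time.
gpu_locks = [asyncio.Lock(), asyncio.Lock()]
models = [model0, model1]
next_gpu = itertools.cycle([0, 1])  # naive round-robin dispatch

def _run_transcription(gpu: int, audio_path: str) -> str:
    # transcribe() returns a lazy generator; consuming it here keeps
    # all the blocking decode work inside the worker thread.
    segments, _ = models[gpu].transcribe(audio_path)
    return " ".join(segment.text for segment in segments)

async def transcribe_on(audio_path: str) -> str:
    gpu = next(next_gpu)
    async with gpu_locks[gpu]:
        # Run the blocking call in a thread so the event loop stays responsive.
        return await asyncio.to_thread(_run_transcription, gpu, audio_path)
```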

The problem is that only the most recently created model works. With the order above, only model1 works and model0 throws this exception:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error an illegal memory access was encountered
```

If I create them in the opposite order, model0 works and model1 throws the same exception:

```python
# Load one Whisper model per GPU, in the opposite order
model1 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path1, device_index=[1])
model0 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path0, device_index=[0])
```

Is there a bug with the two WhisperModel instances sharing static resources, or am I doing something wrong?
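For what it's worth, one way I could test the shared-static-resources theory is to isolate each GPU in its own worker process, so the two CUDA contexts share no process-level state. A rough sketch (the queue plumbing and result handling are simplified):

```python
import multiprocessing as mp
import os

def gpu_worker(gpu_id: int, jobs: mp.Queue, results: mp.Queue) -> None:
    # Restrict this process to a single GPU *before* any CUDA
    # initialization; inside the process the visible GPU is index 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from faster_whisper import WhisperModel  # import after setting the env var
    model = WhisperModel("large-v3", device="cuda", device_index=0)
    while True:
        audio_path = jobs.get()
        if audio_path is None:  # sentinel to shut the worker down
            break
        segments, _ = model.transcribe(audio_path)
        results.put((gpu_id, " ".join(s.text for s in segments)))

if __name__ == "__main__":
    mp.set_start_method("spawn")  # workers must start with a clean CUDA state
    jobs, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=gpu_worker, args=(i, jobs, results)) for i in (0, 1)]
    for w in workers:
        w.start()
```

If a single instance is acceptable, I believe the README also shows passing device_index=[0, 1] to one WhisperModel, which might sidestep the two-instance setup entirely.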

Thanks,