**ibrahimdevs** opened 3 months ago
I'm developing an API with FastAPI. There are 2 GPUs on my server, and I want to route each request to a specific GPU. (There will be a queue and lock mechanism so that each GPU is used sequentially.)
```python
# Load a Whisper model for each GPU
model0 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path0, device_index=[0])
model1 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path1, device_index=[1])
```
The problem is that only the last model I create works. With the definition above, only `model1` works and `model0` throws an exception:
```
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error an illegal memory access was encountered
```
If I create them in the opposite order, as below, `model0` works and `model1` throws the same exception:
```python
# Load a Whisper model for each GPU (reversed creation order)
model1 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path1, device_index=[1])
model0 = WhisperModel("large-v3", device="cuda", compute_type=model_quantization, download_root=model_path0, device_index=[0])
```
Is there a bug with multiple `WhisperModel` instances sharing the same static resources, or am I doing something wrong?
Thanks,