OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Use multiple GPUs to process queue #1816

Open theodufort opened 1 week ago

theodufort commented 1 week ago

I am trying to use both of my GPUs, which are passed through to my Docker container.

```yaml
services:
  faster-whisper-server-cuda:
    image: fedirz/faster-whisper-server:latest-cuda
    build:
      dockerfile: Dockerfile.cuda
      context: .
      platforms:
        - linux/amd64
        - linux/arm64
    restart: unless-stopped
    ports:
      - 8162:8000
    environment:
      - WHISPER__MODEL=deepdml/faster-whisper-large-v3-turbo-ct2
      - WHISPER__INFERENCE_DEVICE=cuda
      - WHISPER__COMPUTE_TYPE=int8
      - WHISPER__NUM_WORKERS=4
      - WHISPER__CPU_THREADS=4
      - WHISPER_DEVICE=cuda
      - DEFAULT_LANGUAGE=en
      - PRELOAD_MODELS=["deepdml/faster-whisper-large-v3-turbo-ct2"]
    volumes:
      - hugging_face_cache:/root/.cache/huggingface
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  hugging_face_cache:
```

I have tried everything, but it will not use more than one GPU, as shown here:

[screenshot]

minhthuc2502 commented 19 hours ago

Consider passing `device_index=[0,1]` when you set up the model in your container.
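For reference, here is a minimal sketch of how `device_index` is passed at the library level, using faster-whisper (the library that faster-whisper-server wraps around CTranslate2). Whether the server image exposes this option through an environment variable is an assumption to verify against that project's documentation; the model name is taken from the compose file above, and the audio path is a placeholder.

```python
# Sketch, assuming direct use of faster-whisper rather than the server image.
# With a list of device indices, CTranslate2 loads one model replica per
# listed GPU and dispatches parallel requests across them; a single
# transcription still runs on one device.
from faster_whisper import WhisperModel

model = WhisperModel(
    "deepdml/faster-whisper-large-v3-turbo-ct2",
    device="cuda",
    device_index=[0, 1],  # place a model replica on GPU 0 and GPU 1
    compute_type="int8",
    num_workers=4,        # workers can then run concurrently on both GPUs
)

# "audio.wav" is a placeholder input file.
segments, info = model.transcribe("audio.wav", language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

Note that the second GPU only helps under concurrent load: one request is never split across devices, so a single stream will still show activity on one GPU only.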