ELS-RD / transformer-deploy

Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
https://els-rd.github.io/transformer-deploy/
Apache License 2.0

Two GPUs are slower than one #156

Open OleksandrKorovii opened 1 year ago

OleksandrKorovii commented 1 year ago

Hi, I run the Triton inference server on two NVIDIA RTX 3090 Ti GPUs with --shm-size 20g. Inference takes about 1.56 s per request. But if I start the server with only one GPU (--gpus '"device=0"'), the same request takes about 860 ms. The input sequence length was 256 tokens. I optimized GPT2-medium with your script:
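For reference, here is roughly how I launch the two configurations. This is a sketch: the Triton image tag, port mappings, and model repository path are placeholders for my local setup, not values from the transformer-deploy docs.

```shell
# Run 1: both GPUs visible to Triton (~1.56 s per request in my tests)
docker run --rm --gpus all --shm-size 20g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v "$PWD/triton_models:/models" \
    nvcr.io/nvidia/tritonserver:22.07-py3 \
    tritonserver --model-repository=/models

# Run 2: only GPU 0 visible (~860 ms per request in my tests)
docker run --rm --gpus '"device=0"' --shm-size 20g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v "$PWD/triton_models:/models" \
    nvcr.io/nvidia/tritonserver:22.07-py3 \
    tritonserver --model-repository=/models
```

Everything else (model repository contents, request payload, sequence length) is identical between the two runs.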

convert_model -m gpt2-medium \
    --backend tensorrt onnx \
    --seq-len 32 512 512 \
    --task text-generation --atol=2