Closed: GetUpAt8 closed this issue 5 days ago
Hi @GetUpAt8, thanks for your question! Unfortunately, our project does not currently implement multi-GPU parallelism for the model at the inference stage. We can offer some ideas for implementing model-parallel inference: you can use pipeline parallelism to place the model's layers on different GPUs (see GPipe for a reference implementation), or use ZeRO-Infinity to keep the model in CPU memory and transfer each layer to the GPU as it is needed for inference.
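As a rough illustration of the offloading idea, here is a minimal sketch (not code from this repository, and assuming each layer takes and returns a single hidden-state tensor): the model stays in CPU memory and each layer is moved to the GPU only while it runs.

```python
import torch

@torch.no_grad()
def offloaded_forward(layers, hidden_states, device="cuda:0"):
    """`layers` is a list of transformer blocks kept on the CPU."""
    hidden_states = hidden_states.to(device)
    for layer in layers:
        layer.to(device)                      # stream this layer onto the GPU
        hidden_states = layer(hidden_states)  # run it
        layer.to("cpu")                       # free GPU memory for the next layer
    return hidden_states.to("cpu")
```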
Hope this helps!
Hi @GetUpAt8, we have now implemented parallel inference on multiple GPUs using tensor_parallel. You can run parallel inference with the latest code in the repository.
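For reference, a minimal sketch of what multi-GPU inference with the `tensor_parallel` package can look like (the model name, device list, and prompt below are placeholders, and the exact usage in this repository may differ):

```python
import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-model-name"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model's weights across two GPUs for tensor-parallel inference.
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

inputs = tokenizer("Hello, world", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```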
I hope it helps!
Thank you so much for your replies and for the parallel inference update! I'll try it.
Hi,
Great work, and thanks for your contribution to LLMs. I've tried using your model both before and after the update [June 06, 2023], and I wonder how to use multiple GPUs for inference.
In the new "translate.sh" there is `export CUDA_VISIBLE_DEVICES=`. I set the value to "3,4", but it still runs only on the single GPU 3.
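For context on the question above: `CUDA_VISIBLE_DEVICES` only controls which GPUs the process can see; it does not by itself spread the model across them, so multi-GPU inference still needs explicit parallelism (e.g. the tensor_parallel support mentioned above). A minimal check, using the device IDs from the question:

```python
import os
import torch

# CUDA_VISIBLE_DEVICES must be set before CUDA is first initialized
# (torch initializes CUDA lazily, on the first torch.cuda call).
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "3,4")

# Both GPU 3 and GPU 4 are now visible to this process ...
print(torch.cuda.device_count())  # expected: 2
# ... but a model loaded with .to("cuda") still lands on a single device,
# so multi-GPU inference needs explicit model parallelism on top of this.
```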