chenwenyan opened this issue 3 months ago
In order to use GPUs, the code needs to be updated to the backend v2 system in upstream llama.cpp. That porting effort was started and is available here:
https://github.com/AutonomicPerfectionist/llama.cpp/tree/mpi-gpu
On that branch, PipeInfer is implemented as its own `pipeinfer` example rather than replacing `speculative`. I have managed to run tests with Nvidia GPUs using it; however, I would not consider it finished, and it still lags behind current upstream llama.cpp.
Can you support tensor parallelism within a GPU node, or does it only leverage pipeline parallelism?
Yes, tensor parallelism within a GPU node is supported. Essentially, the MPI backend wraps whatever other backends are in use, providing communication between nodes but otherwise leaving computation to the wrapped backends. This allows PipeInfer to transparently support GPUs from multiple vendors, as well as any additional features the wrapped backends provide. For example, the CUDA backend supports tensor parallelism and its own pipeline parallelism, so both are automatically supported by the MPI backend and, by extension, PipeInfer.

Bear in mind, however, that the CUDA backend's pipeline parallelism is not visible to PipeInfer because of this transparent, backend-agnostic design. If you want to use PipeInfer with multiple GPUs in a pipeline-parallel configuration, multiple MPI processes would need to be run, one for each GPU, with each GPU dedicated to exactly one of those processes. This limitation could probably be removed in the future with additional updates to the MPI backend.
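To illustrate the "one MPI process per GPU" setup, here is a minimal sketch of a launch wrapper. It assumes Open MPI (which exports `OMPI_COMM_WORLD_LOCAL_RANK`); the binary name `pipeinfer`, the model path, and the rank count are placeholders, not something prescribed by the branch itself:

```bash
#!/usr/bin/env bash
# gpu-wrapper.sh -- pin each local MPI rank to its own GPU, then exec the real binary.
# Relies on Open MPI exporting OMPI_COMM_WORLD_LOCAL_RANK; other launchers expose
# a different variable (e.g. SLURM_LOCALID under Slurm).
export CUDA_VISIBLE_DEVICES="${OMPI_COMM_WORLD_LOCAL_RANK}"
exec "$@"
```

```bash
# Example launch (placeholder paths): two MPI processes, each seeing exactly one GPU.
chmod +x gpu-wrapper.sh
mpirun -np 2 ./gpu-wrapper.sh ./pipeinfer -m models/model.gguf
```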
Thanks! But how can I run PipeInfer on multiple GPUs within a node with tensor parallelism? It seems the README in the repo only covers CPU runs.
The GPU support is present on a different repo for now, due to reproducibility requirements and to maintain the git history of this repo. I should be able to merge the support into this repo in a few weeks. For now, you can find the updated implementation here:
https://github.com/AutonomicPerfectionist/llama.cpp/tree/mpi-gpu/examples/pipeinfer
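If it helps, a rough sketch of fetching and building that branch with MPI and CUDA enabled follows. The CMake option names (`LLAMA_MPI`, `LLAMA_CUBLAS`) are assumptions based on mainline llama.cpp from around that period and may differ on the fork; check its README or CMakeLists.txt:

```bash
# Sketch only: clone the mpi-gpu branch and build with MPI + CUDA support.
# LLAMA_MPI / LLAMA_CUBLAS are the mainline option names of that era and are
# assumed here to apply to this fork as well.
git clone -b mpi-gpu https://github.com/AutonomicPerfectionist/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_MPI=ON -DLLAMA_CUBLAS=ON
cmake --build build --config Release
# The pipeinfer example binary should then be under build/bin/.
```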
Thanks! But I wonder which parameter controls the tensor-parallelism degree?
Passing the `--help` option to either PipeInfer or the regular `main` binary will output a list of available options. I believe tensor parallelism is on by default, and how tensors are split between GPUs is controlled by the `--tensor-split` option.
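As a concrete sketch, a two-GPU split within one node might look like the following. `--tensor-split` and `-ngl` are standard llama.cpp options, but the binary name, model path, and split ratios here are placeholders:

```bash
# Example only: split tensors roughly 60/40 across two GPUs on one node.
# -ngl 99 offloads all layers to the GPUs; adjust the path and ratios to your setup.
./pipeinfer -m models/model.gguf -ngl 99 --tensor-split 0.6,0.4
```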
Hi, I am very interested in your work on PipeInfer! However, the current implementation does not seem to support multiple GPUs. Are there any upcoming plans or suggestions for integrating GPU support with pipelined speculative decoding? I have experimented with various approaches, but so far none of them have worked for me. Thanks a lot!