chenwenyan opened this issue 3 months ago
In order to use GPUs, the code needs to be updated to the backend v2 system in upstream llama.cpp. That porting effort was started and is available here:
https://github.com/AutonomicPerfectionist/llama.cpp/tree/mpi-gpu
On that branch, PipeInfer is implemented as its own `pipeinfer` example rather than replacing `speculative`. I have managed to run tests with Nvidia GPUs using it; however, I would not consider it finished, and it still lags behind current upstream llama.cpp.
Can you support tensor parallelism within a GPU node, or does it only leverage pipeline parallelism?
Yes, tensor parallelism within a GPU node is supported. Essentially, the MPI backend wraps whatever other backends are in use, providing communication between nodes but otherwise leaving computation to the wrapped backends. This allows PipeInfer to transparently support GPUs from multiple vendors, as well as any additional features the wrapped backends provide. For example, the CUDA backend supports tensor parallelism and its own pipeline parallelism, so both are automatically supported by the MPI backend and, by extension, PipeInfer.

Bear in mind, however, that the CUDA backend's pipeline parallelism is not visible to PipeInfer because of this transparent, backend-agnostic design. If you want to use PipeInfer with multiple GPUs in a pipeline-parallel configuration, multiple MPI processes would need to be run, one for each GPU, with each GPU dedicated to exactly one of those processes. This limitation could probably be removed in the future with additional updates to the MPI backend.
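To illustrate the "one MPI process per GPU" setup, here is a minimal sketch of a launch wrapper. It assumes Open MPI (which exports `OMPI_COMM_WORLD_LOCAL_RANK`); the binary name `pipeinfer`, the model path, and the rank count are placeholders, not something prescribed by the branch itself:

```bash
#!/usr/bin/env bash
# gpu-wrapper.sh -- pin each local MPI rank to its own GPU, then exec the real binary.
# Relies on Open MPI exporting OMPI_COMM_WORLD_LOCAL_RANK; other launchers expose
# a different variable (e.g. SLURM_LOCALID under Slurm).
export CUDA_VISIBLE_DEVICES="${OMPI_COMM_WORLD_LOCAL_RANK}"
exec "$@"
```

```bash
# Example launch (placeholder paths): two MPI processes, each seeing exactly one GPU.
chmod +x gpu-wrapper.sh
mpirun -np 2 ./gpu-wrapper.sh ./pipeinfer -m models/model.gguf
```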
Thanks! But how can I run PipeInfer on multiple GPUs within a node with tensor parallelism? It seems the README in the repo only covers CPU runs.
The GPU support is present on a different repo for now, due to reproducibility requirements and to maintain the git history of this repo. I should be able to merge the support into this repo in a few weeks. For now, you can find the updated implementation here:
https://github.com/AutonomicPerfectionist/llama.cpp/tree/mpi-gpu/examples/pipeinfer
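If it helps, a rough sketch of fetching and building that branch with MPI and CUDA enabled follows. The CMake option names (`LLAMA_MPI`, `LLAMA_CUBLAS`) are assumptions based on mainline llama.cpp from around that period and may differ on the fork; check its README or CMakeLists.txt:

```bash
# Sketch only: clone the mpi-gpu branch and build with MPI + CUDA support.
# LLAMA_MPI / LLAMA_CUBLAS are the mainline option names of that era and are
# assumed here to apply to this fork as well.
git clone -b mpi-gpu https://github.com/AutonomicPerfectionist/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_MPI=ON -DLLAMA_CUBLAS=ON
cmake --build build --config Release
# The pipeinfer example binary should then be under build/bin/.
```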
Thanks! But I wonder which parameter controls the tensor-parallelism degree?
Passing the `--help` option to either PipeInfer or the regular `main` binary will output a list of available options. I believe tensor parallelism is on by default, and how tensors are split between GPUs is controlled by the `--tensor-split` option.
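As a concrete sketch, a two-GPU split within one node might look like the following. `--tensor-split` and `-ngl` are standard llama.cpp options, but the binary name, model path, and split ratios here are placeholders:

```bash
# Example only: split tensors roughly 60/40 across two GPUs on one node.
# -ngl 99 offloads all layers to the GPUs; adjust the path and ratios to your setup.
./pipeinfer -m models/model.gguf -ngl 99 --tensor-split 0.6,0.4
```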
Hi, I am very interested in your work on PipeInfer! However, the current implementation does not seem to support multiple GPUs. Are there any upcoming plans or suggestions for integrating GPU support with pipelined speculative decoding? I have experimented with various approaches, but so far none of them have worked for me. Thanks a lot!