
Feature Request: RDMA support for rpc back ends #9493

Open · slavonnet opened 2 weeks ago

slavonnet commented 2 weeks ago

Feature Description

The TCP/IP network stack adds latency and limits frame size. With RDMA, hundreds of backends could communicate at near-local speed, as if running on a single server.

Motivation

It would be useful to synchronize per-layer execution results between backends via RDMA to reduce latency.

Possible Implementation

If you don't have RDMA-capable hardware, you can use the RXE kernel module (Soft-RoCE) to emulate it over ordinary Ethernet:

https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
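For anyone trying the emulation route, here is a minimal sketch (assuming libibverbs is installed; the device name `rxe0` and netdev `eth0` in the comments are placeholders for your own setup) that just confirms the emulated device is visible:

```c
// check_rdma.c -- list RDMA devices visible through libibverbs.
// A minimal sketch to verify that a Soft-RoCE (rxe) device is usable.
//
// Typical Soft-RoCE setup on a recent kernel (names are examples):
//   sudo modprobe rdma_rxe
//   sudo rdma link add rxe0 type rxe netdev eth0
//
// Build: gcc check_rdma.c -o check_rdma -libverbs

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "no RDMA devices found (is rdma_rxe loaded?)\n");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        printf("device %d: %s\n", i, ibv_get_device_name(list[i]));
    }
    ibv_free_device_list(list);
    return 0;
}
```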

rgerganov commented 2 weeks ago

There were other people asking for RDMA support in a recent discussion as well. I don't have such hardware but it's nice to see there is software emulation.

I will try to spend some cycles on this in the near term. Patches are also welcome.
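As a rough illustration of what an RDMA transport for the RPC backend would involve (this is not the llama.cpp API, just a libibverbs sketch; buffer size and queue depths are arbitrary), the verbs resources it would have to manage look roughly like this:

```c
// rdma_setup_sketch.c -- the verbs objects an RDMA transport would need.
// Illustrative only; connection establishment is omitted (see comments).
// Build: gcc rdma_setup_sketch.c -o rdma_setup_sketch -libverbs

#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    ibv_free_device_list(devs);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    // Protection domain: scopes memory registrations and queue pairs.
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    // Register a buffer so the NIC can access it directly; a real
    // transport would register the tensor buffers exchanged per layer.
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
        IBV_ACCESS_REMOTE_WRITE);

    // Completion queue and a reliable-connected queue pair.
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!pd || !mr || !cq || !qp) {
        fprintf(stderr, "resource setup failed\n");
        return 1;
    }

    printf("QP %u ready; rkey for remote writes: 0x%x\n",
           qp->qp_num, mr->rkey);
    // Omitted: exchanging qp_num/rkey/buffer address out of band (e.g.
    // over the existing TCP RPC channel), moving the QP through the
    // INIT/RTR/RTS states, and posting IBV_WR_RDMA_WRITE work requests.

    ibv_destroy_qp(qp); ibv_destroy_cq(cq);
    ibv_dereg_mr(mr); free(buf);
    ibv_dealloc_pd(pd); ibv_close_device(ctx);
    return 0;
}
```

One plausible design choice, under these assumptions: keep the current TCP channel for control messages and use it to exchange the rkey/address pairs, then do the per-layer tensor transfers with one-sided RDMA writes so neither side's CPU touches the data path.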

slavonnet commented 2 weeks ago

> Patches are also welcome

I can't write a patch from scratch, and unfortunately I have no free time right now, but I may be able to help fix bugs later.

I will be able to test CPU inference on 3 servers (512 GB of RAM each, Mellanox ConnectX-3 NICs, and an InfiniBand switch).

Next year I plan to add several servers, each with 3x RTX 4060 Ti GPUs.