ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: RDMA support for RPC backends #9493

Open slavonnet opened 1 month ago

slavonnet commented 1 month ago


Feature Description

The network stack introduces latency and uses small frame sizes. With RDMA, hundreds of backends could exchange data almost as fast as if they were running on a single server.

Motivation

It would be useful to synchronize per-layer execution results between backends over RDMA to reduce latency.

Possible Implementation

If you do not have RDMA-capable hardware, you can use the rxe kernel module (Soft-RoCE) for software emulation:

https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce
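As a quick sanity check, here is a minimal sketch (not part of llama.cpp) that verifies an RDMA device, whether real hardware or the rxe/Soft-RoCE emulation, is visible and usable: it opens the first device, allocates a protection domain, and registers a buffer with remote-write access, which is the basic building block an RDMA-based RPC path would rely on. It assumes libibverbs is installed; the file name and buffer size are arbitrary.

```c
// rdma_check.c (hypothetical file name)
// Build: gcc rdma_check.c -o rdma_check -libverbs
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found (is the rxe module loaded?)\n");
        return 1;
    }
    printf("found %d RDMA device(s), using %s\n",
           num_devices, ibv_get_device_name(devs[0]));

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { perror("ibv_alloc_pd"); return 1; }

    // Register a buffer so a remote peer could RDMA-write into it;
    // the rkey would be exchanged with the peer out of band.
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

If this runs successfully against an rxe device, the same machine should be usable for prototyping RDMA transfers without InfiniBand hardware.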

rgerganov commented 1 month ago

There were other people asking for RDMA support in a recent discussion as well. I don't have such hardware but it's nice to see there is software emulation.

I will try to spend some cycles on this in the near term. Patches are also welcome.

slavonnet commented 1 month ago

> Patches are also welcome

I can't write a patch from scratch, but I may be able to fix bugs later. Unfortunately, I have no free time at the moment.

I will be able to test CPU inference on 3 servers (512 GB of memory each + Mellanox ConnectX-3 + an InfiniBand switch).

And next year I plan to have several servers, each with 3x 4060 Ti.

github-actions[bot] commented 6 days ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

slavonnet commented 6 days ago

@rgerganov Please reopen. The bot auto-closed this issue.