ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Feature Request: It would be convenient and faster if users could specify that the model data used for a RPC-server instance is already available by some fast(er) means (file system GGUF, whatever). #8114

Open ghchris2021 opened 3 days ago

ghchris2021 commented 3 days ago

Prerequisites

Feature Description

It would be convenient and faster if users could specify that the model data used for a RPC-server instance is already available by some fast(er) means (file system GGUF, whatever).

Since the rpc-server systems are all "smart" devices (fully capable PCs), they will often already have the model data available for fast access, either on their local file system or on something that looks like one (NAS, SAN, ...).

Loading a large chunk of model data from such a local source could be orders of magnitude faster than streaming it over the RPC network connection, if such an alternative were available.

Motivation

I have tried the RPC feature, and my initial research / experiments suggest that:

1: There is currently no way (known to me) to have an rpc-server load the model data from its local filesystem instead of receiving it over the network.

2: EDIT: The ethernet a typical "LLM end user" has is roughly one order of magnitude slower than reading mmapped model data from a local file system. For commonly used models it is usually no inconvenience to copy the model file to each host that will run rpc-server, so every host has the fastest possible path to the data. Models ranging up through 70...400+ GB are readily available for local inference, and these are exactly the ones that exceed what a single host can handle and therefore need RPC to be inferenced in a distributed way. Transferring e.g. 50% or 75% of such a model to the other rpc-server hosts is very slow over 1 Gb/s ethernet and still significantly slower than a good SSD even at 10 Gb/s; for example, shipping ~300 GB at ~1 Gb/s (roughly 110 MB/s effective) takes on the order of 45 minutes, while a local NVMe SSD can read the same data in a couple of minutes. So from a performance standpoint this option would give users a substantial and compelling reduction in time-to-first-inference latency on every model load.

So if the "model URI" could include file:// or whatever then it might provide an "easy" 1-3 orders of magnitude improvement in the latency of getting RPC data between the client and server for the model layers.

Possible Implementation

No response

steampunque commented 3 days ago

I agree this is needed. I only RPC to two machines, a 12G 4070 and an 8G 1070, and the load times are frustratingly slow. Then after loading I often get CUDA OOM crashes when the model tries to run, and I have to reload again while backing off NGL, even though there seems to be sufficient memory left after the load. To help with this, the RPC functionality also needs to be able to specify the desired NGL per rpc-server to help avoid these OOMs; the default allocation/partitioning heuristic is not reliable.

Best of all would be if the messaging protocol could send off a "load these layers from the model file" message and let it run asynchronously on all the RPC servers. If an RPC server NACKs that message because it cannot find the file, or for some other reason, fall back to shipping the data over the RPC socket from the host's file as is done today. Then all the RPC servers load in parallel (an N x speedup for N RPC servers), combined with at least a 6x raw bandwidth improvement assuming 6 Gb/s SATA3 vs 1 Gb/s ethernet; that would easily be a full 1-2 orders of magnitude faster for loading big models.
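A minimal sketch of that client-side flow, with everything hypothetical (the callbacks below just stand in for a new "try local load" message and for today's socket transfer; none of this is existing llama.cpp code):

```cpp
// Hypothetical client-side sketch: ask every rpc-server in parallel to load
// its layers from its own copy of the GGUF file, then stream over the socket
// only to the servers that NACK the request.
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// try_local_load: stands in for the new RPC message; returns true if the
//                 server ACKs "loaded my layers from my local copy of gguf_path".
// stream_layers:  stands in for today's behaviour of pushing the tensor
//                 bytes over the RPC socket.
void distribute_model(const std::vector<std::string> & endpoints,
                      const std::string & gguf_path,
                      const std::function<bool(const std::string &, const std::string &)> & try_local_load,
                      const std::function<void(const std::string &, const std::string &)> & stream_layers) {
    // Fire the "load locally" request at all servers at once, so each one
    // reads from its own disk in parallel (N x speedup for N servers).
    std::vector<std::future<bool>> pending;
    pending.reserve(endpoints.size());
    for (const auto & ep : endpoints) {
        pending.push_back(std::async(std::launch::async, try_local_load, ep, gguf_path));
    }

    // Any server that NACKed (no local copy, unreadable file, ...) gets its
    // layers shipped over the socket exactly as today.
    for (std::size_t i = 0; i < endpoints.size(); ++i) {
        if (!pending[i].get()) {
            stream_layers(endpoints[i], gguf_path);
        }
    }
}
```

The point of the NACK path is that hosts without a local copy of the file keep working exactly as they do now, so the feature stays opt-in.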