X-rayLaser / DistributedLLM

Run LLM inference by splitting models into parts and hosting each part on a separate machine. This project is no longer maintained.
MIT License

Question: would this work on RPIs i.e. ARM CPUs? #1

Open stevef1uk opened 8 months ago

stevef1uk commented 8 months ago

Hi,

It is possible to run some LLMs on the new RPIs with 8GB of RAM, but it would be nice to try a larger model running across a number of RPIs, as some of us have quite a few of them available.

It would be nice if the tool supported this LLaMA repo: https://github.com/ggerganov/llama.cpp/tree/gg/phi-2

Regards

Steve

X-rayLaser commented 8 months ago

Hello.

In short, maybe. But it won't do much good, because you won't be able to run newer LLM models whose architecture differs from LLaMA v1. If you haven't checked it out yet, there is an excellent project called Petals that you should try instead: https://github.com/bigscience-workshop/petals.

Longer version: I haven't tested this use case, so I can't tell for sure. This repository actually builds on top of the llama.cpp project, pulling it in as a git submodule at vendor/llama.cpp. However, if you look at the llama.cpp folder under the vendor directory, it points to a very old checkout of the llama.cpp repository, with the last commit made around July 2023. Therefore, the GGUF format is not supported here. The same goes for the newer, better quantization algorithms and other improvements that are now available in the original llama.cpp repo but missing here.
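If you want to verify this yourself, here is a minimal sketch (not part of this repo; it assumes a local clone with the submodule initialized) that prints which llama.cpp commit vendor/llama.cpp is pinned to and when that commit was made:

```python
# Sketch: inspect the pinned vendor/llama.cpp submodule commit and its date.
# Assumes you run it from the root of a DistributedLLM clone after
# `git submodule update --init`.
import subprocess

def pinned_commit_date(submodule_path="vendor/llama.cpp"):
    # `git submodule status <path>` prints the commit hash the superproject
    # pins for that path (possibly prefixed with -, + or U).
    status = subprocess.run(
        ["git", "submodule", "status", submodule_path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    commit = status.split()[0].lstrip("-+U")
    # Ask the submodule checkout for that commit's date.
    date = subprocess.run(
        ["git", "-C", submodule_path, "show", "-s", "--format=%ci", commit],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return commit, date

if __name__ == "__main__":
    commit, date = pinned_commit_date()
    print(f"vendor/llama.cpp is pinned to {commit} (committed {date})")
```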

Also, only the most basic features are supported here. You can convert models to the GGML format, quantize them, generate text with them, and control the generation temperature and repeat penalty, but not much more.
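For context, here is a rough sketch (plain NumPy, not code from this repository) of how those two knobs, temperature and repeat penalty, are typically applied to the logits in llama.cpp-style samplers:

```python
# Sketch of llama.cpp-style sampling knobs; the toy values below are made up.
import numpy as np

def apply_repeat_penalty(logits, recent_tokens, penalty=1.1):
    # Tokens seen in the recent context get penalized: positive logits are
    # divided by the penalty, negative ones multiplied, so repeated tokens
    # become less likely either way.
    logits = logits.copy()
    for tok in set(recent_tokens):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

def sample(logits, temperature=0.8, rng=np.random.default_rng()):
    # Temperature rescales logits before the softmax: lower values sharpen
    # the distribution (more deterministic), higher values flatten it.
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy example: a 5-token vocabulary where token 2 was generated recently.
logits = np.array([1.0, 0.5, 2.0, -0.3, 0.1])
print(sample(apply_repeat_penalty(logits, recent_tokens=[2])))
```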

stevef1uk commented 8 months ago

Thanks for the detailed response. I had found Petals, but the instructions did not look beginner-friendly.