exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

Can Exo run smaller language models on my Mali GPU device with 3GB RAM? #82

Open artistlu opened 1 month ago

artistlu commented 1 month ago

I have a device with a Mali GPU and only 3GB of available RAM. I'm interested in leveraging large language models, but I'm not sure if my hardware would be capable of running them, even with a system like Exo. I have a few questions:

- Can Exo be used to run smaller language models on my hardware? What is the process for doing this?
- For an 8B language model, would I need to use multiple devices? If so, does Exo handle partitioning the model so each device only loads a portion of it?
- Can you provide a high-level overview of how the Exo architecture works?
- Do the devices running Exo need to be on the same local network, or can they be distributed across different networks?

Additionally, I'm not very familiar with the process of loading large language models. I'm concerned about the download size being too large for my device, and I'm not sure how to access the models if my device can't connect to Hugging Face. I was thinking of setting up a local file server, but I'm not very experienced with modifying the code for Exo.

I'd really appreciate it if you could provide some concrete guidance on using Exo to run smaller language models on my Mali GPU device with 3GB of RAM. Any insights into the Exo architecture and how to handle model loading would be very helpful. Thank you!

AlexCheema commented 1 month ago

Interesting question about Mali GPU. In general, if tinygrad supports it, exo supports it. I couldn't see anything about tinygrad specifically supporting Mali GPUs, but it may just work with the OpenCL backend. You can explicitly enable it by passing `GPU=1` when starting exo, i.e. `GPU=1 python3 main.py`.
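Before involving exo at all, it may be worth checking whether OpenCL can see the Mali GPU from Python. Here's a quick standalone sanity check (this assumes `pyopencl` is installed; it's not an exo dependency, just a diagnostic):

```python
# Standalone OpenCL sanity check (assumes `pip install pyopencl`).
# If no platform/device is printed here, tinygrad's GPU=1 backend
# is unlikely to work on this device either.
import pyopencl as cl

for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(f"{platform.name} -> {device.name}")
```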

Yes, exo handles partitioning the model so that each device only keeps part of it in memory and computes on that part.
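As a rough illustration of the idea (a simplified sketch, not exo's actual partitioning code): transformer layers get assigned in proportion to each device's available memory, so a 3GB device would hold only a small slice of an 8B model while a bigger machine holds the rest:

```python
# Illustrative sketch only: split a model's transformer layers across
# devices in proportion to each device's available RAM.
def partition_layers(num_layers, device_mem_gb):
    total = sum(device_mem_gb)
    bounds, start = [], 0
    for mem in device_mem_gb:
        count = round(num_layers * mem / total)
        end = min(start + count, num_layers)
        bounds.append((start, end))
        start = end
    # Give any leftover layers (from rounding) to the last device.
    if start < num_layers:
        s, _ = bounds[-1]
        bounds[-1] = (s, num_layers)
    return bounds

# e.g. a 32-layer model across a 3GB Mali device and a 16GB laptop:
print(partition_layers(32, [3, 16]))  # -> [(0, 5), (5, 32)]
```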

A high-level overview of the exo architecture is explained in the README. I can write something more in-depth if that's helpful?

The devices do need to be on the same local network currently, but we can easily plug in different discovery mechanisms by writing a new networking module (you can see how discovery currently works here: https://github.com/exo-explore/exo/blob/main/exo/networking/grpc/grpc_discovery.py)
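For a sense of what a discovery module has to do, here's a hypothetical UDP-broadcast sketch (the port and message format are made up for illustration; the real implementation is the gRPC discovery module linked above):

```python
# Hypothetical LAN discovery sketch, in the spirit of exo's pluggable
# discovery modules. Port and message schema are invented examples.
import json
import socket
import time

PORT = 52415  # arbitrary example port

def broadcast_presence(node_id: str) -> None:
    """Announce this node to everything on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    msg = json.dumps({"node_id": node_id, "ts": time.time()}).encode()
    sock.sendto(msg, ("255.255.255.255", PORT))

def listen_for_peers(timeout: float = 5.0) -> list[dict]:
    """Collect announcements from other nodes until the timeout expires."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", PORT))
    sock.settimeout(timeout)
    peers = []
    try:
        while True:
            data, addr = sock.recvfrom(4096)
            peers.append({"addr": addr[0], **json.loads(data)})
    except socket.timeout:
        pass
    return peers
```

Supporting devices on different networks would mean swapping this broadcast step for something with a rendezvous point (a known server, or a VPN/overlay network), which is exactly the kind of thing a new networking module could plug in.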

Your device would need to connect to Hugging Face right now to download the model. @mzbac is looking into alternatives; maybe we can support specifying a custom URL to download from. See https://github.com/exo-explore/exo/issues/84
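In the meantime, a common workaround is to download the model files on a machine that can reach Hugging Face, serve them over your LAN (e.g. with `python3 -m http.server`), and stage them onto the constrained device. A sketch of that idea (the server address, model name, file list, and paths below are all examples, not exo configuration):

```python
# Hypothetical fetch from a local file server; this just stages files
# on the device. Server address, model name, and file names are
# invented examples for illustration.
import os
import urllib.request

LOCAL_SERVER = "http://192.168.1.10:8000"  # e.g. `python3 -m http.server`
MODEL = "llama-3-8b"
FILES = ["config.json", "tokenizer.json", "model.safetensors"]

os.makedirs(f"./models/{MODEL}", exist_ok=True)
for name in FILES:
    urllib.request.urlretrieve(f"{LOCAL_SERVER}/{MODEL}/{name}",
                               f"./models/{MODEL}/{name}")
```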