exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

15 nodes, job not divided properly #56

Open magnusviri opened 4 months ago

magnusviri commented 4 months ago

I tried running with 15 nodes, but the job wasn't distributed. Instead, every node did the whole thing and all of their outputs were combined.

Screenshot_2024-07-18_at_10 04 39

Screenshot_2024-07-18_at_10 03 55

AlexCheema commented 4 months ago

Great to see this running on so many nodes!

This looks like a potentially old version, given the lack of the UI panel. What version of the software are you running? I recently fixed an issue that caused this. @magnusviri

magnusviri commented 4 months ago

Now I'm getting no output.

Screenshot 2024-07-22 at 18 16 20

magnusviri commented 4 months ago

I think the reason I got no output above is that some nodes got stuck downloading a file.

Screenshot 2024-07-22 at 18 34 33

The others all showed this.

Screenshot 2024-07-22 at 18 34 19

I removed all my nodes but one, tried again, and it worked. I then added one node at a time, retrying each time: 11 nodes worked, but 12 didn't.

Screenshot 2024-07-22 at 18 31 19

I kept trying and it did work at some point, but it was actually slower than running the LLM on just one node. It reported a higher t/s, but something wasn't right, because in practice it was dramatically slower.

AlexCheema commented 4 months ago

You won't get a speedup if you can fit the entire model on one machine. For now, the main benefit is to run larger models.
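
To illustrate why splitting a model that already fits on one machine does not speed up a single request, here is a minimal sketch. It assumes the model's layers are divided into contiguous ranges in proportion to each node's memory and run one range after another, with a fixed network cost per hop. The function names, the per-layer time, and the hop cost are all hypothetical and are not taken from exo's code.

```python
def split_layers(n_layers, node_memory_gb):
    """Assign contiguous layer ranges to nodes in proportion to their memory.

    Hypothetical helper for illustration; not exo's partitioning code.
    """
    total = sum(node_memory_gb)
    ranges, start, seen = [], 0, 0
    for mem in node_memory_gb:
        seen += mem
        # Cumulative rounding guarantees the last range ends at n_layers.
        end = round(n_layers * seen / total)
        ranges.append((start, end))
        start = end
    return ranges


def per_token_latency_ms(ranges, ms_per_layer, hop_ms):
    """Estimate the time to produce one token for a single request.

    Assumption: layers run strictly in sequence, and each node adds one
    network hop whenever more than one node takes part.
    """
    compute = sum((end - start) * ms_per_layer for start, end in ranges)
    hops = len(ranges) * hop_ms if len(ranges) > 1 else 0.0
    return compute + hops


one_node = per_token_latency_ms(split_layers(32, [32]), 1.0, 5.0)
three_nodes = per_token_latency_ms(split_layers(32, [16, 8, 8]), 1.0, 5.0)
print(one_node, three_nodes)
```

Under these assumptions the total compute is the same in both cases, so adding nodes only adds hop time for a single request. The gain from more nodes is the extra combined memory, which is what makes larger models possible.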

With https://github.com/exo-explore/exo/issues/4 we'll be able to get higher throughput by utilising all the resources in the cluster with true pipeline parallelism. Later, with more advanced parallelism techniques, we'll be able to reduce latency.
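
As a rough illustration of the throughput point, the sketch below compares an idealised cluster that handles one request at a time with one that keeps every stage busy on a different request. The stage times and function names are hypothetical, network hops and pipeline fill time are ignored, and this is not the design proposed in the linked issue.

```python
def throughput_sequential(stage_ms):
    """Tokens per second when only one request is in flight.

    Each stage waits for the previous one, so a token costs the sum
    of all stage times.
    """
    return 1000.0 / sum(stage_ms)


def throughput_pipelined(stage_ms):
    """Tokens per second in steady state with every stage kept busy.

    Assumption: enough concurrent requests exist to fill the pipeline,
    so the slowest stage sets the pace.
    """
    return 1000.0 / max(stage_ms)


stages = [20.0, 20.0, 10.0]  # hypothetical per-stage times in milliseconds
print(throughput_sequential(stages), throughput_pipelined(stages))
```

In this model the time for any single token is still the sum of the stage times, so pipelining raises aggregate throughput across requests without reducing per-request latency.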