Open magnusviri opened 4 months ago
Great to see this running on so many nodes!
This looks like a potentially old version, given the lack of the UI panel. What version of the software are you running? I recently fixed an issue that caused this. @magnusviri
Now I'm getting no output.
I think I got no output above because some nodes got stuck downloading a file.
The others all showed this.
I removed all my nodes but one, tried it, and it worked. Then I added one node at a time and kept trying: 11 worked, but 12 didn't.
I kept trying and it did work at some point, but it was actually slower than running the LLM on just one node. It reported a higher t/s, but something wasn't right, because in practice it was dramatically slower.
You won't get a speedup if you can fit the entire model on one machine. For now, the main benefit is being able to run larger models.
With https://github.com/exo-explore/exo/issues/4 we'll be able to get higher throughput by utilising all the resources in the cluster with true pipeline parallelism. Later, with more advanced parallelism techniques, we'll also be able to reduce latency.
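To make the latency vs. throughput distinction concrete, here is a rough back-of-the-envelope sketch. It is not exo's implementation. It assumes an idealised setup: identical nodes, the model split into equal shards, a fixed per-hop network cost, and no overlap between compute and communication. The function names and the example numbers (0.1 s of compute per token, 10 ms per hop) are made up for illustration.

```python
def sharded_latency_s(num_nodes, total_compute_s, hop_s):
    """Time for ONE token to pass through every shard in sequence.

    Splitting the model does not reduce the total compute for a single
    token (the shards run one after another), and each extra node adds
    a network hop. Assumes identical nodes and a fixed hop cost.
    """
    return total_compute_s + (num_nodes - 1) * hop_s


def pipelined_throughput(num_nodes, total_compute_s, hop_s):
    """Steady-state tokens/s summed over many concurrent requests.

    With pipeline parallelism every node works on a different request
    at the same time, so one token completes per stage time. Assumes
    equal shards and no compute/communication overlap.
    """
    stage_s = total_compute_s / num_nodes
    if num_nodes > 1:
        stage_s += hop_s  # each stage also pays one hop to its neighbour
    return 1.0 / stage_s


if __name__ == "__main__":
    compute_s, hop_s = 0.1, 0.01  # illustrative numbers, not measurements
    for n in (1, 4, 12):
        latency = sharded_latency_s(n, compute_s, hop_s)
        throughput = pipelined_throughput(n, compute_s, hop_s)
        print(f"{n:2d} nodes: latency {latency:.3f} s/token, "
              f"pipelined throughput {throughput:.1f} tokens/s")
```

Under these assumptions, per-token latency only gets worse as nodes are added, which is consistent with a model that fits on one machine not running faster on a cluster. The pipelined figure is an aggregate across concurrent requests, so it does not mean any single request finishes sooner.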
I tried 15 nodes. Instead of distributing the job, it had every node do the whole thing and then combined all of their outputs.