exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0

Refactor mlx model sharding #84

Open mzbac opened 1 month ago

mzbac commented 1 month ago

I'm opening this issue as a discussion point before I create a massive PR to refactor the codebase.

@AlexCheema

AlexCheema commented 1 month ago

Definitely in need of a refactor.

Your suggestions sound good to me.

One thing I’d like to support is async model downloading (I tried recently but stopped because I realised we should refactor before hacking more). Right now it “looks” like it works, but it actually blocks the main thread. As a result, when nodes are downloading large models, they stop broadcasting discovery messages and get dropped by other peers. Perhaps it would be good to add that as part of this refactor.
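To make the failure mode concrete, here is a minimal sketch of the difference between a download that blocks the event loop and one that is offloaded to a worker thread. It is not Exo's actual code: `blocking_download`, `broadcast_presence` and `run` are hypothetical stand-ins, the download is simulated with `time.sleep`, and it assumes an asyncio event loop on Python 3.9+ (for `asyncio.to_thread`).

```python
import asyncio
import time


def blocking_download(seconds: float) -> str:
    """Hypothetical stand-in for a synchronous model download."""
    time.sleep(seconds)
    return "model-weights"


async def broadcast_presence(ticks: list, interval: float) -> None:
    """Hypothetical stand-in for the periodic discovery broadcast."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def run(offload: bool, download_seconds: float = 0.3, interval: float = 0.05):
    """Return the download result and how many broadcasts happened meanwhile."""
    ticks: list = []
    heartbeat = asyncio.create_task(broadcast_presence(ticks, interval))
    await asyncio.sleep(0)  # yield once so the first broadcast goes out
    before = len(ticks)

    if offload:
        # The download runs in a worker thread; the event loop stays free.
        result = await asyncio.to_thread(blocking_download, download_seconds)
    else:
        # The download runs on the event loop thread; nothing else can run.
        result = blocking_download(download_seconds)

    during = len(ticks) - before
    heartbeat.cancel()
    try:
        await heartbeat
    except asyncio.CancelledError:
        pass
    return result, during


if __name__ == "__main__":
    print("blocking: ", asyncio.run(run(offload=False)))
    print("offloaded:", asyncio.run(run(offload=True)))
```

In the blocking case no broadcasts are sent while the download runs, which is the window in which peers would drop the node. In the offloaded case the broadcast loop keeps ticking. A real fix might instead use an async HTTP client; the thread offload is just the smallest change that illustrates the point.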

mzbac commented 1 month ago

That sounds like what the current mlx_lm server does: it starts the server first and loads the model when a request comes in. We should be able to add that.
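As a rough illustration of that "start first, load on first request" pattern, here is a minimal sketch. It is not the mlx_lm server's or Exo's implementation: `LazyModel`, `handle_request` and the `loader` callable are hypothetical names, and the loader is assumed to be a blocking function that returns a non-None model object. Python 3.9+ is assumed for `asyncio.to_thread`.

```python
import asyncio


class LazyModel:
    """Hypothetical wrapper that loads a model the first time it is needed."""

    def __init__(self, loader):
        self._loader = loader  # blocking callable returning a non-None model
        self._model = None
        self._lock = asyncio.Lock()
        self.load_count = 0  # exposed only to show the loader runs once

    async def get(self):
        if self._model is None:
            async with self._lock:
                # Re-check: another request may have loaded it while we waited.
                if self._model is None:
                    self._model = await asyncio.to_thread(self._loader)
                    self.load_count += 1
        return self._model


async def handle_request(model: LazyModel, prompt: str) -> str:
    """Hypothetical request handler; the first call triggers the load."""
    loaded = await model.get()
    return f"{loaded}:{prompt}"


async def main():
    model = LazyModel(lambda: "fake-model")
    replies = await asyncio.gather(
        *(handle_request(model, p) for p in ("a", "b", "c"))
    )
    print(replies, "loads:", model.load_count)


if __name__ == "__main__":
    asyncio.run(main())
```

The lock ensures that concurrent first requests trigger a single load, and running the loader in a worker thread keeps the server responsive while the load is in progress.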

AlexCheema commented 1 month ago

Exo already does that, but downloading the model is blocking - that’s the problem.

mzbac commented 1 month ago

Alright, I'm not very familiar with the Exo codebase at the moment. I will take a close look once I get there :)