Refactor mlx model sharding

exo-explore / exo

Refactor mlx model sharding #84

Open mzbac opened 1 month ago

mzbac commented 1 month ago

Raising this issue as a discussion point before I create a massive PR to refactor the codebase.

Reuse the model structure from mlx_lm as much as possible, similar to https://github.com/mzbac/mlx_sharding/blob/main/shard/server/model/llama.py ✅
Replace the decoder layers that are not part of the shard with identity blocks, so that model weights can be loaded in strict mode: https://github.com/mzbac/mlx_sharding/blob/main/shard/server/model/llama.py#L29-L33 ✅
Reuse mlx_lm's load_model instead of copying and pasting it, now that https://github.com/ml-explore/mlx-examples/pull/899 has been merged.
Support loading from a local path or a Hugging Face model_id.
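
To make the identity-block idea concrete, here is a minimal, framework-agnostic sketch. It is not the code from the linked llama.py: `Shard`, `DecoderLayer`, `IdentityBlock`, `build_layers` and `forward` are hypothetical names, and plain Python classes stand in for what would presumably be `mlx.nn.Module` subclasses in the real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    # Hypothetical shard descriptor: this node owns layers
    # start_layer..end_layer (inclusive) out of n_layers.
    start_layer: int
    end_layer: int
    n_layers: int


class DecoderLayer:
    """Stand-in for a real transformer decoder block."""

    def __init__(self, index: int):
        self.index = index

    def __call__(self, x):
        # Record which layer ran instead of doing real computation.
        return x + [self.index]


class IdentityBlock:
    """Placeholder for a layer owned by another shard: returns its input."""

    def __call__(self, x):
        return x


def build_layers(shard: Shard) -> list:
    # Keep the list at full length so every remaining layer keeps its
    # original index; layers outside the shard become identity blocks.
    return [
        DecoderLayer(i) if shard.start_layer <= i <= shard.end_layer else IdentityBlock()
        for i in range(shard.n_layers)
    ]


def forward(layers: list, x: list) -> list:
    for layer in layers:
        x = layer(x)
    return x
```

With `Shard(start_layer=2, end_layer=3, n_layers=6)`, `forward(build_layers(shard), [])` returns `[2, 3]`: only the layers owned by the shard do any work, and the rest pass the input through unchanged.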

@AlexCheema

AlexCheema commented 1 month ago

Definitely in need of a refactor.

Your suggestions sound good to me.

One thing I’d like to support (I tried recently but stopped because I realised we should refactor before hacking more) is async model downloading. Right now it “looks” like it works, but it actually blocks the main thread. As a result, nodes that are downloading large models stop broadcasting discovery messages and get dropped by other peers. Perhaps it would be good to add that with this refactor.
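
A rough sketch of the non-blocking behaviour described above, assuming the node runs on an asyncio event loop. `download_model_blocking` and `broadcast_discovery` are hypothetical stand-ins, not Exo functions; the blocking download is simulated with `time.sleep`. The blocking call is handed to a worker thread with `asyncio.to_thread` (Python 3.9+), so the broadcast task keeps running while the download is in progress.

```python
import asyncio
import time


def download_model_blocking(model_id: str) -> str:
    # Hypothetical stand-in for a blocking download call.
    time.sleep(0.3)
    return f"/tmp/models/{model_id}"


async def broadcast_discovery(stop: asyncio.Event, beats: list) -> None:
    # Stand-in for the periodic discovery broadcast.
    while not stop.is_set():
        beats.append(time.monotonic())
        await asyncio.sleep(0.05)


async def main() -> tuple:
    stop = asyncio.Event()
    beats: list = []
    broadcaster = asyncio.create_task(broadcast_discovery(stop, beats))

    # Run the blocking call in a worker thread; the event loop stays free
    # to schedule the broadcaster in the meantime.
    path = await asyncio.to_thread(download_model_blocking, "example-model")

    stop.set()
    await broadcaster
    return path, len(beats)
```

Running `asyncio.run(main())` returns the simulated path together with the number of broadcasts sent during the download. Calling `download_model_blocking` directly inside `main` instead would leave the broadcaster with no chance to run until the download finished.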

mzbac commented 1 month ago

That sounds like what the current mlx_lm server does (it starts the server first and loads the model when a request comes in). We should be able to add that.
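
The "start first, load on first request" pattern can be sketched as follows. This is an illustration only, not mlx_lm's actual server code; `LazyModel` and its `loader` argument are hypothetical.

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Defers an expensive load until the model is first requested."""

    def __init__(self, loader: Callable[[], object]):
        self._loader = loader
        self._model: Optional[object] = None
        self._lock = threading.Lock()
        self.load_count = 0  # only here to make the behaviour observable

    def get(self) -> object:
        # The lock ensures concurrent first requests trigger a single load.
        with self._lock:
            if self._model is None:
                self._model = self._loader()
                self.load_count += 1
            return self._model
```

A server would construct `LazyModel(loader)` at startup, which is cheap, and call `get()` inside the request handler. Only the first request pays the loading cost.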

AlexCheema commented 1 month ago

Exo already does that, but downloading the model is blocking - that’s the problem.

mzbac commented 1 month ago

Alright, I'm not very familiar with the Exo codebase at the moment. I will take a closer look once I get there :)