stephanj opened 2 months ago
I'm currently copying the model directories manually from `.cache/huggingface/hub` using AirDrop to the other machines. This works and is obviously faster than downloading from Hugging Face on each machine.
`models--mlx-community--Meta-Llama-3.1-70B-Instruct-4bit`
`models--mlx-community--Meta-Llama-3.1-8B-Instruct-4bit`
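For anyone scripting this instead of using AirDrop, the manual copy amounts to replicating a model's directory into the same location inside the target machine's Hugging Face cache. A minimal sketch under that assumption (the `copy_model` helper is hypothetical, not part of exo):

```python
import shutil
from pathlib import Path

# Default Hugging Face cache location on the source machine
HF_CACHE = Path.home() / ".cache" / "huggingface" / "hub"

def copy_model(model_dir_name: str, dest_cache: Path, src_cache: Path = HF_CACHE) -> Path:
    """Copy one cached model directory (e.g. 'models--mlx-community--...')
    into another Hugging Face cache directory, preserving its layout."""
    src = src_cache / model_dir_name
    dest = dest_cache / model_dir_name
    # dirs_exist_ok lets the copy be re-run to sync any missing files
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

In practice `dest_cache` would be a mounted network share or an rsync/scp target rather than a local path, but the directory layout to preserve is the same.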
This would be great. I think the difficult part is doing this in a way that's compatible with one of exo's core design philosophies: node equality. I don't want a master file server. One way we could do it: nodes first ask all their peers whether they have a model file before going to Hugging Face, and the file is sent p2p.
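The ask-peers-first flow could be sketched like this. Everything here (the `Peer` protocol, the `fetch_model` helper) is a hypothetical illustration, not exo's actual API; the point is that every node runs the same code, so no node is a master:

```python
from typing import Callable, Protocol

class Peer(Protocol):
    def has_model(self, model_id: str) -> bool: ...
    def download_model(self, model_id: str) -> bytes: ...

def fetch_model(model_id: str,
                peers: list,
                hf_download: Callable[[str], bytes]) -> bytes:
    """Ask every peer whether it already has the model; only fall back
    to Hugging Face when no peer can serve it."""
    for peer in peers:
        if peer.has_model(model_id):
            return peer.download_model(model_id)  # p2p transfer on the LAN
    return hf_download(model_id)  # nobody has it yet: first node hits HF
```

Because the fallback is symmetric, whichever node downloads first automatically becomes a source for the others on the next query, with no dedicated server role.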
Related: #80 #70 #16
@AlexCheema Hello, is there a simpler way to load models locally? I can't connect to huggingface, so I can't run this project. Thanks.
There are also tools that could be bolted on, or that people can easily set up themselves, for this scenario. SyncThing and LocalSend come to mind immediately. I think SyncThing can operate in a LAN-only setup and could possibly reuse the peer configuration exo already establishes.
But Dave's Garage did mention that downloading the models took a while even on his super fast network, so I'm guessing Hugging Face has a speed limit.
I think in the future we could use direct p2p transfers.
I think that would work better than having one central server. (Hugging Face would be the only central source.)
I also think we can divide the downloading task amongst multiple peers. Think BitTorrent?
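Dividing one download among peers could work like parallel HTTP range requests in a BitTorrent-ish way: split the file into byte ranges, assign one range per peer, and stitch the pieces back together. A sketch of just the range arithmetic (none of this is exo code):

```python
def split_ranges(total_size: int, n_parts: int) -> list[tuple[int, int]]:
    """Divide [0, total_size) into n_parts contiguous (start, end) byte
    ranges, end-exclusive, so each peer can fetch one piece in parallel."""
    base, extra = divmod(total_size, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first peers
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each range would then map onto an HTTP `Range: bytes=start-end` request issued by a different peer, with the results concatenated in order.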
Implementing a feature to allow LLM copying from the local network instead of downloading from Hugging Face is an excellent way to optimize your setup, especially for multi-node environments. Here's a detailed approach to implement this feature:
1. **Local File Server Setup:** Set up a local file server (e.g. Python's `http.server`, or a more robust solution like nginx) on one of the nodes or a dedicated machine in your local network.
2. **Model Registry:**
3. **Download and Share Process:** Modify the `MLXDynamicShardInferenceEngine` to use the `get_model` function.
4. **Network Configuration:**
5. **Error Handling and Fallback:**
6. **Version Control:**
7. **Security Considerations:**
This implementation allows nodes to check a local registry first, download from a local file server if available, and fall back to Hugging Face only when necessary. The first node to download a model will make it available to all other nodes, significantly reducing bandwidth usage and download times for subsequent nodes.
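Glued together, that check-registry-then-fallback logic might look like the sketch below. The registry format, the `get_model` signature, and the fetch callables are all assumptions for illustration, not exo's actual code:

```python
import json
from pathlib import Path
from typing import Callable

def get_model(model_id: str,
              registry_path: Path,
              local_fetch: Callable[[str], bytes],
              hf_download: Callable[[str], bytes]) -> bytes:
    """Serve from the local network when the registry lists the model;
    otherwise download from Hugging Face and register it for other nodes."""
    registry = json.loads(registry_path.read_text()) if registry_path.exists() else {}
    if model_id in registry:
        try:
            # e.g. an HTTP GET against the local file server from step 1
            return local_fetch(registry[model_id])
        except OSError:
            pass  # local server unreachable: fall back (error-handling step)
    data = hf_download(model_id)
    registry[model_id] = f"{model_id}.safetensors"  # assumed path convention
    registry_path.write_text(json.dumps(registry))
    return data
```

The side effect of registering after the Hugging Face download is what makes the first node's download reusable: every subsequent call on any node finds the entry and stays on the LAN.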