Model only downloads 1 node at a time.

magnusviri commented 2 months ago

I started up 11 nodes. I watched ~/.cache/huggingface/hub/ on all of them. The model, models--mlx-community--Meta-Llama-3-8B-Instruct-4bit, was downloaded one at a time. When one node finished, the next node would start downloading.

stephanj commented 2 months ago

Let's clarify the current (inefficient) implementation using Claude 3.5 Sonnet:

Let's analyze this behavior and consider some potential improvements:

Current Behavior:

- Nodes are downloading the model one at a time.
- Each node is independently fetching the entire model from the Hugging Face Hub.
- There's no coordination or sharing of downloaded resources between nodes.

Potential Issues:

- Network Inefficiency: Multiple full downloads of the same large model file consume unnecessary bandwidth.
- Time Inefficiency: Sequential downloads mean the total setup time increases linearly with the number of nodes.
- Storage Redundancy: Each node storing a full copy of the model may be wasteful, especially if local storage is limited.

Possible Improvements:

1. Parallel Downloads:
- Allow multiple nodes to download simultaneously to reduce total setup time.
- Implement a randomized delay before starting the download to reduce the chance of all nodes starting at exactly the same time.
2. Local Caching and Sharing:
- Implement a local caching mechanism where nodes can share downloaded model files.
- The first node to complete a download could notify others, allowing them to copy from the local network instead of downloading from Hugging Face.
3. Partial Downloads:
- If the model can be split into chunks, each node could download only the portion it needs based on its assigned shard.
- Nodes could then exchange these partial downloads to construct the full model collectively.
4. Centralized Download:
- Designate one node as a "download manager" that fetches the model and distributes it to other nodes.
- This could be combined with a local file server to efficiently share the download.
5. Pre-download Strategy:
- In a controlled environment, consider pre-downloading the model to a shared location before starting the nodes.
6. Peer-to-Peer Distribution:
- Implement a peer-to-peer file sharing protocol (like BitTorrent) for efficient distribution of the model across nodes.
7. Check for Existing Downloads:
- Before initiating a download, check if the model already exists in the cache directory.
- If it exists, verify its integrity and use the cached version instead of re-downloading.
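
As a rough illustration of the "Check for Existing Downloads" and randomized-delay ideas, here is a minimal Python sketch using only the standard library. It assumes the Hugging Face cache layout seen in this thread (a `models--org--name` folder with a `snapshots/` subdirectory). The helper names `repo_cache_dir`, `is_cached`, and `start_delay` are hypothetical and not part of exo or `huggingface_hub`, and the check only looks for files; it does not verify their integrity.

```python
import random
from pathlib import Path


def repo_cache_dir(cache_root: Path, repo_id: str) -> Path:
    """Map 'org/name' to the hub cache folder 'models--org--name'."""
    return cache_root / ("models--" + repo_id.replace("/", "--"))


def is_cached(cache_root: Path, repo_id: str) -> bool:
    """True if at least one snapshot directory exists and is non-empty.

    Presence check only: file integrity is NOT verified here.
    """
    snapshots = repo_cache_dir(cache_root, repo_id) / "snapshots"
    if not snapshots.is_dir():
        return False
    return any(
        revision.is_dir() and any(revision.iterdir())
        for revision in snapshots.iterdir()
    )


def start_delay(max_seconds: float, rng: random.Random) -> float:
    """Random delay in [0, max_seconds] so nodes do not all start at once."""
    return rng.uniform(0.0, max_seconds)
```

A node could call `is_cached(Path.home() / ".cache/huggingface/hub", repo_id)` first and only sleep for `start_delay(...)` and download when it returns `False`.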

Implementation Considerations:

These improvements would require modifications to the model loading process in the MLXDynamicShardInferenceEngine and potentially other parts of the system.
A new module for coordinating downloads and file sharing between nodes might be necessary.
The chosen solution should balance efficiency gains against added complexity and potential new failure modes.

To implement any of these improvements, you would need to modify the model loading logic, potentially add inter-node communication for file sharing, and possibly create new modules for download management and coordination. The specific approach would depend on your system's architecture, network topology, and operational requirements.
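
To make the partial-download idea more concrete, here is a hedged sketch of how a node might work out which weight files it needs for its assigned layer range. It assumes a sharded safetensors checkpoint whose `model.safetensors.index.json` contains a `weight_map` from tensor names to file names, and Llama-style tensor names such as `model.layers.12.self_attn.q_proj.weight`. The function `files_for_layers` is hypothetical and is not how exo actually implements this; for simplicity it keeps every non-layer tensor (embeddings, final norm, output head) on every node.

```python
import re

# Matches the layer index in names like "model.layers.12.self_attn.q_proj.weight".
LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def files_for_layers(weight_map: dict[str, str], start: int, end: int) -> set[str]:
    """Return the weight files holding tensors for layers start..end (inclusive).

    Tensors without a layer index are always included in this sketch.
    """
    needed: set[str] = set()
    for tensor_name, filename in weight_map.items():
        match = LAYER_RE.search(tensor_name)
        if match is None or start <= int(match.group(1)) <= end:
            needed.add(filename)
    return needed


# Toy weight map (made-up file names) standing in for the real index file.
example_map = {
    "model.embed_tokens.weight": "model-00001-of-00004.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
    "model.layers.1.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
    "model.layers.2.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
    "model.norm.weight": "model-00004-of-00004.safetensors",
    "lm_head.weight": "model-00004-of-00004.safetensors",
}
```

The resulting file names could then be passed as `allow_patterns` to `huggingface_hub.snapshot_download` so that only those files are fetched. If a model ships as a single weight file, this approach saves nothing.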

magnusviri commented 2 months ago

Claude is pretty good. 7 and 2 should already exist. At least I think they do.

My priorities are:

1 (Parallel Downloads), 3 (Partial Downloads), then 4, 5, and 6.

austinbv commented 2 months ago

I opened an issue that I think would be good for this: https://github.com/exo-explore/exo/issues/99. We are working on it internally.

AlexCheema commented 1 month ago

This is fixed now. Please reopen if there are still any issues.

exo-explore / exo

Issue #70