exo-explore / exo

Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
GNU General Public License v3.0
10.99k stars, 639 forks

Enhancement: Allow to download shards in series within 1 device #137

Closed: barsuna closed this issue 2 months ago

barsuna commented 2 months ago

Currently a device in a cluster tries to download all of its shards in parallel. For large models this can mean ~15 download threads (e.g. Llama 3.1 70B is 30 large safetensors files; a device that gets half of them runs 15 simultaneous downloads). If the media hosting ~/.cache/huggingface/hub is not an SSD, the files end up laid out on it with very high fragmentation, which results in very low read performance.

(For example, on my system a normal file is read at ~130MB/sec, but files downloaded by exo are read at about 4.5MB/sec until defragmented!)
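To reproduce this kind of comparison, a rough sequential-read timer is enough. The sketch below is only an illustration: `read_throughput_mb_per_s` is a hypothetical helper, not part of exo, and the OS page cache can inflate the number if the file was read recently, so it is best run on a file that has not been touched since boot.

```python
import time


def read_throughput_mb_per_s(path, chunk_size=8 * 1024 * 1024):
    """Read a file front to back and return the average rate in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large shards never sit fully in memory.
        while chunk := f.read(chunk_size):
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return total_bytes / 1_000_000 / elapsed
```

Calling it once on a shard downloaded by exo and once on an ordinary large file on the same disk should show whether the gap is in the same range as the figures above.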

It might be sensible to provide an option/knob to download shards in series (within 1 device). In many cases there is no advantage to parallel download within a device, while there is certainly an advantage to downloading in parallel between devices.
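One way such a knob could look is a concurrency cap around the per-shard download. This is a minimal sketch, not exo's actual implementation: `download_shards`, `max_parallel` and `fake_fetch` are hypothetical names, the shard file names are illustrative, and the real HTTP download is replaced by a short sleep.

```python
import asyncio


async def download_shards(shard_names, fetch, max_parallel=1):
    """Run fetch(name) for every shard, with at most max_parallel in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def bounded_fetch(name):
        # Each task waits here until one of the max_parallel slots is free.
        async with semaphore:
            return await fetch(name)

    # gather() returns results in the same order as shard_names.
    return await asyncio.gather(*(bounded_fetch(n) for n in shard_names))


async def fake_fetch(name):
    # Stand-in for a real download (assumption: the real one is a coroutine).
    await asyncio.sleep(0.01)
    return name


if __name__ == "__main__":
    # Illustrative names for the 15 files one device might be assigned.
    shards = [f"model-{i:05d}-of-00030.safetensors" for i in range(1, 16)]
    done = asyncio.run(download_shards(shards, fake_fetch, max_parallel=1))
    print(f"downloaded {len(done)} shards, one at a time")
```

With `max_parallel=1` the files are written one after another, which gives the filesystem a chance to lay each one out contiguously; a higher value restores parallelism for setups where fragmentation is not a concern.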

AlexCheema commented 2 months ago

Hey, the download part makes sense - many downloads in parallel is slow (easy fix) - but I don't understand what you mean about file read performance? Is that a different issue?

AlexCheema commented 2 months ago

Check #138 and see if that fixes it?

barsuna commented 2 months ago

> Hey, the download part makes sense - many downloads in parallel is slow (easy fix) - but I don't understand what you mean about file read performance? Is that a different issue?

@AlexCheema, thank you for the quick follow-up - indeed, this fixed the issue.

The file read performance point is the following: when we download many files in parallel, these files end up fragmented on the file system, i.e. they cannot be read without extra seek operations. This visibly reduces read/write performance on spinning disks (less so on SSDs). Different filesystems are affected differently; in my case it was NTFS, which turned out to suffer quite badly.

Net result: if at ~100MB/sec it takes, say, 700 seconds to load a 50% shard (70GB) of Llama 3.1, then at 5MB/sec it takes 20x that (about 14,000 seconds, nearly 4 hours) - not really practical anymore.
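The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: `load_time_seconds` is a hypothetical helper, and it assumes 1 GB = 1000 MB and a constant sustained read rate.

```python
def load_time_seconds(size_gb, throughput_mb_per_s):
    """Seconds needed to read size_gb at a sustained rate (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_per_s


shard_gb = 70  # half of Llama 3.1 70B, per the figure quoted above
for rate in (100, 5):
    seconds = load_time_seconds(shard_gb, rate)
    print(f"{rate:>4} MB/s: {seconds:,.0f} s (~{seconds / 3600:.1f} h)")
```

At 100 MB/s this gives 700 seconds; at 5 MB/s it gives 14,000 seconds, a little under four hours.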

I know, I know, who uses spinning disks to keep models etc... but IMO we should at least be able to get the performance of the underlying media. Thank you again for enhancing this so quickly.