barsuna opened this issue 2 months ago
Try SUPPORT_BF16=0, e.g. SUPPORT_BF16=0 python3 main.py
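(For context, here is a minimal sketch - my own, not exo's actual code - of how a toggle like SUPPORT_BF16 is typically read from the environment; the helper name env_flag is hypothetical.)

```python
import os

def env_flag(name: str, default: str = "1") -> bool:
    """Read a boolean toggle such as SUPPORT_BF16 from the environment."""
    return os.environ.get(name, default) not in ("0", "false", "False")

# When SUPPORT_BF16=0 is set, a backend could fall back to fp16/fp32 weights.
if not env_flag("SUPPORT_BF16"):
    print("bf16 disabled; using fp16/fp32 weights instead")
```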
Thanks for the detailed issue - this really helps. Please continue to make issues / comments so we can improve exo to fulfil your use-case.
Thank you @AlexCheema!
Ack on 1. I realize now these are static numbers. If determining these dynamically, it seems sensible to also establish bus bandwidth and GPU memory bandwidth - I imagine the overall performance would influence how big a shard each device gets?
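To illustrate the idea only (this is my sketch, not exo's partitioning code, and all the weights and numbers are made up): shard sizes could be derived from a blended score of memory capacity and measured bandwidth, something like:

```python
from dataclasses import dataclass

@dataclass
class DeviceCaps:
    name: str
    memory_gb: float     # usable GPU memory
    mem_bw_gbps: float   # measured GPU memory bandwidth
    bus_bw_gbps: float   # measured PCIe/interconnect bandwidth

def shard_fractions(devices, mem_weight=0.7, bw_weight=0.3):
    """Split model layers proportionally to a blended (and arbitrary) capability score."""
    def score(d):
        # Arbitrary blend: normalize each term to a rough 0-1 range before weighting.
        return (mem_weight * d.memory_gb / 24
                + bw_weight * 0.5 * (d.mem_bw_gbps / 1000 + d.bus_bw_gbps / 32))
    total = sum(score(d) for d in devices)
    return {d.name: score(d) / total for d in devices}

# Made-up numbers loosely matching the setup described in this thread.
devices = [
    DeviceCaps("4090", 24, 1008, 32),
    DeviceCaps("titan-v-0", 12, 652, 16),
    DeviceCaps("titan-v-1", 12, 652, 16),
]
print(shard_fractions(devices))
```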
On 2: the problem seems to have actually gone away (I'm at c4b261daf178e59ffaabe781815a9116efd923d4) with the original command line CUDA=1 python main.py --max-parallel-downloads 1 --disable-tui --wait-for-peers 1
and now I am able to use 2 nodes together, but if I actually try SUPPORT_BF16=0, it causes python3.12 to segfault. But at least I'm further than I was.
I've managed to get 2 instances on the same host to more or less work in the following way:
(instance 1)
export CUDA_VISIBLE_DEVICES=0
CUDA=1 VISIBLE_DEVICES=0 python main.py --max-parallel-downloads 1 --disable-tui --wait-for-peers 1 --node-id 1111 --broadcast-port 55555
(instance 2)
export CUDA_VISIBLE_DEVICES=1
CUDA=1 VISIBLE_DEVICES=1 python main.py --max-parallel-downloads 1 --disable-tui --wait-for-peers 1 --node-id 2222 --listen-port 55555
The listen-port is needed because otherwise there is a port conflict (the port is already used by instance 1), and the broadcast port is needed because otherwise node 2 doesn't hear from node 1 (though the opposite is not true - it seems grpc allows such asymmetric/unidirectional communication).
Unless I'm off in some completely wrong direction, we may want a cleaner way to run multiple instances per host (there are still issues, e.g. both instances try to bind to the same API port, etc.). A launcher sketch for this is below.
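For what it's worth, here is a sketch of how the two-instance recipe above could be automated (my assumption, not something exo ships): one process per local GPU, each with its own node-id and ports, reusing the flags from the commands above. It does not solve the shared API port problem, and as noted further down it may not scale past two instances.

```python
# Hypothetical launcher: one exo instance per local GPU, each pinned to its own
# GPU and given a unique node-id and listen port.
import os
import subprocess

NUM_GPUS = 2
BASE_PORT = 55555

procs = []
for i in range(NUM_GPUS):
    env = dict(os.environ, CUDA="1", CUDA_VISIBLE_DEVICES=str(i))
    cmd = [
        "python", "main.py",
        "--max-parallel-downloads", "1",
        "--disable-tui",
        "--wait-for-peers", str(NUM_GPUS - 1),
        "--node-id", f"node-{i}",
        "--listen-port", str(BASE_PORT + i),
        "--broadcast-port", str(BASE_PORT),
    ]
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```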
Looking forward to quantization on tinygrad, thank you!
Got it, thank you!
The SUPPORT_BF16 shenanigans need to be fixed properly. It should be as simple as running one command to run exo -- no configuration. I'm working on improving this; could use help on this!
Same with --broadcast-port too. There's an example here where we run 2 nodes on the same host: https://github.com/exo-explore/exo/blob/394935711b3c49ab3467301188ab560b09bedd79/.circleci/config.yml#L19-L25
Thank you @AlexCheema!
On 3 - this approach seems limited to 2 processes; we still need something different for when there are >2 instances. I tried to put each instance in a docker container, but wasn't able to get everything working quickly - it is a little limiting anyway.
Do you plan to stay with 1 instance per GPU and do sharding between the local GPUs, or are you perhaps considering updating the discovery to handle same-host instances?
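One possible direction for same-host discovery (purely my sketch of the idea, not exo's implementation; the port number and message format are made up): let every local instance bind the same UDP discovery port with SO_REUSEPORT and tag its announcements with its node id, so each instance can skip its own messages while still seeing other instances on the same machine.

```python
import json
import socket

DISCOVERY_PORT = 55555  # made-up port for illustration

def make_discovery_socket() -> socket.socket:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # SO_REUSEPORT (Linux) lets several local instances bind the same discovery port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.bind(("", DISCOVERY_PORT))
    return sock

def announce(sock: socket.socket, node_id: str, grpc_port: int) -> None:
    msg = json.dumps({"node_id": node_id, "grpc_port": grpc_port}).encode()
    sock.sendto(msg, ("<broadcast>", DISCOVERY_PORT))

def handle_one(sock: socket.socket, own_id: str) -> None:
    data, addr = sock.recvfrom(4096)
    peer = json.loads(data)
    if peer["node_id"] != own_id:  # ignore our own announcement
        print("discovered peer", peer, "from", addr)
```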
Folks, thank you for a very interesting project! I'm trying some basic scenarios and hit a few snags. Happy to split these into separate issues if needed.
Setup:
node/host1: ubuntu 24.04, 96 GB RAM, nvidia 4090 (24 GB)
node/host2: ubuntu 24.04, 64 GB RAM, 2x nvidia titan-v (12 GB each)
I can successfully run llama3.1 8B when host1 works alone (I get about 8 tokens/sec).
If I start the 2nd host, there are a number of issues:
That is, if I force both sides to CUDA mode I get the error above (point 2); if I do not force them and node2 works in CUDA, then I get the error below:
In the tinygrad repo, the llama3 example supports some quantization - int8 and nf4 - does/can exo also support this? Loading models at 16 bits is a luxury few of us can afford :) I can confirm that I can run llama3.1 70B quantized to 4 bits fully on these 3 GPUs if I put them into a single node - it is my hope to get this running on exo as well (I understand that llama.cpp support is in the works for now). On the same topic - it seems the MLX engine mostly refers to quantized models while tinygrad seems to pull 16-bit weights; any reason for this?
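As a generic illustration of what int8 weight quantization buys (plain numpy, my own sketch - not tinygrad's or exo's code): weights are stored as int8 plus a scale and dequantized on the fly, roughly halving memory versus fp16.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 weights plus one float scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print("fp16 would take", w.size * 2 / 1e6, "MB; int8 takes", q.size / 1e6, "MB")
```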
This is more of a question: is a fallback to CPU supported, or does exo expect everything to load onto GPUs? It seems to be the latter, but I wanted to confirm so I understand the logic better.
Thank you again!