Open fullofcaffeine opened 1 month ago
I don't see anything obviously wrong with your setup. It looks all correct.
The logs suggest a networking issue, and the fact that it generates a few tokens and then stops points the same way. What network are you running on? What are the bandwidth, latency, and jitter between devices like? Can you try pinging between nodes, or running a small network test with iperf3?
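Besides iperf3, a rough latency/jitter probe between two nodes can be scripted directly. This is just a sketch (not part of exo): it measures TCP round trips to a small echo server, which the demo runs locally; in practice you'd run the server half on one node and point `probe_latency` at its address.

```python
import socket
import statistics
import threading
import time

def echo_server(sock):
    """Accept one connection and echo bytes back until the peer closes."""
    conn, _ = sock.accept()
    with conn:
        while data := conn.recv(64):
            conn.sendall(data)

def probe_latency(host, port, rounds=20):
    """Return per-round RTTs in milliseconds for small TCP echoes."""
    rtts = []
    with socket.create_connection((host, port)) as s:
        # Disable Nagle so each tiny message goes out immediately.
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            start = time.perf_counter()
            s.sendall(b"x")
            s.recv(64)
            rtts.append((time.perf_counter() - start) * 1000.0)
    return rtts

if __name__ == "__main__":
    # Demo against a local echo server; point host/port at a peer node instead.
    srv = socket.create_server(("127.0.0.1", 0))
    port = srv.getsockname()[1]
    threading.Thread(target=echo_server, args=(srv,), daemon=True).start()
    rtts = probe_latency("127.0.0.1", port)
    print(f"min {min(rtts):.2f} ms  mean {statistics.mean(rtts):.2f} ms  "
          f"jitter (stdev) {statistics.stdev(rtts):.2f} ms")
```

A large stdev relative to the mean, or occasional multi-hundred-millisecond spikes, would be consistent with the stalls you're seeing.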
Hi Alex! Thanks for the reply.
> I don't see anything obviously wrong with your setup. It looks all correct.
Cool, that's good to read. As a side question, I assume the "0 TFLOPS" shown for the two Linux nodes isn't too important, then?
> What network are you running on?
It's a regular LAN; the boxes all connect over wifi (5 GHz). My router is a Synology RT2600ac, and all nodes are on the same wifi network.
Let me know if you need more info about it or the nodes.
> Can you try pinging or running a small network test with iperf3
I didn't know about this tool. I tried it out; here are the results.
I ran iperf3 as a server on my Mac M2 and then spun up a client from the Quadro RTX 5000 machine:
Mac output:
```
$ iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
iperf3: error - unable to receive parameters from client:
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
Accepted connection from 10.0.4.81, port 38314
[  5] local 10.0.4.39 port 5201 connected to 10.0.4.81 port 38318
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec  48.1 MBytes   402 Mbits/sec
[  5]   1.01-2.01   sec  40.9 MBytes   343 Mbits/sec
[  5]   2.01-3.00   sec  39.9 MBytes   335 Mbits/sec
[  5]   3.00-4.00   sec  42.1 MBytes   353 Mbits/sec
[  5]   4.00-5.00   sec  55.4 MBytes   465 Mbits/sec
[  5]   5.00-6.00   sec  52.8 MBytes   444 Mbits/sec
[  5]   6.00-7.00   sec  53.8 MBytes   451 Mbits/sec
[  5]   7.00-8.00   sec  53.5 MBytes   447 Mbits/sec
[  5]   8.00-9.00   sec  51.2 MBytes   431 Mbits/sec
[  5]   9.00-10.00  sec  53.8 MBytes   450 Mbits/sec
[  5]  10.00-10.02  sec  1.38 MBytes   512 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.02  sec   493 MBytes   412 Mbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #3)
-----------------------------------------------------------
```
Linux output:
```
Connecting to host 10.0.4.39, port 5201
[  5] local 10.0.4.81 port 38318 connected to 10.0.4.39 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  49.9 MBytes   418 Mbits/sec  593   2.34 MBytes
[  5]   1.00-2.00   sec  41.2 MBytes   346 Mbits/sec  352   1.74 MBytes
[  5]   2.00-3.00   sec  40.0 MBytes   336 Mbits/sec    0   1.84 MBytes
[  5]   3.00-4.00   sec  42.5 MBytes   357 Mbits/sec    0   1.91 MBytes
[  5]   4.00-5.00   sec  56.2 MBytes   472 Mbits/sec   37   1.40 MBytes
[  5]   5.00-6.00   sec  52.5 MBytes   440 Mbits/sec    0   1.48 MBytes
[  5]   6.00-7.00   sec  53.8 MBytes   451 Mbits/sec    0   1.55 MBytes
[  5]   7.00-8.00   sec  53.8 MBytes   451 Mbits/sec    0   1.59 MBytes
[  5]   8.00-9.00   sec  51.2 MBytes   430 Mbits/sec    0   1.62 MBytes
[  5]   9.00-10.00  sec  53.8 MBytes   451 Mbits/sec    0   1.64 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   495 MBytes   415 Mbits/sec  982             sender
[  5]   0.00-10.02  sec   493 MBytes   412 Mbits/sec                  receiver

iperf Done.
```
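To put a number on the second-to-second variation, here's a quick calculation over the per-interval sender bitrates copied from the Linux log above (Mbits/sec):

```python
import statistics

# Per-interval sender bitrates (Mbits/sec) copied from the iperf3 log above.
bitrates = [418, 346, 336, 357, 472, 440, 451, 451, 430, 451]

mean = statistics.mean(bitrates)      # ~415 Mbit/s, matching iperf3's sender summary
spread = statistics.pstdev(bitrates)  # absolute second-to-second variation
cv = spread / mean                    # relative variation (coefficient of variation)

print(f"mean {mean:.0f} Mbit/s (~{mean / 8:.0f} MB/s)")
print(f"stdev {spread:.0f} Mbit/s, CV {cv:.0%}")
```

A roughly 10% swing between seconds, plus the 982 retransmits in the sender summary, would not be unusual for wifi; a wired link would typically sit closer to line rate with near-zero retransmits.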
Do you see anything off? Let me know if you need more data.
Thanks!
I have a cluster with 3 machines:
I still couldn't get the Linux nodes to show their TFLOPS; it still shows 0 (zero), but that doesn't seem related to the issue (though maybe I'm wrong?). AFAIK, CUDA is installed and working (via the `nvidia-cuda-toolkit` apt package; I'm using the v560 (open) driver from NVIDIA).

I'm trying to run Llama 3.1 8B. As soon as I open tinychat on any of the nodes and start typing with Llama 8B selected, after a while the RTX 5000 node fails with:
Then the RTX 4000 node fails with:
And finally, here's the log for the M2 node:
I often get only the first few characters of the LLM's answer before it stops.
Any ideas on why these are failing? All nodes are on commit `2b9dec2`. I'm using Python 3.12 on all systems, and I activate the venv before starting. On the Linux systems I start it with `CUDA=1 exo`, and on the Mac with `exo --inference-engine tinygrad`.

Thanks in advance!
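Since the Linux nodes report 0 TFLOPS, one first sanity check is whether the CUDA binaries are even visible on the PATH of the shell that launches exo. This is a hypothetical helper (not exo code) that only checks binary visibility, not whether the driver actually works:

```python
import shutil

def cuda_tools_on_path(tools=("nvcc", "nvidia-smi")):
    """Map each tool name to its resolved path, or None if not found on PATH."""
    return {tool: shutil.which(tool) for tool in tools}

if __name__ == "__main__":
    for tool, path in cuda_tools_on_path().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If `nvcc` or `nvidia-smi` comes back as not found in the same environment where the venv is activated, that could explain the 0 TFLOPS readout even with the driver installed system-wide.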