Open yuezhan0721 opened 3 months ago
Hello, what router are you using?
Thank you for your reply. It's based on the results in your first table. For example, Llama 2 7B: from 34.06 ms to 289.75 ms, an increase of nearly 10x. What factors do you think restrict the communication between devices?
Could you post a link to these results? Normally the synchronization time is very similar during inference, like here.
I was wondering about it as well; I suppose in this case the problem may be a weak router/switch. I used a cheap TP-Link LS1008G switch. It may slow down under heavy load.
The other thing is that the amount of data required to synchronize doesn't grow linearly with the number of parameters (7B, 13B, 70B). What matters most are model parameters like the number of blocks (7B: 32, 70B: 80) or the length of the "dim" vector, etc.
For example, Llama 2 70B on 4 devices with a Q80 buffer requires 14917 kB to synchronize the state. Grok-1 314B needs only 21013 kB, yet it's 4.4x larger (!).
Also, if you look at this report, where the link between nodes was highly efficient, the transfer time on 2 and 4 devices is similar for the 70B model. But the amount of bytes is almost 3x larger for 4 devices (28.50 ms, 14917 kB) than for 2 devices (25.00 ms, 5525 kB).
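To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, a single gather of one `dim`-sized Q80 activation vector per transformer block; the real protocol synchronizes several tensors per block, so the absolute numbers will differ a lot from the tables in this thread. The point is only that the traffic scales with `nLayers * dim`, not with total parameter count.

```python
# Rough illustration of why sync traffic tracks nLayers and dim rather
# than total parameter count. The single-gather-per-block assumption
# and the constants are illustrative, not distributed-llama's protocol.

def approx_sync_kb_per_token(n_layers: int, dim: int, n_devices: int,
                             bytes_per_value: float = 1.0 + 4.0 / 32) -> float:
    """Very rough estimate of data exchanged per generated token.

    Q80 stores 1 byte per value plus one float32 scale per 32 values.
    Each device holds 1/n_devices of the activations and must receive
    the remaining (n_devices - 1) slices per block.
    """
    per_block = dim * bytes_per_value * (n_devices - 1) / n_devices
    return n_layers * per_block / 1000.0

llama7b = approx_sync_kb_per_token(n_layers=32, dim=4096, n_devices=4)
llama70b = approx_sync_kb_per_token(n_layers=80, dim=8192, n_devices=4)
# 70B has ~10x the parameters of 7B, but in this model the sync traffic
# grows only with nLayers * dim: (80 * 8192) / (32 * 4096) = 5x, not 10x.
print(llama7b, llama70b)
```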
Thanks for your explanation.
Thank you. Have you run any experiments on a high-end switch, e.g. a Google Cloud instance? If the result supports the poor-switch hypothesis, then the bottleneck of this repo is not the communication overhead.
@zhengpeirong please check this report (4 x c3d-highcpu-30 / Google Cloud).
For Llama 7B / Q40 Weights Q80 Buffer I got:
Metric | 1 VM | 2 VMs | 4 VMs |
---|---|---|---|
Avg transfer time | 0.19 ms | 7.62 ms | 12.81 ms |
The data needed for the synchronization per 1 token (Q80 Buffer):
Model | 2 devices | 4 devices | 8 devices |
---|---|---|---|
Llama 2 7B | 1112 kB | 2830 kB | 6008 kB |
So yes, if you have a fast enough link between nodes, the communication is not the bottleneck.
Btw: a USB4 link may achieve 10 Gbps. Google Cloud is much, much slower than this.
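Taking the 7B numbers from the table above, the ideal wire time per token on a 10 Gbps link would be small (this ignores latency, framing, and protocol overhead, so it is a lower bound only):

```python
# Ideal wire time for one token's sync data on a 10 Gbps link,
# ignoring latency, framing and protocol overhead.
sync_kb_per_token = 2830          # Llama 2 7B, 4 devices (table above)
link_gbps = 10                    # USB4-class link

bits = sync_kb_per_token * 1000 * 8
ms = bits / (link_gbps * 1e9) * 1000
print(f"{ms:.2f} ms")             # prints "2.26 ms" of pure transfer
```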
Thank you!! I have checked the specifications of the switch: Packet Buffer Memory: 1.5 Mb; Jumbo Frames: 16 KB. Although I don't know the process inside the switch, this "Packet Buffer Memory" may be the key challenge for communication when it's smaller than the size of the synchronization data.
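For scale: switch datasheets usually quote packet buffer in megabits, so 1.5 Mb is only about 187.5 kB shared across all ports. The per-token sync data is never all in flight at once, so this comparison is only suggestive, but the gap is large:

```python
# The LS1008G datasheet quotes 1.5 Mb (megabits) of packet buffer,
# i.e. about 187.5 kB shared across all eight ports.
buffer_kb = 1.5 * 1000 / 8            # 187.5 kB
sync_kb_per_token = 2830              # Llama 2 7B, 4 devices (see the table above)
print(sync_kb_per_token / buffer_kb)  # ~15x the buffer per generated token
```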
@b4rtaz I recommend synchronizing more often with smaller chunks instead of the whole QKV or FFN result. This should improve the transfer time by reducing the communication burst to a level a low-end switch can handle.
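The chunking idea in its simplest form: split a large buffer into bounded pieces and write them out one at a time, instead of one large burst. The function name and the 32 KiB chunk size are hypothetical, not distributed-llama's actual code.

```python
# Minimal sketch of chunked synchronization: send a large buffer as
# several smaller writes so no single burst overwhelms the switch's
# packet buffer. Names and chunk size are hypothetical.

def iter_chunks(payload: bytes, chunk_size: int = 32 * 1024):
    """Yield payload in chunk_size pieces (the last may be shorter)."""
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]

payload = bytes(100_000)
chunks = list(iter_chunks(payload))
assert b"".join(chunks) == payload    # reassembles losslessly
print(len(chunks))                    # prints 4 (chunks of <= 32 KiB)
```

In a real sender each chunk would go through `socket.sendall` (or similar), letting the receiver start draining its buffers before the next chunk arrives.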
Today, I tried a minor adjustment to the order of synchronization:
(current) matmul Q -> matmul K -> matmul V -> sync Q -> sync K -> sync V
VS
(new) matmul Q -> sync Q -> matmul K -> sync K -> matmul V -> sync V
Setup: 4 x Raspberry Pi 5 8GB, Llama 3 8B Q40, Q80 Buffer, TP-Link LS1008G Switch.
Results:
I found that the llamaSyncAttQ, llamaSyncAttK and llamaSyncAttV tasks are all set to TASK_TYPE_INFERENCE. That may affect the transfer time statistics.
Setup: the same as before.
Commit: ad10e18
Test 1:
```
b4rtaz@raspberrypi3:~/distributed-llama $ ./main inference --prompt "The Eiffel Tower is" --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model ../dllama_meta-llama-3-8b_q40.bin --tokenizer ../dllama-llama3-tokenizer.t --steps 64 --workers 10.0.0.4:9999 10.0.0.1:9999 10.0.0.2:9999
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 4
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
⏩ Loaded 6323781632 bytes
...
```
Generated tokens: 64
Avg generation time: 343.39 ms
Avg inference time: 258.80 ms
Avg transfer time: 82.66 ms
Test 2:
Generated tokens: 64
Avg generation time: 347.48 ms
Avg inference time: 257.58 ms
Avg transfer time: 79.97 ms
Test 3:
Generated tokens: 64
Avg generation time: 339.42 ms
Avg inference time: 258.86 ms
Avg transfer time: 78.42 ms
Test 4:
Generated tokens: 64
Avg generation time: 334.41 ms
Avg inference time: 251.34 ms
Avg transfer time: 80.67 ms
Commit: 7f63f9e
Test 1:
Generated tokens: 64
Avg generation time: 329.61 ms
Avg inference time: 252.23 ms
Avg transfer time: 75.52 ms
Test 2:
Generated tokens: 64
Avg generation time: 333.89 ms
Avg inference time: 253.94 ms
Avg transfer time: 78.00 ms
Test 3:
Generated tokens: 64
Avg generation time: 330.98 ms
Avg inference time: 252.69 ms
Avg transfer time: 76.47 ms
Test 4:
Generated tokens: 64
Avg generation time: 327.75 ms
Avg inference time: 247.88 ms
Avg transfer time: 77.30 ms
So we have 80.43 ms for 0.3.0 vs 76.82 ms for 0.3.1 (n=4).
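The two averages can be recomputed directly from the four runs listed above for each commit:

```python
# Recomputing the quoted averages from the four runs per commit.
v030 = [82.66, 79.97, 78.42, 80.67]   # commit ad10e18 (current ordering)
v031 = [75.52, 78.00, 76.47, 77.30]   # commit 7f63f9e (new ordering)

mean = lambda xs: sum(xs) / len(xs)
print(f"{mean(v030):.2f} ms vs {mean(v031):.2f} ms")  # prints "80.43 ms vs 76.82 ms"
```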
My setup looks very non-deterministic. Yesterday I observed an average inference time close to 320.00 ms; today it's close to 250 ms. 🤯 I may have positioned the cooling fan better today. My setup is a bit improvised:
Yesterday I achieved a similar average transfer time with the old version as I did today with the new one. So I think my tests cannot prove or disprove whether this approach is better.