Open yuezhan0721 opened 3 months ago
Hello, what router are you using?
Thank you for your reply. It's based on the results in your first table. For example, Llama 2 7B: from 34.06 ms to 289.75 ms, an increase of nearly 10x. What factors do you think restrict the communication between devices?
Could you post a link to these results? Normally the synchronization time is very similar during inference, like here.
I was wondering about it as well; I suppose in this case the problem may be a weak router/switch. I used a cheap TP-Link LS1008G switch. It may slow down under heavy load.
The other thing is that the amount of data required to synchronize doesn't grow linearly with the number of parameters (7B, 13B, 70B). What matters most are model parameters like the number of blocks (7B: 32, 70B: 80) or the length of the "dim" vector, etc.
For example, Llama 2 70B on 4 devices with a Q80 buffer requires 14917 kB to synchronize the state. Grok-1 314B needs only 21013 kB, yet it's 4.4x larger (!).
Also, if you look at this report, where the link between nodes was highly efficient, the transfer time on 2 and 4 devices is similar for the 70B model. But the amount of bytes is almost 3x larger for 4 devices (28.50 ms, 14917 kB) than for 2 devices (25.00 ms, 5525 kB).
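To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, a single gather of one `dim`-sized Q80 activation vector per transformer block; the real protocol synchronizes several tensors per block, so the absolute numbers will differ a lot from the tables in this thread. The point is only that the traffic scales with `nLayers * dim`, not with total parameter count.

```python
# Rough illustration of why sync traffic tracks nLayers and dim rather
# than total parameter count. The single-gather-per-block assumption
# and the constants are illustrative, not distributed-llama's protocol.

def approx_sync_kb_per_token(n_layers: int, dim: int, n_devices: int,
                             bytes_per_value: float = 1.0 + 4.0 / 32) -> float:
    """Very rough estimate of data exchanged per generated token.

    Q80 stores 1 byte per value plus one float32 scale per 32 values.
    Each device holds 1/n_devices of the activations and must receive
    the remaining (n_devices - 1) slices per block.
    """
    per_block = dim * bytes_per_value * (n_devices - 1) / n_devices
    return n_layers * per_block / 1000.0

llama7b = approx_sync_kb_per_token(n_layers=32, dim=4096, n_devices=4)
llama70b = approx_sync_kb_per_token(n_layers=80, dim=8192, n_devices=4)
# 70B has ~10x the parameters of 7B, but in this model the sync traffic
# grows only with nLayers * dim: (80 * 8192) / (32 * 4096) = 5x, not 10x.
print(llama7b, llama70b)
```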
Thanks for your explanation.
Thank you. Have you run any experiments on a high-end switch, e.g. a Google Cloud instance? If the result supports the poor-switch hypothesis, then the bottleneck of this repo is not the communication overhead.
@zhengpeirong please check this report (4 x c3d-highcpu-30 / Google Cloud).
For Llama 7B / Q40 Weights Q80 Buffer I got:
Metric | 1 VM | 2 VMs | 4 VMs |
---|---|---|---|
Avg transfer time | 0.19 ms | 7.62 ms | 12.81 ms |
The data needed for the synchronization per 1 token (Q80 Buffer):
Model | 2 devices | 4 devices | 8 devices |
---|---|---|---|
Llama 2 7B | 1112 kB | 2830 kB | 6008 kB |
So yes, if you have a fast enough link between nodes, the communication is not the bottleneck.
Btw: a USB4 link may achieve 10 Gbps. Google Cloud is much, much slower than this.
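Taking the 7B numbers from the table above, the ideal wire time per token on a 10 Gbps link would be small (this ignores latency, framing, and protocol overhead, so it is a lower bound only):

```python
# Ideal wire time for one token's sync data on a 10 Gbps link,
# ignoring latency, framing and protocol overhead.
sync_kb_per_token = 2830          # Llama 2 7B, 4 devices (table above)
link_gbps = 10                    # USB4-class link

bits = sync_kb_per_token * 1000 * 8
ms = bits / (link_gbps * 1e9) * 1000
print(f"{ms:.2f} ms")             # prints "2.26 ms" of pure transfer
```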
Thank you!! I have checked the specifications of the switch: Packet Buffer Memory: 1.5 Mb; Jumbo Frames: 16 KB. Although I don't know the process inside the switch, this "Packet Buffer Memory" may be the key challenge for communication when it's smaller than the size of the synchronization data.
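For scale: switch datasheets usually quote packet buffer in megabits, so 1.5 Mb is only about 187.5 kB shared across all ports. The per-token sync data is never all in flight at once, so this comparison is only suggestive, but the gap is large:

```python
# The LS1008G datasheet quotes 1.5 Mb (megabits) of packet buffer,
# i.e. about 187.5 kB shared across all eight ports.
buffer_kb = 1.5 * 1000 / 8            # 187.5 kB
sync_kb_per_token = 2830              # Llama 2 7B, 4 devices (see the table above)
print(sync_kb_per_token / buffer_kb)  # ~15x the buffer per generated token
```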
@b4rtaz I recommend synchronizing more often with smaller chunks instead of the whole QKV or FFN result. This should improve the transfer time by reducing the communication burst to a level a low-end switch can handle.
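The chunking idea in its simplest form: split a large buffer into bounded pieces and write them out one at a time, instead of one large burst. The function name and the 32 KiB chunk size are hypothetical, not distributed-llama's actual code.

```python
# Minimal sketch of chunked synchronization: send a large buffer as
# several smaller writes so no single burst overwhelms the switch's
# packet buffer. Names and chunk size are hypothetical.

def iter_chunks(payload: bytes, chunk_size: int = 32 * 1024):
    """Yield payload in chunk_size pieces (the last may be shorter)."""
    for offset in range(0, len(payload), chunk_size):
        yield payload[offset:offset + chunk_size]

payload = bytes(100_000)
chunks = list(iter_chunks(payload))
assert b"".join(chunks) == payload    # reassembles losslessly
print(len(chunks))                    # prints 4 (chunks of <= 32 KiB)
```

In a real sender each chunk would go through `socket.sendall` (or similar), letting the receiver start draining its buffers before the next chunk arrives.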
Today, I tried a minor adjustment to the order of synchronization:
(current) matmul Q -> matmul K -> matmul V -> sync Q -> sync K -> sync V
VS
(new) matmul Q -> sync Q -> matmul K -> sync K -> matmul V -> sync V
Setup: 4 x Raspberry Pi 5 8GB, Llama 3 8B Q40, Q80 Buffer, TP-Link LS1008G Switch.
Results:
I found that the llamaSyncAttQ, llamaSyncAttK and llamaSyncAttV tasks are all set to TASK_TYPE_INFERENCE. That may affect the transfer time statistics.
Setup: the same as before.
Commit: ad10e18
Test 1:
```
b4rtaz@raspberrypi3:~/distributed-llama $ ./main inference --prompt "The Eiffel Tower is" --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model ../dllama_meta-llama-3-8b_q40.bin --tokenizer ../dllama-llama3-tokenizer.t --steps 64 --workers 10.0.0.4:9999 10.0.0.1:9999 10.0.0.2:9999
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 4
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
⏩ Loaded 6323781632 bytes
...
```
Generated tokens: 64
Avg generation time: 343.39 ms
Avg inference time: 258.80 ms
Avg transfer time: 82.66 ms
Test 2:
Generated tokens: 64
Avg generation time: 347.48 ms
Avg inference time: 257.58 ms
Avg transfer time: 79.97 ms
Test 3:
Generated tokens: 64
Avg generation time: 339.42 ms
Avg inference time: 258.86 ms
Avg transfer time: 78.42 ms
Test 4:
Generated tokens: 64
Avg generation time: 334.41 ms
Avg inference time: 251.34 ms
Avg transfer time: 80.67 ms
Commit: 7f63f9e
Test 1:
Generated tokens: 64
Avg generation time: 329.61 ms
Avg inference time: 252.23 ms
Avg transfer time: 75.52 ms
Test 2:
Generated tokens: 64
Avg generation time: 333.89 ms
Avg inference time: 253.94 ms
Avg transfer time: 78.00 ms
Test 3:
Generated tokens: 64
Avg generation time: 330.98 ms
Avg inference time: 252.69 ms
Avg transfer time: 76.47 ms
Test 4:
Generated tokens: 64
Avg generation time: 327.75 ms
Avg inference time: 247.88 ms
Avg transfer time: 77.30 ms
So we have 80.43 ms for 0.3.0 vs 76.82 ms for 0.3.1 (n=4).
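The two averages can be recomputed directly from the four runs listed above for each commit:

```python
# Recomputing the quoted averages from the four runs per commit.
v030 = [82.66, 79.97, 78.42, 80.67]   # commit ad10e18 (current ordering)
v031 = [75.52, 78.00, 76.47, 77.30]   # commit 7f63f9e (new ordering)

mean = lambda xs: sum(xs) / len(xs)
print(f"{mean(v030):.2f} ms vs {mean(v031):.2f} ms")  # prints "80.43 ms vs 76.82 ms"
```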
My setup looks very non-deterministic. Yesterday I observed an average inference time close to 320.00 ms; today it's close to 250 ms. 🤯 I may have positioned the cooling fan better today. My setup is a bit improvised:
Yesterday I achieved a similar average transfer time with the old version as I did today with the new one. So I think my tests cannot prove or disprove whether this approach is better.