b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

Master process crashes running out of memory on a 8 GB RPi 5 #8

Closed: segabor closed this issue 5 months ago

segabor commented 5 months ago

I set up a single master-worker pair to experiment with distributed-llama. The master is an RPi 5 with 8 GB RAM and the only worker is an RPi 4 with the same amount of memory. When I run the inference, the master crashes after a while with a segfault. The worker also quits because the socket connection is closed. Any idea why? I tried the smallest model, llama-2-7b.

Terminal capture from the master:

segabor@bigfive:~/src/distributed-llama $ ./run.sh 
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 2
./run.sh: line 9: 268004 Segmentation fault      sudo nice -n -20 ./main inference --model /mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin --tokenizer ./tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 30.0.0.12:9998

Worker capture:

^Csegabor@lohere:~/src/distributed-llama$ ./run.sh 
Listening on 0.0.0.0:9998...
Client connected
💡 sliceIndex: 1
💡 nSlices: 2
⏩ Received 56918016 bytes for block 0 (83092 kB/s)
⏩ Received 56918016 bytes for block 1 (112709 kB/s)
⏩ Received 56918016 bytes for block 2 (112486 kB/s)
⏩ Received 56918016 bytes for block 3 (112709 kB/s)
⏩ Received 56918016 bytes for block 4 (91069 kB/s)
⏩ Received 56918016 bytes for block 5 (114986 kB/s)
⏩ Received 56918016 bytes for block 6 (103865 kB/s)
⏩ Received 56918016 bytes for block 7 (106190 kB/s)
⏩ Received 56918016 bytes for block 8 (112709 kB/s)
⏩ Received 56918016 bytes for block 9 (63172 kB/s)
⏩ Received 56918016 bytes for block 10 (63172 kB/s)
⏩ Received 56918016 bytes for block 11 (63313 kB/s)
⏩ Received 56918016 bytes for block 12 (63313 kB/s)
⏩ Received 56918016 bytes for block 13 (63172 kB/s)
⏩ Received 56918016 bytes for block 14 (60810 kB/s)
⏩ Received 56918016 bytes for block 15 (64097 kB/s)
⏩ Received 56918016 bytes for block 16 (60551 kB/s)
⏩ Received 56918016 bytes for block 17 (60358 kB/s)
⏩ Received 56918016 bytes for block 18 (60423 kB/s)
⏩ Received 56918016 bytes for block 19 (61600 kB/s)
⏩ Received 56918016 bytes for block 20 (62205 kB/s)
⏩ Received 56918016 bytes for block 21 (61136 kB/s)
⏩ Received 56918016 bytes for block 22 (62138 kB/s)
⏩ Received 56918016 bytes for block 23 (64753 kB/s)
⏩ Received 56918016 bytes for block 24 (100208 kB/s)
⏩ Received 56918016 bytes for block 25 (112486 kB/s)
⏩ Received 56918016 bytes for block 26 (112486 kB/s)
⏩ Received 56918016 bytes for block 27 (114064 kB/s)
⏩ Received 56918016 bytes for block 28 (111823 kB/s)
⏩ Received 56918016 bytes for block 29 (111168 kB/s)
Error receiving data: socket closed
segabor commented 5 months ago

rpi5_dmesg.log

b4rtaz commented 5 months ago
  1. Could you run a single instance on your RPi 4?
  2. Could you run a single instance on your RPi 5?
segabor commented 5 months ago

Both crashed with SIGSEGV, indicating they ran out of memory.

The command I used on both devices:

sudo nice -n -20 ./main inference \
  --model ./dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4
b4rtaz commented 5 months ago

Could you confirm the size of your weights file?

b4rtaz@b4rtazs-MacBook-Pro converter % ls -l
total 267075104
drwxr-xr-x@ 3 b4rtaz  staff           96 Dec  9 00:40 __pycache__
-rw-r--r--@ 1 b4rtaz  staff         6310 Jan  7 22:12 converter.py
-rw-r--r--@ 1 b4rtaz  staff   7887097884 Jan  8 13:09 dllama_llama-2-13b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff  39706066972 Jan  8 01:05 dllama_llama-2-70b_q40.bin
-rw-r--r--@ 1 b4rtaz  staff   4242882588 Jan  7 22:23 dllama_llama-2-7b_q40.bin
...

In the logs from your root node I don't see this part:

...
💡 nSlices: 1
⏩ Loaded 4242882560 bytes

So it looks like the weights file is missing some bytes.
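A quick way to catch this before launching inference is to compare the converted file's size on disk against the expected byte count. A minimal sketch; the expected size is taken from the listing above and the path from segabor's setup:

import os

# Expected size of dllama_llama-2-7b_q40.bin, taken from b4rtaz's listing above.
EXPECTED_BYTES = 4242882588
PATH = "/mnt/data/llama-2-7b/dllama_llama-2-7b_q40.bin"

actual = os.path.getsize(PATH)
print(f"expected {EXPECTED_BYTES} bytes, found {actual} bytes")
if actual != EXPECTED_BYTES:
    print("Size mismatch: the conversion is probably incomplete or used the wrong settings.")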

segabor commented 5 months ago

Here's the listing of the llama-2-7b model directory, including the converted weights file:

segabor@bigfive:~/src/distributed-llama $ ls -l /mnt/data/llama-2-7b/
total 17024788
-rw-r--r-- 1 segabor segabor         100 Jan 25 07:09 checklist.chk
-rw-r--r-- 1 segabor segabor 13476925163 Jan 25 07:09 consolidated.00.pth
-rw-r--r-- 1 segabor segabor  3956441088 Jan 25 13:26 dllama_llama-2-7b_q40.bin
-rw-r--r-- 1 segabor segabor         105 Jan 25 09:46 params.json

Thanks! Apparently the size of my weights file doesn't match the corresponding file in your list; it's the 70b one! I'm going to close this ticket, no bug here!

segabor commented 5 months ago

I ran the conversion again and that fixed the single-node run.

The latest run on my RPi 5 is shown below:

segabor@bigfive:~/src/distributed-llama $ ./run_single.sh 
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 2460 ms I 2458 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G 2409 ms I 2409 ms T    0 ms S      0 kB R      0 kB  world
🔶 G 2398 ms I 2397 ms T    0 ms S      0 kB R      0 kB ,
🔶 G 2400 ms I 2399 ms T    0 ms S      0 kB R      0 kB  I
🔶 G 2433 ms I 2428 ms T    4 ms S      0 kB R      0 kB '
🔶 G 2406 ms I 2405 ms T    0 ms S      0 kB R      0 kB m
🔶 G 2438 ms I 2432 ms T    4 ms S      0 kB R      0 kB  new
🔶 G 2403 ms I 2402 ms T    0 ms S      0 kB R      0 kB  to
🔶 G 2405 ms I 2404 ms T    0 ms S      0 kB R      0 kB  this
🔶 G 2407 ms I 2406 ms T    0 ms S      0 kB R      0 kB  and
🔶 G 2453 ms I 2452 ms T    0 ms S      0 kB R      0 kB  have
🔶 G 2408 ms I 2407 ms T    0 ms S      0 kB R      0 kB  a
🔶 G 2411 ms I 2410 ms T    0 ms S      0 kB R      0 kB  question
🔶 G 2416 ms I 2415 ms T    0 ms S      0 kB R      0 kB  for
🔶 G 2416 ms I 2415 ms T    0 ms S      0 kB R      0 kB  you
🔶 G 2448 ms I 2447 ms T    0 ms S      0 kB R      0 kB .
Generated tokens:    16
Avg generation time: 2419.44 ms
Avg inference time:  2417.88 ms
Avg transfer time:   0.50 ms
b4rtaz commented 5 months ago

Cool! The performance seems a bit low though; a single RasPi 4B reaches 1312.50 ms per token. Did you start the inference with --nthreads 4?

segabor commented 5 months ago

@b4rtaz yes, threads are set to 4. But I realized main was built without optimizations. After recompiling with -O3, the single-node run performed as below:

💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G  420 ms I  418 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G  495 ms I  491 ms T    4 ms S      0 kB R      0 kB  world
🔶 G  410 ms I  407 ms T    2 ms S      0 kB R      0 kB !
🔶 G  414 ms I  413 ms T    0 ms S      0 kB R      0 kB  The
🔶 G  410 ms I  409 ms T    0 ms S      0 kB R      0 kB  new
🔶 G  453 ms I  444 ms T    8 ms S      0 kB R      0 kB  year
🔶 G  414 ms I  412 ms T    1 ms S      0 kB R      0 kB  is
🔶 G  447 ms I  442 ms T    4 ms S      0 kB R      0 kB  upon
🔶 G  446 ms I  442 ms T    4 ms S      0 kB R      0 kB  us
🔶 G  412 ms I  411 ms T    0 ms S      0 kB R      0 kB ,
🔶 G  448 ms I  444 ms T    4 ms S      0 kB R      0 kB  and
🔶 G  413 ms I  412 ms T    0 ms S      0 kB R      0 kB  as
🔶 G  449 ms I  448 ms T    0 ms S      0 kB R      0 kB  always
🔶 G  452 ms I  448 ms T    4 ms S      0 kB R      0 kB ,
🔶 G  451 ms I  446 ms T    4 ms S      0 kB R      0 kB  we
🔶 G  446 ms I  446 ms T    0 ms S      0 kB R      0 kB  have
Generated tokens:    16
Avg generation time: 436.25 ms
Avg inference time:  433.31 ms
Avg transfer time:   2.19 ms
b4rtaz commented 5 months ago

Wow! Nice speedup compared to the RasPi 4B.

segabor commented 5 months ago

Yeah, I'd expect some improvement on a successor board. I also tested the code with CM4s fitted with 8 GB RAM. Initial results:

Model: Llama 2 7b           Single node    2 nodes    4 nodes
Avg generation time (ms)         448.00     748.94     491.38
Avg inference time (ms)          442.06     259.94     166.44
Avg transfer time (ms)             5.25     488.62     324.50

The master was my RPi 5 and the remaining workers were the CM4s. Unfortunately I don't own more RPi 4 or CM4 modules, so I wasn't able to test an 8-node system.

b4rtaz commented 5 months ago

I think your results are correct. The problem here is that you are using devices with different processor speeds.

Basically, your result is limited by the slowest device. The CM4 is essentially a RasPi 4 (1.5 GHz), and in my tests I got 793.69 ms for 2x RasPi 4B, which is almost the same as your 2-node result. Distributed Llama doesn't split the load depending on processor speed.

You should observe much better results if you used 2x RasPi 5.
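A small sketch of that bottleneck: with an even split of the per-token work, every node gets the same slice, so a step finishes only when the slowest node does. The per-node speeds below are illustrative assumptions, not measurements, and transfer time is ignored.

# Distributed Llama splits the per-token work evenly, not by CPU speed, so the
# step time is set by the slowest node. The ops/ms figures are illustrative.
ops_per_token = 1000

clusters = {
    "1x RPi 5":       [3.0],        # a single fast node
    "RPi 5 + 1x CM4": [3.0, 1.5],   # fast root + slower worker
    "2x RPi 5":       [3.0, 3.0],   # matched fast nodes
}

for name, speeds in clusters.items():
    share = ops_per_token / len(speeds)               # equal slice per node
    step_ms = max(share / speed for speed in speeds)  # wait for the slowest
    print(f"{name:14s}: {step_ms:6.1f} ms per token")

In this toy model, pairing the fast node with a slower worker barely changes the step time, while a matched pair roughly halves it, which matches the pattern in the results above.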

segabor commented 5 months ago

A more robust technique would be to distribute the workload using k8s or similar orchestration, but that's another story. By the way, I'm expecting a brand new Rockchip-based SoC with 32 GB RAM to arrive early next month. Once I get it I will post a single-node benchmark. https://turingpi.com/exciting-updates-on-turing-rk1-compute-module/

b4rtaz commented 5 months ago

A more robust technique would be to distribute the workload using k8s or similar orchestration.

Over the last few days I tested a few VM configurations in Google Cloud. This is the best test so far.

Once I get it I will post a single-node benchmark

Cool!

Vrownie commented 3 months ago

Hi all, just wanted to report some findings with multiple Pi 5 8 GB nodes:

llama-2 7B               1 node       2 nodes      4 nodes
Avg generation time      419.56 ms    297.10 ms    241.50 ms
Avg inference time       412.76 ms    254.48 ms    163.24 ms
Avg transfer time          6.40 ms     42.08 ms     77.90 ms

Surprised to see that going from 2 to 4 nodes yields such a small improvement... Any thoughts?

b4rtaz commented 3 months ago

@Vrownie I think your results are correct: after adding 2 more devices you should expect close to a 2x improvement over the 2-node setup (not 4x over a single node).

You can think of it this way: if 1 device needs to perform 1000 operations, then in the best case 2 devices perform 500 operations each (2x faster than a single device), and 4 devices perform 250 operations each (4x faster than a single device, but only 2x faster than 2 devices).

412.76 ms (1 node)  / 254.48 ms (2 nodes) => 1.6x (ideal: 2x)
412.76 ms (1 node)  / 163.24 ms (4 nodes) => 2.5x (ideal: 4x)
254.48 ms (2 nodes) / 163.24 ms (4 nodes) => 1.6x (ideal: 2x)

Another factor is that the root node always has a bit more computation to perform than the workers, so the execution time doesn't decrease linearly.
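A rough way to see why the scaling is sublinear is to model the per-token time as an evenly split parallel part plus a fixed part that only the root node performs. This is a toy sketch, not a measurement: only the 1-node inference time comes from the table above, the 90/10 split between parallel and root-only work is an illustrative assumption, and transfer time (which also grows with node count) is ignored.

# Toy model: per-token time = (parallel work / node count) + fixed root-only work.
# Only single_node_ms is taken from the table above; the 90/10 split is assumed.
single_node_ms = 412.76                # 1-node avg inference time
parallel_ms   = 0.9 * single_node_ms   # work that divides across nodes
root_only_ms  = 0.1 * single_node_ms   # work the root node does regardless

for nodes in (1, 2, 4, 8):
    per_token = parallel_ms / nodes + root_only_ms
    print(f"{nodes} node(s): {per_token:6.1f} ms/token, "
          f"{single_node_ms / per_token:.2f}x speedup")

Even this simple model never reaches the ideal Nx speedup, because the root-only share stays constant while only the parallel share shrinks.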