Closed segabor closed 5 months ago
Both crashed with SIGSEGV, indicating they ran out of memory.
The command I used on both devices
sudo nice -n -20 ./main inference \
--model ./dllama_llama-2-7b_q40.bin \
--tokenizer ./tokenizer.bin \
--weights-float-type q40 \
--buffer-float-type q80 \
--prompt "Hello world" \
--steps 16 \
--nthreads 4
Could you confirm the size of your file with weights?
b4rtaz@b4rtazs-MacBook-Pro converter % ls -l
total 267075104
drwxr-xr-x@ 3 b4rtaz staff 96 Dec 9 00:40 __pycache__
-rw-r--r--@ 1 b4rtaz staff 6310 Jan 7 22:12 converter.py
-rw-r--r--@ 1 b4rtaz staff 7887097884 Jan 8 13:09 dllama_llama-2-13b_q40.bin
-rw-r--r--@ 1 b4rtaz staff 39706066972 Jan 8 01:05 dllama_llama-2-70b_q40.bin
-rw-r--r--@ 1 b4rtaz staff 4242882588 Jan 7 22:23 dllama_llama-2-7b_q40.bin
...
In your logs of the root node I don't see this part:
...
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
So it looks like the weights file doesn't have all bytes.
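A quick way to catch a truncated conversion is to compare the on-disk size against the byte count a known-good run reports. This is just a sketch: the expected value 4242882560 is taken from the "Loaded ... bytes" line above (the file on disk is slightly larger because of the header), and the helper name is my own, not part of the project.

```python
import os

# Byte count reported by the loader for a good dllama_llama-2-7b_q40.bin
# (the file itself is a few bytes larger due to the header).
EXPECTED_WEIGHT_BYTES = 4242882560

def check_weights(path: str, expected: int = EXPECTED_WEIGHT_BYTES) -> bool:
    """Return True if the converted weights file has at least `expected` bytes."""
    actual = os.path.getsize(path)
    if actual < expected:
        print(f"{path}: {actual} bytes, expected >= {expected} (truncated conversion?)")
        return False
    return True
```

A file of 3956441088 bytes, as in the listing below, would fail this check immediately.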
Here's the listing of the llama 7b model and the converted weights file:
segabor@bigfive:~/src/distributed-llama $ ls -l /mnt/data/llama-2-7b/
total 17024788
-rw-r--r-- 1 segabor segabor 100 Jan 25 07:09 checklist.chk
-rw-r--r-- 1 segabor segabor 13476925163 Jan 25 07:09 consolidated.00.pth
-rw-r--r-- 1 segabor segabor 3956441088 Jan 25 13:26 dllama_llama-2-7b_q40.bin
-rw-r--r-- 1 segabor segabor 105 Jan 25 09:46 params.json
Thanks! Apparently the size of the weights file doesn't match the correct file on your list — it looks like it came from the 70b conversion! I'm going to close this ticket, no error.
I've run the conversion again and it fixed the single-node run.
The latest run on my RPi 5 looks like this:
segabor@bigfive:~/src/distributed-llama $ ./run_single.sh
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 2460 ms I 2458 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 2409 ms I 2409 ms T 0 ms S 0 kB R 0 kB world
🔶 G 2398 ms I 2397 ms T 0 ms S 0 kB R 0 kB ,
🔶 G 2400 ms I 2399 ms T 0 ms S 0 kB R 0 kB I
🔶 G 2433 ms I 2428 ms T 4 ms S 0 kB R 0 kB '
🔶 G 2406 ms I 2405 ms T 0 ms S 0 kB R 0 kB m
🔶 G 2438 ms I 2432 ms T 4 ms S 0 kB R 0 kB new
🔶 G 2403 ms I 2402 ms T 0 ms S 0 kB R 0 kB to
🔶 G 2405 ms I 2404 ms T 0 ms S 0 kB R 0 kB this
🔶 G 2407 ms I 2406 ms T 0 ms S 0 kB R 0 kB and
🔶 G 2453 ms I 2452 ms T 0 ms S 0 kB R 0 kB have
🔶 G 2408 ms I 2407 ms T 0 ms S 0 kB R 0 kB a
🔶 G 2411 ms I 2410 ms T 0 ms S 0 kB R 0 kB question
🔶 G 2416 ms I 2415 ms T 0 ms S 0 kB R 0 kB for
🔶 G 2416 ms I 2415 ms T 0 ms S 0 kB R 0 kB you
🔶 G 2448 ms I 2447 ms T 0 ms S 0 kB R 0 kB .
Generated tokens: 16
Avg generation time: 2419.44 ms
Avg inference time: 2417.88 ms
Avg transfer time: 0.50 ms
Cool! Slightly poor performance, though — a single RPi 4B reaches 1312.50 ms per token. Have you started the inference with `--nthreads 4`?
@b4rtaz yes, threads are set to 4. But I realized the `main` binary was unoptimized. After recompiling with `-O3`, the single-node run performed as below:
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 420 ms I 418 ms T 0 ms S 0 kB R 0 kB Hello
🔶 G 495 ms I 491 ms T 4 ms S 0 kB R 0 kB world
🔶 G 410 ms I 407 ms T 2 ms S 0 kB R 0 kB !
🔶 G 414 ms I 413 ms T 0 ms S 0 kB R 0 kB The
🔶 G 410 ms I 409 ms T 0 ms S 0 kB R 0 kB new
🔶 G 453 ms I 444 ms T 8 ms S 0 kB R 0 kB year
🔶 G 414 ms I 412 ms T 1 ms S 0 kB R 0 kB is
🔶 G 447 ms I 442 ms T 4 ms S 0 kB R 0 kB upon
🔶 G 446 ms I 442 ms T 4 ms S 0 kB R 0 kB us
🔶 G 412 ms I 411 ms T 0 ms S 0 kB R 0 kB ,
🔶 G 448 ms I 444 ms T 4 ms S 0 kB R 0 kB and
🔶 G 413 ms I 412 ms T 0 ms S 0 kB R 0 kB as
🔶 G 449 ms I 448 ms T 0 ms S 0 kB R 0 kB always
🔶 G 452 ms I 448 ms T 4 ms S 0 kB R 0 kB ,
🔶 G 451 ms I 446 ms T 4 ms S 0 kB R 0 kB we
🔶 G 446 ms I 446 ms T 0 ms S 0 kB R 0 kB have
Generated tokens: 16
Avg generation time: 436.25 ms
Avg inference time: 433.31 ms
Avg transfer time: 2.19 ms
Wow! Nice acceleration compared to the RPi 4B.
Yeah, I'd expect some improvement on a successor board. I also tested the code with CM4s fitted with 8 GB RAM. Initial results:
| Model = Llama 2 7b | Single Node | 2 Nodes | 4 Nodes |
|---|---|---|---|
| Avg generation time | 448.00 ms | 748.94 ms | 491.38 ms |
| Avg inference time | 442.06 ms | 259.94 ms | 166.44 ms |
| Avg transfer time | 5.25 ms | 488.62 ms | 324.50 ms |
Here the master was my RPi 5 and the remaining workers were the CM4s. Unfortunately I don't own more RPi 4 or CM4 modules, so I wasn't able to test an 8-node system.
I think your results are correct. The problem here is that you are using devices with different processor speeds.
Basically, your results are limited by the slowest device. The CM4 is basically a RasPi 4 (1.5 GHz); in my tests I got 793.69 ms for 2x RasPi 4B, and you have almost the same result. Distributed Llama doesn't split the load depending on processor speed.
You should observe much better results if you used 2x RasPi 5.
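The slowest-device bottleneck can be sketched with a toy model (the numbers below are illustrative, not measurements): each synchronous token step has to wait for every node, so the per-step compute time is the maximum over the nodes, not the average.

```python
def step_time_ms(per_node_work_ms: list[float]) -> float:
    """A synchronous step finishes only when the slowest node does."""
    return max(per_node_work_ms)

# Toy numbers: suppose an RPi 5 finishes its share of a layer in 200 ms,
# while a slower CM4 needs 350 ms for the same share.
mixed = step_time_ms([200.0, 350.0])    # limited by the CM4
uniform = step_time_ms([200.0, 200.0])  # two identical fast nodes
```

With an even split, the mixed pair runs at the CM4's pace (350 ms here), while two identical fast nodes would finish in 200 ms — which is why homogeneous nodes scale better.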
A more robust technique would be to distribute workloads using k8s or similar orchestration. But that's another story. By the way, I'm expecting a brand-new Rockchip-based SoC with 32 GB RAM to arrive early next month. Once I get it I will post a single-node benchmark. https://turingpi.com/exciting-updates-on-turing-rk1-compute-module/
> A more robust technique would be like distributing workloads using k8s or similar orchestration.

In recent days I tested a few configurations of VMs in Google Cloud. This is the best result so far.

> Once I get it I will post a single-node benchmark

Cool!
Hi all, just want to report some findings with multiple Pi 5 8GB nodes:
| llama-2 7B | 1 Node | 2 Nodes | 4 Nodes |
|---|---|---|---|
| Avg generation time | 419.56 ms | 297.10 ms | 241.50 ms |
| Avg inference time | 412.76 ms | 254.48 ms | 163.24 ms |
| Avg transfer time | 6.40 ms | 42.08 ms | 77.90 ms |
Surprised to see that going from 2 to 4 nodes yields so little improvement... Any thoughts?
@Vrownie I think your results are correct: after adding 2 more devices you should expect close to a 2x improvement over 2 nodes (not 4x).
You can think of it this way: if 1 device needs to perform 1000 operations, then in the best case 2 devices need to perform 500 operations each (2x faster), and 4 devices need to perform 250 operations each (4x faster than one device, but only 2x faster than two).
412.76 ms (1 node) / 254.48 ms (2 nodes) => 1.6x speedup (ideal: 2x)
412.76 ms (1 node) / 163.24 ms (4 nodes) => 2.5x speedup (ideal: 4x)
254.48 ms (2 nodes) / 163.24 ms (4 nodes) => 1.6x speedup (ideal: 2x)
Another factor is that the root node always has a bit more computation to perform than the workers, so the execution time doesn't decrease linearly.
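The arithmetic above can be restated compactly: the ideal speedup equals the node count, and the observed speedup is the ratio of per-token inference times (the numbers come from the Pi 5 table earlier in the thread).

```python
def speedup(baseline_ms: float, current_ms: float) -> float:
    """Observed speedup: how many times faster than the baseline."""
    return baseline_ms / current_ms

# Avg inference times reported above for 1, 2, and 4 Pi 5 nodes.
one, two, four = 412.76, 254.48, 163.24

assert round(speedup(one, two), 1) == 1.6   # ideal: 2.0
assert round(speedup(one, four), 1) == 2.5  # ideal: 4.0
```

The gap between observed and ideal speedup grows with node count, consistent with the extra work on the root node and the rising transfer times in the table.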
I set up a single master-worker pair to experiment with distributed llama. The master is an RPi 5 with 8 GB RAM and the only worker is an RPi 4 with the same amount of memory. When I run inference, the master crashes after a while with a segfault, and the worker also quits due to the closed socket connection. Any idea why? I tried the smallest model, llama-2-7b.
Terminal capture from the master:
Worker capture: