b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Turing RK1 compute module results #10

Closed segabor closed 3 months ago

segabor commented 4 months ago

I promised to share the results for the Turing RK1 module. It arrived yesterday, so I took the chance to run distributed-llama on it.

CPU: 8 cores
RAM: 32 GB
Storage: 1 TB NVMe SSD
OS: custom Ubuntu Server
Model: llama-2-7b

Command

sudo nice -n -20 ./main inference \
  --model /mnt/bigdata/llama-2-7b/dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4

Result

💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G  372 ms I  372 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G  378 ms I  378 ms T    0 ms S      0 kB R      0 kB  world
🔶 G  369 ms I  367 ms T    1 ms S      0 kB R      0 kB ,
🔶 G  379 ms I  379 ms T    0 ms S      0 kB R      0 kB  I
🔶 G  424 ms I  397 ms T   27 ms S      0 kB R      0 kB '
🔶 G  376 ms I  376 ms T    0 ms S      0 kB R      0 kB m
🔶 G  378 ms I  377 ms T    0 ms S      0 kB R      0 kB  E
🔶 G  407 ms I  407 ms T    0 ms S      0 kB R      0 kB .
🔶 G  383 ms I  380 ms T    0 ms S      0 kB R      0 kB  січня
🔶 G  372 ms I  371 ms T    1 ms S      0 kB R      0 kB  
🔶 G  379 ms I  378 ms T    0 ms S      0 kB R      0 kB 2
🔶 G  374 ms I  373 ms T    0 ms S      0 kB R      0 kB 0
🔶 G  382 ms I  381 ms T    0 ms S      0 kB R      0 kB 1
🔶 G  375 ms I  373 ms T    2 ms S      0 kB R      0 kB 8
🔶 G  378 ms I  377 ms T    1 ms S      0 kB R      0 kB  at
🔶 G  382 ms I  382 ms T    0 ms S      0 kB R      0 kB  
Generated tokens:    16
Avg generation time: 381.75 ms
Avg inference time:  379.25 ms
Avg transfer time:   2.00 ms
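As a sanity check, the averages above can be recomputed from the per-token timings in the log. This is a minimal sketch (the list of values is copied from the `G` column above; the script itself is not part of distributed-llama):

```python
# Per-token generation times in ms, copied from the log output above.
gen_ms = [372, 378, 369, 379, 424, 376, 378, 407, 383, 372,
          379, 374, 382, 375, 378, 382]

avg_gen = sum(gen_ms) / len(gen_ms)      # average generation time per token
tokens_per_s = 1000.0 / avg_gen          # implied throughput

print(f"Avg generation time: {avg_gen:.2f} ms")   # 381.75 ms, matching the log
print(f"Throughput: {tokens_per_s:.2f} tokens/s")
```

This matches the reported 381.75 ms average, i.e. roughly 2.6 tokens/s on a single RK1 node.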
segabor commented 4 months ago

This is a CPU-only run. The Rockchip has a dedicated neural engine rated at 6 TOPS. I will try to unlock it, but that will take some time.

b4rtaz commented 4 months ago

Interesting...

The Turing RK1 costs $149 for the 8 GB variant, which works out to $24.8 / TOPS. A GeForce RTX 4070 costs ~$550 for 29.15 TFLOPS, so $18.9 / TFLOP. The units are different, but I suppose it's still cheaper to buy a few GeForces.

segabor commented 4 months ago

No doubt the GPU wins the performance-per-dollar contest. But it's nice to have an NPU on an IoT board.

b4rtaz commented 3 months ago

I'm closing this issue (because it's not an issue) and moving the results here.