b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Turing RK1 compute module results #10

Closed segabor closed 3 months ago

segabor commented 4 months ago

I promised to share the results for the Turing RK1 module. It arrived yesterday, so I took the chance to run distributed-llama on it.

CPU: 8 cores
RAM: 32 GB
Storage: 1 TB NVMe SSD
OS: custom Ubuntu Server
Model: llama-2-7b

Command

sudo nice -n -20 ./main inference \
  --model /mnt/bigdata/llama-2-7b/dllama_llama-2-7b_q40.bin \
  --tokenizer ./tokenizer.bin \
  --weights-float-type q40 \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4

Result

💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G  372 ms I  372 ms T    0 ms S      0 kB R      0 kB Hello
🔶 G  378 ms I  378 ms T    0 ms S      0 kB R      0 kB  world
🔶 G  369 ms I  367 ms T    1 ms S      0 kB R      0 kB ,
🔶 G  379 ms I  379 ms T    0 ms S      0 kB R      0 kB  I
🔶 G  424 ms I  397 ms T   27 ms S      0 kB R      0 kB '
🔶 G  376 ms I  376 ms T    0 ms S      0 kB R      0 kB m
🔶 G  378 ms I  377 ms T    0 ms S      0 kB R      0 kB  E
🔶 G  407 ms I  407 ms T    0 ms S      0 kB R      0 kB .
🔶 G  383 ms I  380 ms T    0 ms S      0 kB R      0 kB  січня
🔶 G  372 ms I  371 ms T    1 ms S      0 kB R      0 kB  
🔶 G  379 ms I  378 ms T    0 ms S      0 kB R      0 kB 2
🔶 G  374 ms I  373 ms T    0 ms S      0 kB R      0 kB 0
🔶 G  382 ms I  381 ms T    0 ms S      0 kB R      0 kB 1
🔶 G  375 ms I  373 ms T    2 ms S      0 kB R      0 kB 8
🔶 G  378 ms I  377 ms T    1 ms S      0 kB R      0 kB  at
🔶 G  382 ms I  382 ms T    0 ms S      0 kB R      0 kB  
Generated tokens:    16
Avg generation time: 381.75 ms
Avg inference time:  379.25 ms
Avg transfer time:   2.00 ms
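As a sanity check, the averages above can be recomputed from the per-token timings in the log. This is a minimal sketch (the list of values is copied from the `G` column above; the script itself is not part of distributed-llama):

```python
# Per-token generation times in ms, copied from the log output above.
gen_ms = [372, 378, 369, 379, 424, 376, 378, 407, 383, 372,
          379, 374, 382, 375, 378, 382]

avg_gen = sum(gen_ms) / len(gen_ms)      # average generation time per token
tokens_per_s = 1000.0 / avg_gen          # implied throughput

print(f"Avg generation time: {avg_gen:.2f} ms")   # 381.75 ms, matching the log
print(f"Throughput: {tokens_per_s:.2f} tokens/s")
```

This matches the reported 381.75 ms average, i.e. roughly 2.6 tokens/s on a single RK1 node.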
segabor commented 4 months ago

This is a CPU-only run. The Rockchip has a dedicated neural engine rated at 6 TOPS. I will try to unlock it, but that will take some time.

b4rtaz commented 4 months ago

Interesting...

The Turing RK1 costs $149 for the 8 GB variant, which works out to $24.8 / TOPS. A GeForce RTX 4070 costs ~$550 for 29.15 TFLOPS, so $18.9 / TFLOP. The units are different, but I suppose it's still cheaper to buy a few GeForces.

segabor commented 4 months ago

No doubt the GPU wins the performance-per-dollar contest. But it's nice to have an NPU on an IoT board.

b4rtaz commented 3 months ago

I'm closing this issue (because it's not an issue) and moving the results here.