b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

feat: speed up synchronization of mlp. #64

Closed b4rtaz closed 1 month ago

b4rtaz commented 1 month ago

Transfer / token

Model: dllama_meta-llama-3-8b_q40.bin, buffer: Q80

| Devices   | 0.5.0                          | This PR                       | Percentage change |
|-----------|--------------------------------|-------------------------------|-------------------|
| 2 devices | S 510 kB + R 442 kB = 952 kB   | S 272 kB + R 272 kB = 544 kB  | -42.8%            |
| 4 devices | S 1887 kB + R 867 kB = 2754 kB | S 816 kB + R 816 kB = 1632 kB | -40.7%            |

🤯 🤯
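Not code from the repo, just a standalone sketch that double-checks the arithmetic behind the table above (the struct and helper names are made up for illustration):

```cpp
#include <cstdio>

// Per-token transfer figures copied from the table above (kB).
struct Transfer { double sentKb, recvKb; };

static double totalKb(Transfer t) { return t.sentKb + t.recvKb; }

static double percentChange(double before, double after) {
    return (after - before) / before * 100.0;
}

int main() {
    Transfer before2 = {510.0, 442.0},  after2 = {272.0, 272.0}; // 2 devices: 0.5.0 vs this PR
    Transfer before4 = {1887.0, 867.0}, after4 = {816.0, 816.0}; // 4 devices: 0.5.0 vs this PR

    printf("2 devices: %.0f kB -> %.0f kB (%.1f%%)\n",
           totalKb(before2), totalKb(after2), percentChange(totalKb(before2), totalKb(after2)));
    printf("4 devices: %.0f kB -> %.0f kB (%.1f%%)\n",
           totalKb(before4), totalKb(after4), percentChange(totalKb(before4), totalKb(after4)));
    // Reproduces (up to rounding) the 952 kB -> 544 kB and 2754 kB -> 1632 kB totals
    // and the roughly -42.8% / -40.7% changes shown in the table.
    return 0;
}
```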

b4rtaz commented 1 month ago

Llama 2 7B Q40

nTokens = 90, buffer = Q80

4 x Raspberry Pi 5 8GB

| Version | Avg tokens / second | Avg generation time | Avg inference time | Avg transfer time |
|---------|---------------------|---------------------|--------------------|-------------------|
| This PR | 4.08                | 245.08 ms           | 169.33 ms          | 75.34 ms          |
| 0.7.0   | 3.90                | 256.23 ms           | 168.77 ms          | 87.12 ms          |
| 0.6.0   | 4.24                | 235.69 ms           | 143.44 ms          | 91.77 ms          |

2 x Raspberry Pi 5 8GB

| Version | Avg tokens / second | Avg generation time | Avg inference time | Avg transfer time |
|---------|---------------------|---------------------|--------------------|-------------------|
| This PR | 3.07                | 325.46 ms           | 269.04 ms          | 56.39 ms          |
| 0.7.0   | 2.91                | 343.44 ms           | 266.51 ms          | 76.87 ms          |
| 0.6.0   | 3.06                | 327.17 ms           | 249.80 ms          | 77.28 ms          |

TinyLlama 1.1B 3T Q40

nTokens = 128, buffer = Q80

2 x Raspberry Pi 5 8GB

| Version | Avg tokens / second | Avg generation time | Avg inference time | Avg transfer time |
|---------|---------------------|---------------------|--------------------|-------------------|
| This PR | 16.86               | 59.31 ms            | 50.37 ms           | 8.58 ms           |
| 0.7.0   | 15.17               | 65.93 ms            | 52.07 ms           | 13.45 ms          |

Llama 3 8B Q40

nTokens = 90, buffer = Q80

2 x AMD EPYC 7402P 24-Core Processor

| Version | Avg tokens / second | Avg generation time | Avg inference time | Avg transfer time |
|---------|---------------------|---------------------|--------------------|-------------------|
| This PR | 13.04               | 76.67 ms            | 45.33 ms           | 30.93 ms          |
| 0.7.0   | 12.79               | 78.21 ms            | 46.30 ms           | 31.49 ms          |
| 0.6.0   | 12.55               | 79.71 ms            | 47.08 ms           | 32.22 ms          |