This PR adds basic support for x86 AVX2 processors (Q40 x Q80 matmul).
```
@b4rtaz ➜ /workspaces/distributed-llama (feat/avx2) $ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7763 64-Core Processor
Stepping: 1
CPU MHz: 3091.301
BogoMIPS: 4890.86
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 2 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload umip vaes vpclmulqdq rdpid fsrm
```
Inference run on this machine (per token: G = generation time, I = inference time, T = transfer time; S/R appear to be kilobytes sent/received, zero here since there is a single slice):

```
@b4rtaz ➜ /workspaces/distributed-llama (feat/avx2) $ sudo nice -n -20 ./main inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../tokenizer.bin --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 8
💡 dim: 4096
💡 hiddenDim: 11008
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 32
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 1
⏩ Loaded 4242882560 bytes
🔶 G 250 ms I 246 ms T 4 ms S 0 kB R 0 kB Hello
🔶 G 242 ms I 242 ms T 0 ms S 0 kB R 0 kB world
🔶 G 241 ms I 241 ms T 0 ms S 0 kB R 0 kB ,
🔶 G 346 ms I 330 ms T 16 ms S 0 kB R 0 kB my
🔶 G 249 ms I 249 ms T 0 ms S 0 kB R 0 kB name
🔶 G 267 ms I 267 ms T 0 ms S 0 kB R 0 kB is
🔶 G 314 ms I 288 ms T 26 ms S 0 kB R 0 kB Luis
🔶 G 235 ms I 233 ms T 1 ms S 0 kB R 0 kB Med
🔶 G 243 ms I 242 ms T 1 ms S 0 kB R 0 kB ina
🔶 G 240 ms I 239 ms T 1 ms S 0 kB R 0 kB and
🔶 G 256 ms I 254 ms T 1 ms S 0 kB R 0 kB I
🔶 G 243 ms I 242 ms T 0 ms S 0 kB R 0 kB ’
🔶 G 235 ms I 235 ms T 0 ms S 0 kB R 0 kB m
🔶 G 234 ms I 233 ms T 1 ms S 0 kB R 0 kB from
🔶 G 257 ms I 257 ms T 0 ms S 0 kB R 0 kB San
🔶 G 292 ms I 289 ms T 3 ms S 0 kB R 0 kB Jose
Generated tokens: 16
Avg generation time: 259.00 ms
Avg inference time: 255.44 ms
Avg transfer time: 3.38 ms
```