ubuntu@ubuntu:~$ sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
terminate called after throwing an instance of 'ReadSocketException'
what(): std::exception
Aborted
ubuntu@ubuntu:~$ sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
💡 sliceIndex: 1
💡 nSlices: 4
🕒 ropeCache: 7680 kB
⏩ Received 6048 kB for block 0 (448 kB/s)
⏩ Received 6048 kB for block 1 (2729 kB/s)
⏩ Received 6048 kB for block 2 (2845 kB/s)
⏩ Received 6048 kB for block 3 (2786 kB/s)
⏩ Received 6048 kB for block 4 (2805 kB/s)
⏩ Received 6048 kB for block 5 (2925 kB/s)
⏩ Received 6048 kB for block 6 (2953 kB/s)
⏩ Received 6048 kB for block 7 (3095 kB/s)
⏩ Received 6048 kB for block 8 (3622 kB/s)
⏩ Received 6048 kB for block 9 (3830 kB/s)
⏩ Received 6048 kB for block 10 (3895 kB/s)
⏩ Received 6048 kB for block 11 (3849 kB/s)
⏩ Received 6048 kB for block 12 (3832 kB/s)
⏩ Received 6048 kB for block 13 (3847 kB/s)
⏩ Received 6048 kB for block 14 (3821 kB/s)
⏩ Received 6048 kB for block 15 (3922 kB/s)
⏩ Received 6048 kB for block 16 (3452 kB/s)
⏩ Received 6048 kB for block 17 (3859 kB/s)
⏩ Received 6048 kB for block 18 (3985 kB/s)
⏩ Received 6048 kB for block 19 (3379 kB/s)
⏩ Received 6048 kB for block 20 (3788 kB/s)
⏩ Received 6048 kB for block 21 (4115 kB/s)
The f32 model will not start. I just converted the same model as q40 and it seems to work fine. I tried with ./dllama inference as well.

f32:

q40: