Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
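The core idea is easy to see in miniature: each node holds only its slice of every weight matrix (which is what divides RAM usage), computes a partial result, and the root reassembles the pieces. The sketch below is illustrative C++, not distributed-llama's actual code; the worker count, dimensions, and helper name (kWorkers, kDim, kOut, matmulSlice) are made up for the example.

// Illustrative only: row-sliced tensor parallelism in miniature.
// Each "worker" stores 1/kWorkers of the weight matrix W and computes
// its own slice of y = W * x; the root concatenates the slices.
#include <cstdio>
#include <vector>

constexpr int kWorkers = 4;  // assumed number of nodes
constexpr int kDim = 8;      // input dimension
constexpr int kOut = 8;      // output dimension, divisible by kWorkers

// One worker's share of the work: y_slice = W_slice * x.
std::vector<float> matmulSlice(const std::vector<float>& wSlice,
                               const std::vector<float>& x, int rows) {
    std::vector<float> y(rows, 0.0f);
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < kDim; c++)
            y[r] += wSlice[r * kDim + c] * x[c];
    return y;
}

int main() {
    const int rowsPerWorker = kOut / kWorkers;
    std::vector<float> x(kDim, 1.0f);  // activations, sent to every worker
    std::vector<float> y;              // full output, assembled by the root
    for (int w = 0; w < kWorkers; w++) {
        // In a real deployment this slice lives in worker w's RAM,
        // so total memory is split across the devices.
        std::vector<float> wSlice(rowsPerWorker * kDim, 0.5f);
        std::vector<float> part = matmulSlice(wSlice, x, rowsPerWorker);
        y.insert(y.end(), part.begin(), part.end());
    }
    std::printf("y[0] = %.1f, output length = %zu\n", y[0], y.size());
    return 0;
}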
I got local inference working, but when I try to use workers I get this error.
dllama inference --model dllama_model_tinyllama_1_1b_3t_q40.m --tokenizer dllama_tokenizer_tinyllama_1_1b_3t_q40.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
terminate called after throwing an instance of 'std::runtime_error'
  what(): Cannot create socket
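For context, that log is what an uncaught C++ exception looks like: something in the root process throws std::runtime_error("Cannot create socket") and nothing catches it, so the runtime calls terminate. Below is a minimal sketch of that failure mode, assuming the throw happens right after a failed socket() call; the function name createSocketOrThrow and the WSAStartup detail are illustrative, not taken from the project's source.

// Minimal reproduction of the failure mode, not dllama's actual code.
#include <cstdio>
#include <stdexcept>
#ifdef _WIN32
#include <winsock2.h>  // link with ws2_32
#else
#include <sys/socket.h>
#endif

static int createSocketOrThrow() {
    int fd = (int)socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        throw std::runtime_error("Cannot create socket");  // the message above
    return fd;
}

int main() {
#ifdef _WIN32
    // Without WSAStartup, every socket() call on Windows fails with
    // WSANOTINITIALISED -- one plausible way a native Windows build can
    // throw "Cannot create socket" even though the worker's port is fine.
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) {
        std::printf("WSAStartup failed\n");
        return 1;
    }
#endif
    try {
        std::printf("socket created: %d\n", createSocketOrThrow());
    } catch (const std::runtime_error& e) {
        // If nothing catches the throw, the C++ runtime prints the
        // "terminate called after throwing..." banner from the log above.
        std::printf("error: %s\n", e.what());
    }
    return 0;
}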
My worker is currently running happily, I believe, on the same computer. Just to test whether I would get any response from it, I tried having SillyTavern connect to it; that obviously failed, but it also crashed the worker, so the port is definitely reachable.
C:\SWARM\distributed-llama>dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
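One way to separate "the worker is up" from "the root can actually reach it" is a standalone TCP connect test. The sketch below is a hypothetical helper, not part of dllama; it reuses the 192.168.0.1:9998 address from the inference command above. Since the worker is bound to 0.0.0.0 on the same computer, connecting to 127.0.0.1:9998 should succeed as well.

// Hypothetical standalone check: can this machine open a TCP connection
// to the worker at 192.168.0.1:9998?
#include <cstdio>
#include <cstring>
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>  // inet_pton; link with ws2_32
#else
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#endif

int main() {
#ifdef _WIN32
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);
#endif
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(9998);  // worker port from the command above
    inet_pton(AF_INET, "192.168.0.1", &addr.sin_addr);
    int fd = (int)socket(AF_INET, SOCK_STREAM, 0);
    int rc = connect(fd, (sockaddr*)&addr, sizeof(addr));
    std::printf(rc == 0 ? "port reachable\n" : "connect failed\n");
#ifdef _WIN32
    closesocket(fd);
    WSACleanup();
#else
    close(fd);
#endif
    return 0;
}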