b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

master and worker started but with problems #80

Open fabgat opened 1 month ago

fabgat commented 1 month ago

Hi, I have two VM servers (192.168.0.1: 8 CPUs (2 sockets, 4 cores each), 32 GB RAM; 192.168.0.2: 8 CPUs (2 sockets, 4 cores each), 32 GB RAM) where I've cloned your repo and downloaded the Llama 2 model, then compiled with: make dllama
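
For reference, the setup on each VM was roughly the following (the clone URL here is inferred from the repository name above, so adjust if it differs):

# inferred from "b4rtaz / distributed-llama"; clone and build the dllama binary on both VMs
git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama
make dllama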

On 192.168.0.2 I launch:

root@ollama-test-ai:/opt/distributed-llama# ./dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...

On 192.168.0.1 I use:

root@ollama-test:/opt/distributed-llama# nice -n -20 ./dllama inference --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998
πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 11008
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 32
πŸ’‘ vocabSize: 32000
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
πŸ’‘ ropeTheta: 10000.0
πŸ“„ bosId: 1
πŸ“„ eosId: 2
πŸ•’ ropeCache: 16384 kB
⏩ Loaded 4142416 kB

After launching the job on 192.168.0.1, on server 192.168.0.2 I receive:

root@ollama-test-ai:/opt/distributed-llama# ./dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
πŸ’‘ sliceIndex: 1
πŸ’‘ nSlices: 2
πŸ•’ ropeCache: 16384 kB
⏩ Received 55584 kB for block 0 (382000 kB/s)
⏩ Received 55584 kB for block 1 (362535 kB/s)
⏩ Received 55584 kB for block 2 (369598 kB/s)
⏩ Received 55584 kB for block 3 (384581 kB/s)
⏩ Received 55584 kB for block 4 (384581 kB/s)
⏩ Received 55584 kB for block 5 (376940 kB/s)
⏩ Received 55584 kB for block 6 (369598 kB/s)
⏩ Received 55584 kB for block 7 (392538 kB/s)
⏩ Received 55584 kB for block 8 (406557 kB/s)
⏩ Received 55584 kB for block 9 (400831 kB/s)
⏩ Received 55584 kB for block 10 (415460 kB/s)
⏩ Received 55584 kB for block 11 (415460 kB/s)
⏩ Received 55584 kB for block 12 (372013 kB/s)
⏩ Received 55584 kB for block 13 (409482 kB/s)
⏩ Received 55584 kB for block 14 (400831 kB/s)
⏩ Received 55584 kB for block 15 (403674 kB/s)
⏩ Received 55584 kB for block 16 (421615 kB/s)
⏩ Received 55584 kB for block 17 (409482 kB/s)
⏩ Received 55584 kB for block 18 (412449 kB/s)
⏩ Received 55584 kB for block 19 (415460 kB/s)
⏩ Received 55584 kB for block 20 (415460 kB/s)
⏩ Received 55584 kB for block 21 (412449 kB/s)
⏩ Received 55584 kB for block 22 (412449 kB/s)
⏩ Received 55584 kB for block 23 (412449 kB/s)
⏩ Received 55584 kB for block 24 (415460 kB/s)
⏩ Received 55584 kB for block 25 (409482 kB/s)
⏩ Received 55584 kB for block 26 (392538 kB/s)
⏩ Received 55584 kB for block 27 (412449 kB/s)
⏩ Received 55584 kB for block 28 (409482 kB/s)
⏩ Received 55584 kB for block 29 (406557 kB/s)
⏩ Received 55584 kB for block 30 (323398 kB/s)
⏩ Received 55584 kB for block 31 (317978 kB/s)
terminate called after throwing an instance of 'ReadSocketException'
  what():  std::exception
Aborted (core dumped)

And on server 192.168.0.1 I got:

root@ollama-test:/opt/distributed-llama# nice -n -20 ./dllama inference --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998
πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 11008
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 32
πŸ’‘ vocabSize: 32000
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
πŸ’‘ ropeTheta: 10000.0
πŸ“„ bosId: 1
πŸ“„ eosId: 2
πŸ•’ ropeCache: 16384 kB
⏩ Loaded 4142416 kB
dllama: src/apps/dllama/dllama.cpp:15: void generate(Inference, SocketPool, Tokenizer, Sampler, AppArgs, TransformerSpec): Assertion `args->prompt != NULL' failed.
Aborted (core dumped)

Now my questions are: 1) Am I doing something wrong? 2) Should I expect to be given a prompt?

Using chat instead of inference, the master 192.168.0.1 starts writing things on its own:

root@ollama-test:/opt/distributed-llama# nice -n -20 ./dllama chat --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998
πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 11008
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 32
πŸ’‘ vocabSize: 32000
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
πŸ’‘ ropeTheta: 10000.0
πŸ“„ bosId: 1
πŸ“„ eosId: 2
πŸ•’ ropeCache: 16384 kB
⏩ Loaded 4142416 kB
πŸ’» Enter system prompt (optional):
πŸ‘± User: ubuntu
πŸ€– Assistant: ksam and devops Hello. I am using kubernetes with Kubespray. I am deploying 3 services and using kubernetes for load balancing. I have 3 nodes with 4 CPUs each. Kubernetes is using the nodes as worker nodes and load balancing between them. I have 200GB of ram in the nodes. I am running 3 services and 3 nodes. I have 200GB of ram on each of the nodes. I am using Ubuntu 20.04 with kubernetes. What is the best way to determine how much of the memory on each node is being used for kubernetes? Where can I find a good tutorial on how to deploy kubernetes on a production environment using devops? submitted by /u/fedorascream Kubernetes 1.20.12 released Kubernetes 1.20.12 has been released. Kubernetes 1.20.12 includes 36 fixes from 25 contributors. See the list of changes here. submitted by /u/hey_it_s_me_not_you This is a request to add an option to filter events by type. This would be very useful when doing event-driven development and keeping track of logs. I understand that a filter by type may be too broad, so I suggest to keep the type of the event but also include other parameters. For example: docker.containerd.runh.started docker.containerd.runh.container.created docker.containerd.runh.container.started docker.containerd.runh.container.stopped docker.containerd.runh.container.dead docker.containerd.runh.container.terminated docker.containerd.runh.container.deleted For the moment, I am using the filter by type option to filter only on type=container. This is not ideal since it filters out some events

On worker node 192.168.0.2:

root@ollama-test-ai:/opt/distributed-llama# ./dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
πŸ’‘ sliceIndex: 1
πŸ’‘ nSlices: 2
πŸ•’ ropeCache: 16384 kB
⏩ Received 55584 kB for block 0 (374461 kB/s)
⏩ Received 55584 kB for block 1 (382000 kB/s)
⏩ Received 55584 kB for block 2 (384581 kB/s)
⏩ Received 55584 kB for block 3 (403674 kB/s)
⏩ Received 55584 kB for block 4 (403674 kB/s)
⏩ Received 55584 kB for block 5 (403674 kB/s)
⏩ Received 55584 kB for block 6 (395264 kB/s)
⏩ Received 55584 kB for block 7 (400831 kB/s)
⏩ Received 55584 kB for block 8 (372013 kB/s)
⏩ Received 55584 kB for block 9 (372013 kB/s)
⏩ Received 55584 kB for block 10 (362535 kB/s)
⏩ Received 55584 kB for block 11 (374461 kB/s)
⏩ Received 55584 kB for block 12 (301154 kB/s)
⏩ Received 55584 kB for block 13 (317978 kB/s)
⏩ Received 55584 kB for block 14 (362535 kB/s)
⏩ Received 55584 kB for block 15 (372013 kB/s)
⏩ Received 55584 kB for block 16 (353528 kB/s)
⏩ Received 55584 kB for block 17 (415460 kB/s)
⏩ Received 55584 kB for block 18 (406557 kB/s)
⏩ Received 55584 kB for block 19 (412449 kB/s)
⏩ Received 55584 kB for block 20 (412449 kB/s)
⏩ Received 55584 kB for block 21 (372013 kB/s)
⏩ Received 55584 kB for block 22 (403674 kB/s)
⏩ Received 55584 kB for block 23 (379453 kB/s)
⏩ Received 55584 kB for block 24 (418515 kB/s)
⏩ Received 55584 kB for block 25 (415460 kB/s)
⏩ Received 55584 kB for block 26 (409482 kB/s)
⏩ Received 55584 kB for block 27 (412449 kB/s)
⏩ Received 55584 kB for block 28 (412449 kB/s)
⏩ Received 55584 kB for block 29 (403674 kB/s)
⏩ Received 55584 kB for block 30 (372013 kB/s)
⏩ Received 55584 kB for block 31 (389849 kB/s)
🚁 Socket is in non-blocking mode

All of this happens without giving me the possibility to interact on the master node, as I normally do with Ollama.

Can you guide me to make it work?

Regards

b4rtaz commented 1 month ago

The ./dllama inference command expects a prompt, e.g. --prompt "Hello world". We should have better error messages here.
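
So for your setup, something like this should work (same model, tokenizer and worker values as in your command above, with --prompt and --steps appended):

# your inference command plus a prompt and a step count
nice -n -20 ./dllama inference --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998 --prompt "Hello world" --steps 16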

fabgat commented 1 month ago

Thanks, so what should I use to get a prompt to interact with?

fabgat commented 1 month ago

These are other tests:

root@ollama-test:/opt/distributed-llama# nice -n -20 ./dllama chat --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998
πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 11008
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 32
πŸ’‘ vocabSize: 32000
πŸ’‘ seqLen: 2048
πŸ’‘ nSlices: 2
πŸ’‘ ropeTheta: 10000.0
πŸ“„ bosId: 1
πŸ“„ eosId: 2
πŸ•’ ropeCache: 16384 kB
⏩ Loaded 4142416 kB
πŸ’» Enter system prompt (optional):
πŸ‘± User: I need an apache configuration with two virtualhosts.
πŸ€– Assistant:

[INST] # [/INST]

[INST] I need a configuration that enables me to access both [/INST]

[INST] http://localhost/ [/INST]

[INST] and [/INST]

[INST] http://localhost/my_project [/INST]

[INST] When I access http://localhost/my_project/ I'll be redirected to http://localhost/ and vice-versa. [/INST]

[INST] Please note that I've got a couple of virtual hosts configured on my server already and I need the apache configuration to be added to theirs. [/INST]

[INST] Thanks in advance. [/INST]

πŸ‘± User: can you write nginx configuration to use https? πŸ€– Assistant:

[INST] you can write nginx configuration to use https: [/INST]

[INST] nginx configuration to use https: [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https: [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80? [/INST]

[INST] nginx configuration to use https. [/INST]

[INST] how can i make nginx listen to port 8080 instead of 80^C

b4rtaz commented 1 month ago

The chat command works only with Llama 3 right now. Currently I'm working on support for more models.

In the current version you can mostly compare the inference speed when choosing a different number of machines, etc.; for that you need to run the ./dllama inference ... command. This command will show you a benchmark at the end.
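
For example, you could run the same prompt once on the root machine alone and once with the worker attached, and compare the benchmark lines printed at the end (these reuse the model, tokenizer and worker address from this issue; as far as I know, omitting --workers runs everything on a single machine):

# root machine only
./dllama inference --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --prompt "Hello world" --steps 16

# root machine + one worker
./dllama inference --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998 --prompt "Hello world" --steps 16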

We are working on improving usability, so soon the API server should be quite useful. The same goes for the chat command.

Edit: I see you are trying Llama 2, so the chat command won't work for now.

b4rtaz commented 1 month ago

@fabgat the chat mode is much better now (0.9.1). You can test it with the Llama 3 8B Instruct model. You may download the converted model from here, or just run this command:

python launch.py llama3_8b_instruct_q40
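
Once the download finishes, you can point the chat command at the downloaded files. Assuming launch.py places them under models/llama3_8b_instruct_q40/ with the same naming pattern as the other models in this thread (the paths below are an assumption, so adjust them to wherever the files actually land), the invocation would look roughly like:

# paths assumed from the naming pattern of other converted models
./dllama chat --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruct_q40.m --tokenizer models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998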

drdsgvo commented 6 days ago

I got the same error as in the first post here: "...terminate called after throwing an instance of 'ReadSocketException'". I used the sample call given on the start page of this project:

./dllama inference --model ./models/tinyllama_1_1b_3t_q40/dllama_model_tinyllama_1_1b_3t_q40.m --tokenizer ./models/tinyllama_1_1b_3t_q40/dllama_tokenizer_tinyllama_1_1b_3t_q40.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers

Before that I downloaded the given model. I installed distributed-llama just now, on Ubuntu 22.

It seems as if the problem lies in distributed-llama itself.

The inference did happen. On the root node, the answer appeared (in a barely readable format). When starting the same command again, the worker fetches the whole model from the root again. This does not make any sense. Can this be avoided?

The API cannot be started. In the docs the command "dllama-api" is given, but it does not exist. I could not figure out how to start ./dllama api ... with valid parameters (according to the docs).

b4rtaz commented 6 days ago

@drdsgvo you need to be sure that all devices have enough RAM.
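
A quick sanity check is to compare the model file size with the free memory on every node, for example:

# size of the converted model file (example file name from this thread)
ls -lh models/tinyllama_1_1b_3t_q40/dllama_model_tinyllama_1_1b_3t_q40.m
# available memory on each node
free -h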

When starting the same command again, the worker is again fetching the whole model from the root. This does not make any sense. Can this be avoided?

This is something for future adjustments. Currently it works like this.

API cannot be started. In the doc. the command "dllama-api" is given.

You need to build dllama-api by calling: make dllama-api.
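
Once built, the binary should appear next to dllama. The argument set below is an assumption based on how dllama itself is invoked in this thread (including the --port flag), so check the docs for the exact flags:

# build the API server
make dllama-api
# assumed invocation, mirroring the dllama arguments used above; --port is an assumption
./dllama-api --model dllama_model_llama-2-7b_2.m --tokenizer dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.21.169:9998 --port 9999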

drdsgvo commented 5 days ago

Thank you for your reply!

"@drdsgvo you need to be sure that all devices have enough RAM."
The root has 16 GB VRAM, the worker has 20 GB VRAM. The model used is the tiny model, so this should be more than enough VRAM. If you mean RAM: there is so much RAM on the root and the worker that I could sell some and still have enough ;-)

"You need to build dllama-api by calling: make dllama-api."
OK, thank you. I seem to have missed that on the start page.