b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.

Segmentation Fault #105

Open dot-ammar opened 1 month ago

dot-ammar commented 1 month ago

With Llama 3 8B, inference works; however, api and chat do not. They produce a segmentation fault.

sudo nice ./dllama chat --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruct_q40.m --tokenizer models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t --buffer-float-type q80 --nthreads 4  --workers 192.168.0.172:9998 192.168.0.166:9998 192.168.0.163:9998
πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 14336
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 8
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 8192
πŸ’‘ nSlices: 4
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128001
πŸ“„ chatEosId: 128009
πŸ•’ ropeCache: 32768 kB
⏩ Loaded 6175568 kB
⭐ chat template: llama3
πŸ›‘ stop: <|eot_id|>
πŸ’» System prompt (optional):

πŸ‘± User 
> hello
# it pauses here for a while, cpu usage on all workers is nearly 100%
πŸ€– Assistant 
zsh: segmentation fault  sudo nice ./dllama chat --model  --tokenizer  --buffer-float-type q80  4

The workers terminate like this:

sudo nice ./dllama worker --port 9998 --nthreads 4
Listening on 0.0.0.0:9998...
πŸ’‘ sliceIndex: 1
πŸ’‘ nSlices: 4
πŸ•’ ropeCache: 57344 kB
⏩ Received 29952 kB for block 0 (2038 kB/s)
⏩ Received 29952 kB for block 1 (31783 kB/s)
⏩ Received 29952 kB for block 2 (31554 kB/s)
⏩ Received 29952 kB for block 3 (31783 kB/s)
⏩ Received 29952 kB for block 4 (36470 kB/s)
⏩ Received 29952 kB for block 5 (36211 kB/s)
⏩ Received 29952 kB for block 6 (36083 kB/s)
⏩ Received 29952 kB for block 7 (35830 kB/s)
⏩ Received 29952 kB for block 8 (35213 kB/s)
⏩ Received 29952 kB for block 9 (35335 kB/s)
⏩ Received 29952 kB for block 10 (35664 kB/s)
⏩ Received 29952 kB for block 11 (33230 kB/s)
⏩ Received 29952 kB for block 12 (35540 kB/s)
⏩ Received 29952 kB for block 13 (34307 kB/s)
⏩ Received 29952 kB for block 14 (32251 kB/s)
⏩ Received 29952 kB for block 15 (30671 kB/s)
⏩ Received 29952 kB for block 16 (26881 kB/s)
⏩ Received 29952 kB for block 17 (26036 kB/s)
⏩ Received 29952 kB for block 18 (25037 kB/s)
⏩ Received 29952 kB for block 19 (29577 kB/s)
⏩ Received 29952 kB for block 20 (37726 kB/s)
⏩ Received 29952 kB for block 21 (37633 kB/s)
⏩ Received 29952 kB for block 22 (36041 kB/s)
⏩ Received 29952 kB for block 23 (36340 kB/s)
⏩ Received 29952 kB for block 24 (35830 kB/s)
⏩ Received 29952 kB for block 25 (31297 kB/s)
⏩ Received 29952 kB for block 26 (32456 kB/s)
⏩ Received 29952 kB for block 27 (32150 kB/s)
⏩ Received 29952 kB for block 28 (31587 kB/s)
⏩ Received 29952 kB for block 29 (32979 kB/s)
⏩ Received 29952 kB for block 30 (34774 kB/s)
⏩ Received 29952 kB for block 31 (29662 kB/s)
🚁 Socket is in non-blocking mode
terminate called after throwing an instance of 'ReadSocketException'
  what():  std::exception
Aborted

I'm using 3 Linux machines with 8 GB of RAM each, and an Intel Mac with 16 GB as the root node.

This all works fine with TinyLlama, though.

b4rtaz commented 1 month ago

Hmm... weird. I cannot reproduce it on my Mac (root node + 3 workers, model: dllama_model_llama3_8b_instruct_q40.m, tokenizer: models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t).

[Screenshot 2024-07-23 at 17 31 10]

Could you try to run Distributed Llama only on the Intel Mac with 16 GB? You could check 2 configurations: 1) root node only, and 2) 1 root node + 3 workers on the same device, as sketched below.
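For reference, a minimal sketch of those two configurations, reusing the model and tokenizer paths from the report above; the loopback addresses and per-process thread counts are illustrative assumptions, not tested values:

# Configuration 1: root node only (no --workers argument).
./dllama chat \
  --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruct_q40.m \
  --tokenizer models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4

# Configuration 2: 1 root node + 3 workers, all on the same device.
./dllama worker --port 9998 --nthreads 1 &
./dllama worker --port 9997 --nthreads 1 &
./dllama worker --port 9996 --nthreads 1 &
./dllama chat \
  --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruct_q40.m \
  --tokenizer models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 1 \
  --workers 127.0.0.1:9998 127.0.0.1:9997 127.0.0.1:9996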

dot-ammar commented 1 month ago

Yeah, that produces the same issue.

I was able to test it with a different device as the root node and it worked fine, so it might be an issue with some corporate setup. The new root node also has more memory, which could be the reason. If I test it with another device that has memory similar to my Mac's, I'll update you.

Thanks for the help though. This is a really awesome tool, btw; I wonder how it will perform with Llama 3.1 405B πŸ‘€

fromthefox commented 1 month ago

I ran into the same issue a few days ago. A segmentation fault is usually caused by an invalid pointer, so you can use gdb together with a ulimit setting to locate the bad pointer's position, or you can use printf debugging to find it. It isn't tied to one specific error; any bad pointer can trigger it. In my case it was a pointer running past the end of a buffer. :-) Setting ulimit -c 1000 (or similar) makes the process produce a core file when it hits a segmentation fault, and you can read that file with gdb to find the position of the bad pointer. Good luck!
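A rough sketch of that workflow in a Linux shell (the core file name depends on the system's core_pattern setting, and line-level detail in gdb requires a binary built with -g debug symbols; both are assumptions here):

ulimit -c unlimited                        # allow core dumps in this shell (or e.g. `ulimit -c 1000` as suggested above)
./dllama chat --model ... --tokenizer ... --buffer-float-type q80 --nthreads 4   # reproduce the crash with the same arguments as before
# After the segmentation fault, open the core dump and print the backtrace:
gdb --batch -ex "bt" ./dllama core         # the file may also be named core.<pid>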

b4rtaz commented 1 month ago

I wanted to check if this problem occurs on Linux, but it seems it does not.

@b4rtaz ➜ /workspaces/distributed-llama (main) $ uname -a
Linux codespaces-32fe28 6.5.0-1022-azure #23~22.04.1-Ubuntu SMP Thu May  9 17:59:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Llama 3.1 8B Q40:

⏩ Loaded 6175568 kB
⭐ chat template: llama3
πŸ›‘ stop: <|eot_id|>
πŸ›‘ stop: <|end_of_text|>
πŸ’» System prompt (optional): 

πŸ‘± User
> hello

πŸ€– Assistant
hello back! How can I assist you today?
πŸ‘± User
> where is the largest volcano?

πŸ€– Assistant
that's a big question! There are many large volcanoes around the world, but some of the largest and most well-known ones include:ĊĊMauna Loa, Hawaii, USA: It's the largest volcano on Earth in terms of size and volume, with a total volume of around 75,000 cubic kilometers.ĊĊMauna Kea, Hawaii, USA: This shield volcano is the tallest mountain in the world, with a height of 4,207 meters 

Llama 3 8B Q40:

⏩ Loaded 6175568 kB
⭐ chat template: llama3
πŸ›‘ stop: <|eot_id|>
πŸ’» System prompt (optional): 

πŸ‘± User
> hello

πŸ€– Assistant
Hello! I'm happy to help you with any questions or topics you'd like to discuss. What's on your mind today?
πŸ‘± User
> what is 1+4?

πŸ€– Assistant
That's an easy one! The answer is 5.

It's definitely not a simple case.

b4rtaz commented 1 month ago

I suspect the EosDetector class. @dot-ammar are you able to compile and run tokenizer-test?

make tokenizer-test
./tokenizer-test

EDIT: In the meantime I found a tiny bug; maybe it was related.

lipere123 commented 1 month ago

Hello. I have this issue: terminate called after throwing an instance of 'ReadSocketException' what(): std::exception

Frankly, I think that is because I have disabled IPv6. What is your point of view, please?

b4rtaz commented 1 month ago

@lipere123 could you please provide a bit more context? What model are you trying to run, and how much RAM does each device have?

lipere123 commented 1 month ago

Hello.

Thank you very much for your quick answer. I have a supercomputer: 6 nodes, 96 cores, 768 GB RAM, and 6 PNY NVIDIA RTX 4000 Ada Generation GPUs. I run Ubuntu 24.04 on the host, and the cluster runs in LXC containers with Ubuntu 22.04.

The network is a passthrough, like this:

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7000
        inet 192.168.16.22  netmask 255.255.255.0  broadcast 192.168.16.255
        ether e8:ea:6a:03:3a:06  txqueuelen 1000  (Ethernet)
        RX packets 220318748  bytes 319104638952 (319.1 GB)
        RX errors 1  dropped 6395  overruns 0  frame 1
        TX packets 30638671  bytes 8062889108 (8.0 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

The distributed storage is Ceph, mounted via an LXC hostpath; the ceph-cli will come with an updated version of my infrastructure code when I have the time. The models are linked to a folder on Ceph via ln.

I have a minimal install on the master. Here are the scripts:

install_dllama_master.sh

#!/bin/bash

cd /usr/local/
/bin/rm -Rf /usr/local/distributed-llama/ /usr/local/bin/dllama
/usr/bin/git clone https://github.com/b4rtaz/distributed-llama.git
cd /usr/local/distributed-llama/
/usr/bin/make dllama
/bin/sleep 2
/bin/ln -s /usr/local/distributed-llama/dllama /usr/local/bin/dllama
/bin/ln -s /lxdubu/share/dllama-models/ /usr/local/distributed-llama/models

# End of script
exit 0

install_dllama.sh

#!/bin/bash

cd /usr/local/
/bin/rm -Rf /usr/local/distributed-llama/ /usr/local/bin/dllama
/usr/bin/git clone https://github.com/b4rtaz/distributed-llama.git
cd /usr/local/distributed-llama/
/usr/bin/make dllama
/bin/sleep 2
/bin/ln -s /usr/local/distributed-llama/dllama /usr/local/bin/dllama
/bin/cp /share/llama/dllama-start.sh /opt/dllama-start.sh
/bin/chmod +x /opt/dllama-start.sh
/bin/sleep 2
/bin/rm -Rf /share/dllama-models/ /usr/local/distributed-llama/models/
/bin/mkdir -p /share/dllama-models/
/bin/ln -s /share/dllama-models/ /usr/local/distributed-llama/models

# End of script
exit 0

dllama-models.sh

#!/bin/bash

cd /usr/local/distributed-llama/
/usr/local/llm/bin/python3 /usr/local/distributed-llama/launch.py llama3_1_8b_instruct_q40
/usr/local/llm/bin/python3 /usr/local/distributed-llama/launch.py llama3_1_405b_instruct_q40

dllama-inference-llama3_1_8b.sh

#!/bin/bash

cd /usr/local/distributed-llama/
./dllama inference \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 64 \
  --nthreads 4 \
  --workers 192.168.16.22:9998 192.168.16.32:9998 192.168.16.42:9998 192.168.16.52:9998 192.168.16.62:9998 192.168.16.72:9998

dllama-run.sh

#!/bin/bash

cd /root/
myhost=$(/bin/cat /etc/hostname | /usr/bin/tail -n 1 | /usr/bin/tr -d '\r\n')
/usr/local/distributed-llama/dllama worker --port 9998 --nthreads 8 > /apps/logs/$myhost-dllama.log 2>&1

./dllama-inference-llama3_1_8b.sh
πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 4096
πŸ’‘ hiddenDim: 14336
πŸ’‘ nLayers: 32
πŸ’‘ nHeads: 32
πŸ’‘ nKvHeads: 8
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 131072
πŸ’‘ nSlices: 7
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128009
πŸ“„ chatEosId: 128009
dllama: src/commands.cpp:98: KvCacheSlice::KvCacheSlice(unsigned int, unsigned int, unsigned int): Assertion `kvDim % nSlices == 0' failed.
./dllama-inference-llama3_1_8b.sh: line 10: 2867142 Aborted (core dumped) ./dllama inference --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --prompt "Hello world" --steps 64 --nthreads 4 --workers 192.168.16.22:9998 192.168.16.32:9998 192.168.16.42:9998 192.168.16.52:9998 192.168.16.62:9998 192.168.16.72:9998

In my workers' logs: terminate called after throwing an instance of 'ReadSocketException' what(): std::exception

Thanks again. Best Regards. Benjamin.

b4rtaz commented 1 month ago

@lipere123 in this case I think the reason is that you are trying to run 7 nodes:

dllama: src/commands.cpp:98: KvCacheSlice::KvCacheSlice(unsigned int, unsigned int, unsigned int): Assertion 'kvDim % nSlices == 0' failed.

Distributed Llama supports 1, 2, 4, 8, 16... nodes (the maximum equals nKvHeads). So you should try with 4 (1 root + 3 workers) or 8 (1 root + 7 workers) nodes.
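For context, a rough sketch of the arithmetic behind that assertion, assuming kvDim = nKvHeads * headSize with headSize = dim / nHeads = 4096 / 32 = 128 for this model (the exact computation in src/commands.cpp may differ):

nKvHeads=8; headSize=128
kvDim=$((nKvHeads * headSize))            # 1024 for Llama 3 / 3.1 8B
for nSlices in 2 4 7 8; do
  if [ $((kvDim % nSlices)) -eq 0 ]; then
    echo "nSlices=$nSlices ok"
  else
    echo "nSlices=$nSlices fails kvDim % nSlices == 0"   # 7 nodes hits this branch
  fi
done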

lipere123 commented 1 month ago

Hello.

Okay, now it is working for 8B. The workers are shutting down after inference, so I have to restart them for now. Is that a bug?

Also, for 405B:

πŸ’‘ arch: llama
πŸ’‘ hiddenAct: silu
πŸ’‘ dim: 16384
πŸ’‘ hiddenDim: 53248
πŸ’‘ nLayers: 126
πŸ’‘ nHeads: 128
πŸ’‘ nKvHeads: 16
πŸ’‘ vocabSize: 128256
πŸ’‘ seqLen: 131072
πŸ’‘ nSlices: 4
πŸ’‘ ropeTheta: 500000.0
πŸ“„ bosId: 128000
πŸ“„ eosId: 128009
πŸ“„ chatEosId: 128009
./dllama-inference2-llama3_1_405b.sh: line 11: 141226 Killed ./dllama inference --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t --buffer-float-type q80 --prompt "$@" --steps 64 --nthreads 4 --workers 192.168.16.52:9999 192.168.16.62:9999 192.168.16.72:9999 --kv-cache-storage disk

#!/bin/bash

cd /usr/local/distributed-llama/
./dllama inference \
  --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
  --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
  --buffer-float-type q80 \
  --prompt "$@" \
  --steps 64 \
  --nthreads 4 \
  --workers 192.168.16.52:9999 192.168.16.62:9999 192.168.16.72:9999 \
  --kv-cache-storage disk

/usr/bin/pdsh -w root@edgenode[5-7] /opt/dllama-start.sh

#!/bin/bash

cd /root/
myhost=$(/bin/cat /etc/hostname | /usr/bin/tail -n 1 | /usr/bin/tr -d '\r\n')
/usr/local/distributed-llama/dllama worker --port 9999 --nthreads 8 --kv-cache-storage disk > /apps/logs/$myhost-dllama.log 2>&1

A few questions:

- Where are you on CUDA?
- The --text prompt for reading a text file is missing, can we add it?
- Is Open WebUI compatible with dllama-api? If yes, can you give me the procedure?
- Instead of a JS script, do you have a Python script, for example using the OpenAI API module or the Ollama API module?

Thanks in advance. Best Regards. Benjamin.

b4rtaz commented 1 month ago

@lipere123 try to run the 405B model with a smaller context: --max-seq-len 1024.
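For example, a sketch based on the 405B script above with the smaller context appended (placing the flag at the end is an assumption; the other arguments are copied unchanged):

./dllama inference \
  --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
  --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 64 \
  --nthreads 4 \
  --workers 192.168.16.52:9999 192.168.16.62:9999 192.168.16.72:9999 \
  --kv-cache-storage disk \
  --max-seq-len 1024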

> The workers are shutting down after inference, so I have to restart them for now. Is that a bug?

Yes, this could be improved. Right now the main goal of the dllama inference command is benchmarking. To use Distributed Llama for anything else, I would recommend dllama-api.

> Where are you on CUDA?

Not started.

> The --text prompt for reading a text file is missing, can we add it?

Feel free to create a PR.

> Is Open WebUI compatible with dllama-api? If yes, can you give me the procedure?

I don't know, but dllama-api uses the same format as the OpenAI API. To run dllama-api you should use the same arguments as for dllama inference, but run dllama-api ... instead.
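As a rough illustration (not a confirmed recipe), reusing the 8B arguments from earlier in the thread; the ./dllama-api binary name comes from this comment, while the --port flag, the 9990 port, and the /v1/chat/completions path are assumptions based on the OpenAI-style format mentioned above:

# Start the API server with the same model/tokenizer/worker arguments.
./dllama-api \
  --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruct_q40.m \
  --tokenizer models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t \
  --buffer-float-type q80 --nthreads 4 \
  --workers 192.168.0.172:9998 192.168.0.166:9998 192.168.0.163:9998 \
  --port 9990                            # assumed flag; check the command's help output

# Send an OpenAI-style chat completion request (endpoint path assumed from the OpenAI format).
curl http://127.0.0.1:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'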

> Instead of a JS script, do you have a Python script, for example using the OpenAI API module or the Ollama API module?

Nope.