dot-ammar opened 4 months ago
Hmm... weird. I cannot reproduce it on my Mac (root node + 3 workers, model: dllama_model_llama3_8b_instruct_q40.m, tokenizer: models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t).
Could you try to run DL only on the Intel Mac with 16 GB? You could check two configurations: 1 root node only, and 1 root node + 3 workers on the same device.
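For reference, a minimal sketch of the second configuration on a single machine could look like the following (the ports, thread counts, and use of 127.0.0.1 are placeholders I'm assuming, not values from the report):

# start 3 workers in the background, each on its own port
./dllama worker --port 9998 --nthreads 2 &
./dllama worker --port 9999 --nthreads 2 &
./dllama worker --port 10000 --nthreads 2 &

# then start the root node and point it at the local workers
./dllama chat \
  --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruct_q40.m \
  --tokenizer models/llama3_8b_instruct_q40/dllama_tokenizer_llama3_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 2 \
  --workers 127.0.0.1:9998 127.0.0.1:9999 127.0.0.1:10000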
Yeah that produces the same issue
I was able to test it with a different device as the root node and it worked fine; it might be an issue with some corporate setup. The new root node also has more memory, which could be the reason. If I test it with another device that has memory similar to my Mac, I'll update you.
Thanks for the help though. This is a really awesome tool btw; I wonder how it will perform for Llama 3.1 405B.
I met the same issue a few days ago. A segmentation fault is usually caused by an invalid pointer, so you can use gdb together with a ulimit setting to locate the bad pointer's position, or use printf to narrow it down. It's not caused by one specific error but by any wrong pointer; in my case it was a pointer running past a buffer. :-) Setting ulimit -c (e.g. ulimit -c 1000) will produce a core file when a segmentation fault happens, and you can read this file with gdb to find the position of the wrong pointer. Good luck.
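A minimal sketch of that workflow (the core file name depends on the system's core_pattern, and the command arguments here are just placeholders):

# allow core dumps in the current shell, then reproduce the crash
ulimit -c unlimited
./dllama inference --model <model.m> --tokenizer <tokenizer.t> --nthreads 4

# open the resulting core file with gdb and print a backtrace
# to see where the bad pointer was dereferenced
gdb ./dllama core
(gdb) bt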
I wanted to check if this problem occurs on Linux, but it seems it does not.
@b4rtaz ➜ /workspaces/distributed-llama (main) $ uname -a
Linux codespaces-32fe28 6.5.0-1022-azure #23~22.04.1-Ubuntu SMP Thu May 9 17:59:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Llama 3.1 8B Q40:
Loaded 6175568 kB
chat template: llama3
stop: <|eot_id|>
stop: <|end_of_text|>
System prompt (optional):
User
> hello
Assistant
hello back! How can I assist you today?
User
> where is the largest volcano?
Assistant
that's a big question! There are many large volcanoes around the world, but some of the largest and most well-known ones include:

Mauna Loa, Hawaii, USA: It's the largest volcano on Earth in terms of size and volume, with a total volume of around 75,000 cubic kilometers.

Mauna Kea, Hawaii, USA: This shield volcano is the tallest mountain in the world, with a height of 4,207 meters
Llama 3 8B Q40:
Loaded 6175568 kB
chat template: llama3
stop: <|eot_id|>
System prompt (optional):
User
> hello
Assistant
Hello! I'm happy to help you with any questions or topics you'd like to discuss. What's on your mind today?
User
> what is 1+4?
Assistant
That's an easy one! The answer is 5.
It's definitely not a simple case.
I suspect the EosDetector class. @dot-ammar, are you able to compile and run tokenizer-test?
make tokenizer-test
./tokenizer-test
EDIT: In the meantime I found a tiny bug; maybe it was related.
Hello. I have this issue: terminate called after throwing an instance of 'ReadSocketException' what(): std::exception
Frankly, I think that is because I have disabled IPv6. What is your point of view, please?
@lipere123 could you please provide a bit more context? What model are you trying to run, and how much RAM does each device have?
Hello.
Thank you very much for your quick answer. I have a supercomputer: 6 nodes, 96 cores, 768 GB RAM, 6 PNY Nvidia RTX 4000 Ada Generation. I run Ubuntu 24.04 on the host, and the cluster is on LXC Ubuntu 22.04.
Network is a passthrough like that:
eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 7000
        inet 192.168.16.22  netmask 255.255.255.0  broadcast 192.168.16.255
        ether e8:ea:6a:03:3a:06  txqueuelen 1000  (Ethernet)
        RX packets 220318748  bytes 319104638952 (319.1 GB)
        RX errors 1  dropped 6395  overruns 0  frame 1
        TX packets 30638671  bytes 8062889108 (8.0 GB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
The distributed storage is Ceph, mounted via hostpath LXC; ceph-cli will come with an updated version of my infrastructure code when I have the time. The models are linked to a folder on Ceph via ln.
I have a minimal install on the master. Here are the scripts: install_dllama_master.sh
cd /usr/local/
/bin/rm -Rf /usr/local/distributed-llama/ /usr/local/bin/dllama
/usr/bin/git clone https://github.com/b4rtaz/distributed-llama.git
cd /usr/local/distributed-llama/
/usr/bin/make dllama
/bin/sleep 2
/bin/ln -s /usr/local/distributed-llama/dllama /usr/local/bin/dllama
/bin/ln -s /lxdubu/share/dllama-models/ /usr/local/distributed-llama/models
exit 0
install_dllama.sh
cd /usr/local/
/bin/rm -Rf /usr/local/distributed-llama/ /usr/local/bin/dllama
/usr/bin/git clone https://github.com/b4rtaz/distributed-llama.git
cd /usr/local/distributed-llama/
/usr/bin/make dllama
/bin/sleep 2
/bin/ln -s /usr/local/distributed-llama/dllama /usr/local/bin/dllama
/bin/cp /share/llama/dllama-start.sh /opt/dllama-start.sh
/bin/chmod +x /opt/dllama-start.sh
/bin/sleep 2
/bin/rm -Rf /share/dllama-models/ /usr/local/distributed-llama/models/
/bin/mkdir -p /share/dllama-models/
/bin/ln -s /share/dllama-models/ /usr/local/distributed-llama/models
exit 0
dllama-models.sh
cd /usr/local/distributed-llama/
/usr/local/llm/bin/python3 /usr/local/distributed-llama/launch.py llama3_1_8b_instruct_q40
/usr/local/llm/bin/python3 /usr/local/distributed-llama/launch.py llama3_1_405b_instruct_q40
dllama-inference-llama3_1_8b.sh
cd /usr/local/distributed-llama/
./dllama inference \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 64 \
  --nthreads 4 \
  --workers 192.168.16.22:9998 192.168.16.32:9998 192.168.16.42:9998 192.168.16.52:9998 192.168.16.62:9998 192.168.16.72:9998
dllama-run.sh
cd /root/
myhost=$(/bin/cat /etc/hostname | /usr/bin/tail -n 1 | /usr/bin/tr -d '\r\n')
/usr/local/distributed-llama/dllama worker --port 9998 --nthreads 8 > /apps/logs/$myhost-dllama.log 2>&1
./dllama-inference-llama3_1_8b.sh
arch: llama
hiddenAct: silu
dim: 4096
hiddenDim: 14336
nLayers: 32
nHeads: 32
nKvHeads: 8
vocabSize: 128256
seqLen: 131072
nSlices: 7
ropeTheta: 500000.0
bosId: 128000
eosId: 128009
chatEosId: 128009
dllama: src/commands.cpp:98: KvCacheSlice::KvCacheSlice(unsigned int, unsigned int, unsigned int): Assertion `kvDim % nSlices == 0' failed.
./dllama-inference-llama3_1_8b.sh: line 10: 2867142 Aborted (core dumped) ./dllama inference --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --prompt "Hello world" --steps 64 --nthreads 4 --workers 192.168.16.22:9998 192.168.16.32:9998 192.168.16.42:9998 192.168.16.52:9998 192.168.16.62:9998 192.168.16.72:9998
In my workers' log: terminate called after throwing an instance of 'ReadSocketException' what(): std::exception
Thanks again. Best Regards. Benjamin.
@lipere123 in this case I think the reason is that you are trying to run 7 nodes:
dllama: src/commands.cpp:98: KvCacheSlice::KvCacheSlice(unsigned int, unsigned int, unsigned int): Assertion 'kvDim % nSlices == 0' failed.
Distributed Llama supports 1, 2, 4, 8, 16... nodes (the maximum equals nKvHeads). So you should try with 4 (1 root + 3 workers) or 8 (1 root + 7 workers) nodes.
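For this model that works out as follows (the kvDim formula below is my assumption based on the usual Llama grouped-query layout, kvDim = dim / nHeads * nKvHeads, not something quoted from the log):

dim=4096; nHeads=32; nKvHeads=8
kvDim=$(( dim / nHeads * nKvHeads ))   # 128 * 8 = 1024
echo $(( kvDim % 7 ))   # 2 -> assertion fails with 7 nodes
echo $(( kvDim % 4 ))   # 0 -> OK with 4 nodes (1 root + 3 workers)
echo $(( kvDim % 8 ))   # 0 -> OK with 8 nodes (1 root + 7 workers)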
Hello.
Okay, now it is working for 8B. The workers are shutting down after inference, so I have to restart them for now. Is that a bug?
Also for 405B:
arch: llama
hiddenAct: silu
dim: 16384
hiddenDim: 53248
nLayers: 126
nHeads: 128
nKvHeads: 16
vocabSize: 128256
seqLen: 131072
nSlices: 4
ropeTheta: 500000.0
bosId: 128000
eosId: 128009
chatEosId: 128009
./dllama-inference2-llama3_1_405b.sh: line 11: 141226 Killed ./dllama inference --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t --buffer-float-type q80 --prompt "$@" --steps 64 --nthreads 4 --workers 192.168.16.52:9999 192.168.16.62:9999 192.168.16.72:9999 --kv-cache-storage disk
cd /usr/local/distributed-llama/
./dllama inference \
  --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
  --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
  --buffer-float-type q80 \
  --prompt "$@" \
  --steps 64 \
  --nthreads 4 \
  --workers 192.168.16.52:9999 192.168.16.62:9999 192.168.16.72:9999 \
  --kv-cache-storage disk

/usr/bin/pdsh -w root@edgenode[5-7] /opt/dllama-start.sh
cd /root/
myhost=$(/bin/cat /etc/hostname | /usr/bin/tail -n 1 | /usr/bin/tr -d '\r\n')
/usr/local/distributed-llama/dllama worker --port 9999 --nthreads 8 --kv-cache-storage disk > /apps/logs/$myhost-dllama.log 2>&1
A few questions:
Thanks in advance. Best Regards. Benjamin.
@lipere123 try to run 405B with a smaller context: --max-seq-len 1024.
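For example, the flag can be appended to the 405B invocation from dllama-inference2-llama3_1_405b.sh above (a sketch; the other arguments stay unchanged):

./dllama inference \
  --model models/llama3_1_405b_instruct_q40/dllama_model_llama3_1_405b_instruct_q40.m \
  --tokenizer models/llama3_1_405b_instruct_q40/dllama_tokenizer_llama3_1_405b_instruct_q40.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 64 \
  --nthreads 4 \
  --max-seq-len 1024 \
  --workers 192.168.16.52:9999 192.168.16.62:9999 192.168.16.72:9999 \
  --kv-cache-storage disk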
The workers are shutting down after inference, so I have to restart them for now. Is that a bug?
Yes, this could be improved. For now, the main goal of the dllama inference command is benchmarking. To use DL for anything else I would recommend dllama-api.
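A rough sketch of that, reusing the arguments from the 8B script above with a 4-node setup (building the binary via make dllama-api and the --port value are my assumptions, so check the repo's README for the exact invocation):

cd /usr/local/distributed-llama/
make dllama-api
# 1 root + 3 workers = 4 nodes, which satisfies the node-count rule above
./dllama-api \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --port 9990 \
  --workers 192.168.16.22:9998 192.168.16.32:9998 192.168.16.42:9998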
Where are you on CUDA?
Not started.
A --text option for reading the prompt from a text file is missing; can we add it?
Feel free to create a PR.
Is Open WebUI compatible with dllama-api? If yes, can you give me the procedure?
I don't know, but dllama-api has the same format as the OpenAI API. To run dllama-api you should use the same arguments as for dllama inference, but run dllama-api ... instead.
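If it follows the OpenAI chat completions format, a request could look roughly like this (the port and the /v1/chat/completions path are assumptions on my side):

curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 64
  }'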
Instead of a JS script, do you have a Python script, for example one using the OpenAI API module or the Ollama API module?
Nope.
With Llama 3 8B, inference works; however, api and chat do not. They produce a segmentation fault.
Workers terminate like:
I'm using 3 Linux machines with 8 GB of RAM each, and an Intel Mac with 16 GB as the root node.
This all works fine with TinyLlama, though.