b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and increase inference speed.

dllama: src/commands.cpp:102: MultiHeadAttSlice::MultiHeadAttSlice(unsigned int, unsigned int, unsigned int, slice_index_t): Assertion `nHeads % nSlices == 0' failed. #98

Open EntusiastaIApy opened 3 months ago

EntusiastaIApy commented 3 months ago

Hello, @b4rtaz!

I'm trying to run the model nkpz/llama2-22b-chat-wizard-uncensored on a cluster composed of one Raspberry Pi 4B 8 GB and seven Raspberry Pi 4B 4 GB, but in both inference and chat modes distributed-llama throws the following error. Do you know why this is happening and how to fix it?

[screenshot: llama2-22b-chat-wizard-uncensored_q40_8nodes_switch_sdcard_inference-error]

b4rtaz commented 2 months ago

Hello @EntusiastaIApy,

I think the problem is that the model's config has `"num_attention_heads": 52`. The current implementation expects this number to be divisible by the number of nodes without remainder.

52 / 8 => 6 remainder 4

This is basically a bug.
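
For clarity, here is a minimal sketch of the constraint behind the failing assertion (not the actual `src/commands.cpp` code, just the check the error message reports): each node (slice) must receive an equal share of the attention heads, so `nHeads % nSlices` must be 0.

```cpp
#include <cassert>

// Sketch only: each slice (node) gets an equal number of attention heads,
// so the head count must be divisible by the number of slices.
static unsigned int headsPerSlice(unsigned int nHeads, unsigned int nSlices) {
    assert(nHeads % nSlices == 0); // 52 % 8 == 4, so 8 nodes trip this assertion
    return nHeads / nSlices;
}

int main() {
    unsigned int h = headsPerSlice(52, 4); // 52 % 4 == 0 -> 13 heads per node, passes this check
    (void) h;
    return 0;
}
```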

Different-Pranav commented 3 weeks ago

I am facing a similar issue. I am trying to run TinyLlama in the dllama environment with 2 worker nodes of 8 GB RAM each, but it throws a similar error. [screenshot: Screenshot 2024-09-13 200436]

b4rtaz commented 3 weeks ago

@Different-Pranav you are using 3 nodes (root + 2 workers). You should try with 2 nodes (1 root + 1 worker) or 4 nodes (1 root + 3 workers).
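
As a quick sanity check before launching, a small helper like the one below (hypothetical, not part of dllama) can test a node count against the model's `num_attention_heads` from its config.json. It assumes TinyLlama reports 32 attention heads and that the node count must also be a power of two, per the project's documentation; verify both against your model and README.

```cpp
#include <cstdio>

// Hypothetical helper: a node count is usable if it is a power of two
// (assumed requirement) and divides the attention head count evenly
// (the assertion that fails in this issue).
static bool isValidNodeCount(unsigned int nHeads, unsigned int nNodes) {
    bool powerOfTwo = nNodes != 0 && (nNodes & (nNodes - 1)) == 0;
    return powerOfTwo && nHeads % nNodes == 0;
}

int main() {
    // Assuming TinyLlama's config.json has "num_attention_heads": 32.
    printf("3 nodes: %s\n", isValidNodeCount(32, 3) ? "ok" : "fails"); // fails
    printf("2 nodes: %s\n", isValidNodeCount(32, 2) ? "ok" : "fails"); // ok
    printf("4 nodes: %s\n", isValidNodeCount(32, 4) ? "ok" : "fails"); // ok
    return 0;
}
```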