b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

what(): The tokenizer does not include chat template #97

Closed EntusiastaIApy closed 2 days ago

EntusiastaIApy commented 1 week ago

Hello, @b4rtaz!

I'm running distributed-llama on a cluster composed of one Raspberry Pi 4B 8 GB and seven Raspberry Pi 4B 4 GB.

I've successfully converted and run the model ajibawa-2023/Uncensored-Jordan-13B in inference mode, obtaining the following results.

[screenshot: Uncensored-Jordan-13B_q40_8nodes_switch_sdcard]
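
For anyone reproducing this setup: a run like the one above is launched roughly as follows. The IPs, ports, file names, step count, and thread counts are illustrative placeholders, and the flags follow the project README, so check your version's help output.

# on each of the seven worker Pis
./dllama worker --port 9998 --nthreads 4

# on the root Pi (8 GB), pointing at the seven workers
./dllama inference --model dllama_model_uncensored-jordan-13b_q40.m \
  --tokenizer dllama_tokenizer_uncensored-jordan-13b.t \
  --buffer-float-type q80 --nthreads 4 \
  --prompt "Hello" --steps 64 \
  --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998 10.0.0.5:9998 \
    10.0.0.6:9998 10.0.0.7:9998 10.0.0.8:9998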

But I'm not able to run the same model in chat mode, as it throws the error below. Is there a way to get around this error so I can use this model in chat mode?

[screenshot: Uncensored-Jordan-13B_q40_8nodes_switch_sdcard_chat-error]

b4rtaz commented 2 days ago

Hello @EntusiastaIApy, you can try the new CLI argument --chat-template (available since 0.9.2). It lets you specify the chat template manually when the converted tokenizer file doesn't include one.

Usage:

./dllama-api ... --chat-template llama3
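
A fuller invocation might look something like this; the model and tokenizer paths, worker addresses, and thread count are placeholders for your own setup, with flag names as in the project README:

./dllama-api --model dllama_model_uncensored-jordan-13b_q40.m \
  --tokenizer dllama_tokenizer_uncensored-jordan-13b.t \
  --buffer-float-type q80 \
  --nthreads 4 \
  --workers 10.0.0.2:9998 10.0.0.3:9998 \
  --chat-template llama3

Since Uncensored-Jordan-13B appears to be Llama-2-based, a llama2 template (if your version supports it) may match the model's prompt format better than llama3; check the CLI help output for the accepted template names.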