b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.

dllama-api fails with "what(): Invalid tokenizer file" #86

Closed · unclemusclez closed 1 month ago

unclemusclez commented 1 month ago
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid tokenizer file
Aborted
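
For context, this error is thrown while parsing the tokenizer file header. A minimal sketch of that kind of check follows; the function name, magic constant, and file layout here are hypothetical, not the project's actual code:

#include <cstdio>
#include <stdexcept>

// Hypothetical sketch: reject a tokenizer file whose header magic does
// not match what this build expects. Because the loader is compiled
// into each binary separately, a stale binary can reject a tokenizer
// file that a freshly rebuilt one accepts.
void loadTokenizer(const char* path) {
    FILE* file = fopen(path, "rb");
    if (file == nullptr)
        throw std::runtime_error("Cannot open tokenizer file");
    unsigned int magic;
    if (fread(&magic, sizeof(magic), 1, file) != 1 || magic != 0x544F4B5A) { // hypothetical magic value
        fclose(file);
        throw std::runtime_error("Invalid tokenizer file");
    }
    // ... read vocab size, token strings, and scores here ...
    fclose(file);
}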

dllama chat works fine:

ubuntu@ubuntu:~/distributed-llama$ sudo nice -n -20 ./dllama chat --model models/TinyLlama-1.1B-Chat-v1.0/dllama_model_TinyLlama-1.1B-Chat-v1.0_q40.m   --tokenizer models/TinyLlama-1.1B-Chat-v1.0//dllama_tokenizer_TinyLlama-1.1B-Chat-v1.0.t  --weights-float-type q40 --buffer-float-type q80 --nthreads 4  --workers 192.168.2.212:9998 192.168.2.213:9998 192.168.2.214:9998
💡 arch: llama
💡 hiddenAct: silu
💡 dim: 2048
💡 hiddenDim: 5632
💡 nLayers: 22
💡 nHeads: 32
💡 nKvHeads: 4
💡 vocabSize: 32000
💡 seqLen: 2048
💡 nSlices: 4
💡 ropeTheta: 10000.0
📄 bosId: 1
📄 eosId: 2
📄 chatEosId: 2
🕒 ropeCache: 4096 kB
⏩ Loaded 824584 kB
⭐ chat template: zephyr
🛑 stop: </s>
💻 System prompt (optional):
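
The failing dllama-api invocation is not shown above, but it would presumably mirror the chat command with the same model and tokenizer flags (hypothetical reconstruction; the exact options accepted by dllama-api may differ by version):

sudo nice -n -20 ./dllama-api --model models/TinyLlama-1.1B-Chat-v1.0/dllama_model_TinyLlama-1.1B-Chat-v1.0_q40.m --tokenizer models/TinyLlama-1.1B-Chat-v1.0/dllama_tokenizer_TinyLlama-1.1B-Chat-v1.0.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.2.212:9998 192.168.2.213:9998 192.168.2.214:9998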
b4rtaz commented 1 month ago

Have you rebuilt both applications?

make dllama
make dllama-api
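
Both binaries are compiled from the same shared sources, so a change to the tokenizer format only reaches dllama-api once it is recompiled as well. A typical recovery sequence, assuming a standard checkout of the repository:

git pull
make dllama
make dllama-api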
unclemusclez commented 1 month ago

... I rebuilt dllama but not dllama-api 🥸