b4rtaz / distributed-llama

Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
MIT License

Hugging Face models without tokenizer.model file #93

Closed EntusiastaIApy closed 5 days ago

EntusiastaIApy commented 1 week ago

Hey, @b4rtaz! First of all, thanks for this amazing work! I've been exploring the functionalities of Distributed Llama for a while, and everything works great when I run your pre-converted models downloaded from https://huggingface.co/b4rtaz on my 8x Raspberry Pi 4 GB cluster.

The problem is when I try to convert and run models from HF myself. Even though your instructions state that there should be a tokenizer.model file in the HF model folder, I've noticed that many HF models don't come with such a file (for example: https://huggingface.co/Dogge/llama-3-70B-instruct-uncensored). Although this model doesn't have a tokenizer.model file, I've succeeded in generating dllama_model.m and dllama_tokenizer.t files from it using your Python scripts. Nevertheless, I can't get the model to run. Distributed Llama just throws a "Killed" and "terminate called after throwing an instance of 'ReadSocketException'" error.

Is the problem the absence of the tokenizer.model file? If so, is it possible to generate it?

I'm sorry if this is a newbie question, but I am, indeed, a newbie in LLM/AI stuff. I'd really appreciate it if you could point out what I am doing wrong.

b4rtaz commented 6 days ago

Hello, I'm not sure if I understand. The model that you pasted has the tokenizer here. The convert-tokenizer-hf.py file supports two types of tokenizer files, one of them is tokenizer.json format. So it may work correctly.
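For example, a quick way to check which tokenizer file a downloaded model folder actually contains before running the converter (a minimal sketch; the folder path is just an example):

```python
from pathlib import Path

# Example path to the downloaded Hugging Face model folder (adjust as needed).
model_dir = Path("models/llama-3-70B-instruct-uncensored")

# convert-tokenizer-hf.py can work from either a sentencepiece tokenizer.model
# or a tokenizer.json file, so having either one present is enough.
has_sentencepiece = (model_dir / "tokenizer.model").exists()
has_json = (model_dir / "tokenizer.json").exists()

if has_sentencepiece or has_json:
    print("Tokenizer found:", "tokenizer.model" if has_sentencepiece else "tokenizer.json")
else:
    print(f"No tokenizer.model or tokenizer.json found in {model_dir}")
```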

> Nevertheless, I can't get the model to run. Distributed Llama just throws a "Killed" and "terminate called after throwing an instance of 'ReadSocketException'" error.

Are you trying to run 70B on 8x Raspberry Pi 4 GB devices? That gives 32 GB of RAM in total. Llama 2 70B Q40 is a 36.98 GB model, so I think your setup may have too little RAM.
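Roughly, the arithmetic looks like this (a back-of-the-envelope sketch that assumes the weights are split evenly across nodes and ignores per-node overhead such as the OS, KV cache and buffers):

```python
# Back-of-the-envelope check: does the quantized model fit in the cluster?
model_size_gb = 36.98    # Llama 2 70B with Q40 weights, as quoted above
nodes = 8
ram_per_node_gb = 4.0    # Raspberry Pi 4, 4 GB variant

per_node_share_gb = model_size_gb / nodes
print(f"Each node must hold roughly {per_node_share_gb:.2f} GB of weights")
print("Fits in RAM:", per_node_share_gb < ram_per_node_gb)
# -> about 4.62 GB per node, more than the 4 GB available, which is
#    consistent with the process being killed by the out-of-memory killer.
```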

EntusiastaIApy commented 6 days ago

> Hello, I'm not sure if I understand. The model that you pasted has the tokenizer here. The convert-tokenizer-hf.py file supports two types of tokenizer files, one of them is tokenizer.json format. So it may work correctly.

I guess I misunderstood the instructions to convert an HF model to Distributed Llama format (https://github.com/b4rtaz/distributed-llama/blob/main/docs/HUGGINGFACE.md). And, as I couldn't get the converted model to run, I thought the problem was that maybe the model should contain both tokenizer.model and tokenizer.json files. But I see my mistake now.

> Are you trying to run 70B on 8x Raspberry Pi 4 GB devices? That gives 32 GB of RAM in total. Llama 2 70B Q40 is a 36.98 GB model, so I think your setup may have too little RAM.

And, yes, although I'm actually running a Raspberry Pi 4B cluster with one 8 GB unit and seven 4 GB units (I made a mistake in the previous cluster description), I now understand it still doesn't have enough RAM for a 70B model. I've just successfully converted and run a smaller model from HF, so the problem with the previous model must really have been the lack of RAM.

So, thanks for your time and thanks again for this great project!