Hi @sa1utyeggs,
Please use the original model in the Hugging Face format: WizardLM/WizardCoder-Python-34B-V1.0
Petals will automatically quantize the model to 4.5-bit (the NF4 format from the QLoRA paper) after downloading. In practice, the NF4 format is close to GPTQ in terms of efficiency.
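For reference, here's a minimal client-side sketch based on the usual Petals usage pattern (it assumes a swarm actually serving this model is reachable, and the prompt is arbitrary), just to show that no manual quantization step is needed:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# The full-precision repo; servers quantize its blocks to NF4 after downloading
model_name = "WizardLM/WizardCoder-Python-34B-V1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```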
Thank you for your answer, but the whole model file is too large to transfer over the Internet. I wonder if there is any chance to divide the whole model file into multiple block files, so that the transfer would be faster and easier. Then every peer could store only the blocks it needs, to save storage. Looking forward to your answer. :)
@sa1utyeggs,
This model is already divided into multiple files (the sharded checkpoint files in its Hugging Face repo), and Petals servers/clients will only download the necessary parts.
Note that these files store the weights in 16-bit, and their total size is ~4x larger than the model size in the repo you suggested. Unfortunately, that's necessary for now since Petals can't load a model that is already quantized (we're working on changing that).
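If you want to see the sharding yourself, here's a small sketch using huggingface_hub (the exact shard names depend on what the repo contains):

```python
from huggingface_hub import hf_hub_download, list_repo_files

repo = "WizardLM/WizardCoder-Python-34B-V1.0"

# List the checkpoint shards; a 34B model is split across several multi-GB files
# (the "pytorch_model-" prefix is an assumption -- some repos ship *.safetensors)
shards = [f for f in list_repo_files(repo) if f.startswith("pytorch_model-")]
print(shards)

# Download a single shard instead of the whole checkpoint
path = hf_hub_download(repo_id=repo, filename=shards[0])
print(path)
```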
Hello, I'm trying to run a quantized Llama model named TheBloke/WizardCoder-Python-34B-V1.0-GPTQ (here is the Hugging Face URL: https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ).
It is based on the Llama architecture. I tried to run it on Petals, but an error occurred: ![image](https://github.com/bigscience-workshop/petals/assets/63895563/cef7e63f-50ad-44e3-a0d8-3faee34db22e)
The following is the startup command: `python -m petals.cli.run_server /data/chat/models/WizardCoder-Python-34B-V1.0-GPTQ --dht_prefix WizardCoder-Python-34B-GPTQ --initial_peers $INITIAL_PEERS --num_blocks 10 --public_name V100-WizardCoder-Python-34B-GPTQ`
Looking for help. :)