Hi @sa1utyeggs,
Please use the original model in the Hugging Face format: WizardLM/WizardCoder-Python-34B-V1.0
Petals will automatically quantize the model to 4.5-bit (the NF4 format from the QLoRA paper) after downloading. In practice, the NF4 format is close to GPTQ in terms of efficiency.
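For reference, here's a minimal client-side sketch based on the usual Petals usage pattern (it assumes a swarm actually serving this model is reachable, and the prompt is arbitrary), just to show that no manual quantization step is needed:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# The full-precision repo; servers quantize its blocks to NF4 after downloading
model_name = "WizardLM/WizardCoder-Python-34B-V1.0"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```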
Thank you for your answer, but the whole model file is too large to transfer over the Internet. I wonder if there is any chance to divide the whole model file into multiple block files, so that the transfer would be faster and easier. Then every peer could store only the blocks it needs, to save storage. Looking forward to your answer. :)
@sa1utyeggs,
This model is already divided into multiple files (the sharded checkpoint files in its Hugging Face repo), and Petals servers/clients will only download the necessary parts.
Note that these files store the weights in 16-bit, and their total size is ~4x larger than the model size in the repo you suggested. Unfortunately, that's necessary for now since Petals can't load a model that is already quantized (we're working on changing that).
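If you want to see the sharding yourself, here's a small sketch using huggingface_hub (the exact shard names depend on what the repo contains):

```python
from huggingface_hub import hf_hub_download, list_repo_files

repo = "WizardLM/WizardCoder-Python-34B-V1.0"

# List the checkpoint shards; a 34B model is split across several multi-GB files
# (the "pytorch_model-" prefix is an assumption -- some repos ship *.safetensors)
shards = [f for f in list_repo_files(repo) if f.startswith("pytorch_model-")]
print(shards)

# Download a single shard instead of the whole checkpoint
path = hf_hub_download(repo_id=repo, filename=shards[0])
print(path)
```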
Hello, I'm trying to run a quantized Llama model named TheBloke/WizardCoder-Python-34B-V1.0-GPTQ (here is the Hugging Face URL: https://huggingface.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ).
It is based on the Llama architecture. I tried to run it on Petals, but an error occurred: ![image](https://github.com/bigscience-workshop/petals/assets/63895563/cef7e63f-50ad-44e3-a0d8-3faee34db22e)
The following is the startup command: `python -m petals.cli.run_server /data/chat/models/WizardCoder-Python-34B-V1.0-GPTQ --dht_prefix WizardCoder-Python-34B-GPTQ --initial_peers $INITIAL_PEERS --num_blocks 10 --public_name V100-WizardCoder-Python-34B-GPTQ`
Looking for help. :)