Closed: maziyarpanahi closed this issue 2 months ago.
CUDA_VISIBLE_DEVICES=0 python quantize.py
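For reference, this is roughly the shape of the examples/quantize.py script that command runs; a minimal sketch assuming the standard AutoAWQ API, with the model and output paths as placeholders:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder source model
quant_path = "llama-3-8b-instruct-awq"              # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the unquantized model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration and quantization (uses the default calibration dataset)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```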
I quantized Llama 3 8B on a single 4090, and Llama 3 70B on multiple 48 GB GPUs. I'm not sure how to reproduce this, as I didn't experience the same error.
Closing this, as Llama 3 is definitely already supported. Using examples/quantize.py worked without any modifications on the first try for both sizes of the model.
https://huggingface.co/casperhansen/llama-3-8b-instruct-awq
https://huggingface.co/casperhansen/llama-3-70b-instruct-awq
It's not a question of VRAM; mine failed on 4× A100 80 GB GPUs. Quantizing models via AWQ usually uses multiple GPUs, and it's quick that way. I'm not sure CUDA_VISIBLE_DEVICES=0 would be as fast, but it's a workaround. Thank you.
PS: I use AWQ via Hugging Face Transformers, so it is possible there is something in there too.
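If the Transformers integration is involved, note that it handles loading already-quantized AWQ checkpoints rather than the quantization step itself. A minimal loading sketch, with the model id taken from the links above and the autoawq and accelerate packages assumed installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: one of the AWQ checkpoints linked above
model_id = "casperhansen/llama-3-8b-instruct-awq"

# Transformers picks up the AWQ quantization config from the checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```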
I am not able to quantize these new Llama-3 models: