Vahe1994 / AQLM

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
Apache License 2.0

Request for Nvidia's RAG Implementation of Llama-3-70B "ChatQA 1.5" #94

Closed: BuildBackBuehler closed this issue 2 months ago

BuildBackBuehler commented 4 months ago

I'd love to see this 'un as an AQLM 2-bit 1x16! I imagine it'd be a popular option for people -- at least it'd be the most logical solution for a RAG in my mind, for anyone with "limited" VRAM.

I am TERRIBLY "limited" with 64GB VRAM and figure that with a 2-bit RAG model + a 4-bit 70B-Instruct I'll actually be able to keep using my PC without issues/hiccups 😂 ...or on second thought I'd swap the 4-bit out for a 2-bit WizardLM2-8x22B, hint hint plz wink wink plz cough cough 🤪
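
(For a rough sanity check on whether those combos would even fit in 64GB, here's a quick back-of-the-envelope sketch; the ~2.1 and ~4.1 effective bits-per-weight figures and the ~141B total parameter count for the 8x22B are approximations, not numbers from this thread:)

    # Back-of-the-envelope weight-memory estimate. Ignores KV cache, activations,
    # and runtime overhead; bits-per-weight values are approximate effective sizes.
    def weight_gb(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1e9  # decimal GB

    rag_70b_2bit   = weight_gb(70, 2.1)    # ~18 GB, e.g. a 2-bit ChatQA-1.5-70B
    chat_70b_4bit  = weight_gb(70, 4.1)    # ~36 GB, e.g. a 4-bit 70B-Instruct
    moe_8x22b_2bit = weight_gb(141, 2.1)   # ~37 GB, e.g. a 2-bit WizardLM2-8x22B

    print(rag_70b_2bit + chat_70b_4bit)    # ~54 GB: tight but plausible in 64 GB
    print(rag_70b_2bit + moe_8x22b_2bit)   # ~55 GB: similarly tight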

Of course, given how long quantization takes and the handful of new models still to 2-bitify and introduce into the AQLM ecosystem, I'd understand this one being a low-priority addition to the mix.

https://huggingface.co/nvidia/Llama3-ChatQA-1.5-70B

justheuristic commented 3 months ago

Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing phi-3 medium, then qwen2, then this

BuildBackBuehler commented 3 months ago

> Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing phi-3 medium, then qwen2, then this

Not too bad of a wait! Worth being patient =)

Totally OT, but I'd like to (try to, at least) run a model with MLC-LLM, and I'm wondering what the equivalent of these quantization settings would be.

https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/quantization/quantization.py

I was figuring something like (for Llama-3-70B 2-bit 1x16, for ex.):

        "AQLM_2bit": GroupQuantize(
        name="AQLM_2bit",
        kind="group-quant",
        group_size=16,
        quantize_dtype="int2",
        storage_dtype="uint32",
        model_dtype="float16",
        linear_weight_layout="NK",

But I dunno -- with the group in vs. out, it needs to be divisible by the quant. dtype IIRC. So perhaps group_size=8 and, without a direct equivalent, adding nbits_per_codebook=16, num_codebooks=1.

justheuristic commented 3 months ago

Yes, you would indeed need group size 8.

1x 16-bit code per 8 weights gives you 2 bits per weight, plus some extra bits for the codebook itself.
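
For intuition, a rough back-of-the-envelope in Python (an illustrative sketch, not code from this repo; the 8192x8192 layer shape and fp16 codebook storage are assumptions, and per-group scales are ignored):

    # Approximate effective bits-per-weight for a 1x16 config with groups of 8 weights.
    # Illustrative only: layer shape and codebook dtype are assumed, scales ignored.
    def approx_bits_per_weight(num_codebooks=1, nbits_per_codebook=16, group_size=8,
                               in_features=8192, out_features=8192, codebook_bits=16):
        num_weights = in_features * out_features
        code_bits = num_codebooks * nbits_per_codebook / group_size    # 16 / 8 = 2.0
        codebook_entries = num_codebooks * 2 ** nbits_per_codebook     # 65,536 vectors
        codebook_overhead = codebook_entries * group_size * codebook_bits / num_weights
        return code_bits + codebook_overhead

    print(f"{approx_bits_per_weight():.3f} bits/weight")  # ~2.125 for an 8192x8192 layer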

I've posted an example training script for this exact case (a Llama-3-70B derivative at roughly 2 bits per weight) here: https://github.com/Vahe1994/AQLM/issues/98#issuecomment-2158750539

BuildBackBuehler commented 3 months ago

> Hi! Sorry for not responding to you for so long. Unfortunately, we are also quite limited on compute, but I've added this to the queue of models to quantize eventually. Right now we're processing phi-3 medium, then qwen2, then this

I was also interested in Wizard 8x22B @ 2-bit, and there was a thread created with regard to its base, Mixtral 8x22B. Any reason why one of 'em isn't on the to-do list? May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.

And thank you for the rapid responses and that comprehensive example script, super clutch! Shall I go ahead and close this or keep it open until the model is up and running?

Edit: Just noticed the PV-Tuning models, and a small FYI -- looks like the Llama 3-70B models are both linked to the same model.

AKA the Meta-Llama-3-70B | 1x16g16 entry is the one whose link needs correction.

justheuristic commented 3 months ago

Hi!

> May be overly presumptuous, but does it have to do with MoE? I see there is an AQLM Mixtral, but I'm guessing MoEs don't perform as well as a standard AQLM quant model.

We have quantized MoE models in the past, e.g. the Mixtral model you mentioned. To the best of my knowledge, MoE models quantize roughly as well as non-MoE ones.

> Any reason why one of 'em isn't on the to-do list?

There is no such reason, but unfortunately, this is not how it works.

We have very limited compute and "manpower", and what we have is split between research and model quantization. The research comes first because if we don't do that quickly enough, the lab will be shut down.

So, our "gpu budget" for releasing quantized models is only enough to quantize few most popular models like Llama and Phi 3. For the rest, we release the code in hope that volunteers will run that and publish their own pre-quantized models.

As such, we hope we can reach ChatQA or Wizard eventually, but we can't guarantee anything. If tomorrow we find out we need those GPUs to get our research done in time, the extra models will have to wait.

> the Llama 3-70B models are both linked to the same model.

Could you please specify which ones? (a link would be nice)

BuildBackBuehler commented 3 months ago

Hah, aw. On one hand, I know how competitive academic research funding can be, so it makes sense; on the other, with such a cutting-edge field and y'all producing the SotA quantization, I'm surprised ISTA doesn't have a hefty grant for you (=

I wish I had more computing power. I haven't read it in a while, but it sounded like anything less than an A100 is SOL. I'm using an M1 Max w/ 64GB VRAM; I'd be fine with it churning away in the background for days, but I imagine that even if I didn't OOM on a 70B or 8x22B model, it'd take a month 😂. Unless it's set up for multi-machine quantization, but that'd be a mess, and still only about half the power I'd need, I imagine (RTX 3070 / Tesla M40 24GB).

[screenshot of the PV-Tuning model list]

https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16 -- it's the 13GB model (3rd from the bottom) that also has this as its link. It appears that the model in question is not up on HuggingFace, so maybe that was an intentional placeholder?

justheuristic commented 3 months ago

Good catch! I fixed the link in this commit, with an acknowledgement.

Your setup would be enough to quantize smaller LLMs (e.g. 7B, maybe 13B with basic tweaking), but you are right that it would probably OOM for 70B+ models.
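
(For anyone else gauging their hardware, a rough sketch of why 70B+ is out of reach in 64GB during quantization; approximate numbers, not the repo's exact accounting, and actual usage depends on offloading and batch settings:)

    # The fp16 base weights alone blow a 64 GB budget for 70B+ models,
    # before calibration activations and optimizer state are even counted.
    def fp16_weights_gb(params_billion):
        return params_billion * 1e9 * 2 / 1e9  # 2 bytes per fp16 parameter

    print(fp16_weights_gb(7))    # ~14 GB  -> fits, with room for calibration
    print(fp16_weights_gb(13))   # ~26 GB  -> feasible with some tweaking/offloading
    print(fp16_weights_gb(70))   # ~140 GB -> does not fit even before activations
    print(fp16_weights_gb(141))  # ~282 GB -> 8x22B is further out of reach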

github-actions[bot] commented 2 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.