casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Bug - mixtral qlora(after b&b peft train) quantization broadcast problem #440

Open orellavie1212 opened 7 months ago

orellavie1212 commented 7 months ago

Hello, I wanted to quantize the model with AWQ after merging a QLoRA (bitsandbytes NF4) adapter into a Mixtral MoE model.

The error is:

self._search_best_scale(self.modules[i], **layer)
  File "/home/access/anaconda3/envs/sec_qlora_replicate/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/access/anaconda3/envs/sec_qlora_replicate/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 274, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/home/access/anaconda3/envs/sec_qlora_replicate/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 329, in _compute_best_scale
    fc.weight.mul_(scales_view)
RuntimeError: output with shape [8388608, 1] doesn't match the broadcast shape [8388608, 4096]

It looks like a broadcast problem between the weights and the scales.
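For context, a minimal plain-PyTorch sketch (not AutoAWQ code) that reproduces the same in-place broadcast failure. The packed-weight shape is an assumption based on how bitsandbytes stores two 4-bit values per byte, which for a 4096x4096 layer gives 4096 * 4096 / 2 = 8388608 packed entries:

    import torch

    # Assumed shape of a bnb Linear4bit packed weight for a 4096x4096 layer:
    # two 4-bit values per byte -> (4096 * 4096 // 2, 1) = (8388608, 1).
    packed_weight = torch.zeros(8388608, 1)

    # AWQ applies per-input-channel scales viewed as (1, in_features).
    scales_view = torch.ones(1, 4096)

    # In-place mul_ cannot write the broadcast result (8388608, 4096) back into
    # a (8388608, 1) tensor, so this raises:
    # RuntimeError: output with shape [8388608, 1] doesn't match the broadcast shape [8388608, 4096]
    packed_weight.mul_(scales_view)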

Here is the quantization code:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    quant_config = {
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": "GEMM",
        "modules_to_not_convert": ["gate"],
    }

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(
        model_id, local_files_only=True, ignore_mismatched_sizes=True,
        low_cpu_mem_usage=True, use_cache=False,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

Based on: https://huggingface.co/casperhansen/mixtral-instruct-awq
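The step that normally follows quantize() is saving the quantized weights; a minimal sketch, where quant_path is a hypothetical output directory and not part of the failing script above:

    # Save the AWQ-quantized model and tokenizer to a local directory.
    quant_path = "mixtral-qlora-merged-awq"  # hypothetical path
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)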

Here is the requirements.txt:

pandas==2.2.1
torch>=2.2.0
torchvision>=0.17.0
transformers==4.39.3
deepspeed==0.14.0
accelerate==0.29.1
trl==0.8.1
peft==0.10.0
tqdm==4.66.2
datasets==2.18.0
flash-attn==2.5.6
optimum>=1.18.0
auto-gptq>=0.7.1
bitsandbytes>=0.43.0
evaluate
git+https://github.com/casper-hansen/AutoAWQ.git@e9f62694a867a7a0b2f5e469fcbd914ce5ae0970

I tried AutoAWQ at the latest commit, e9f62694a867a7a0b2f5e469fcbd914ce5ae0970, because of transformers 4.39.3. I also tried version 0.2.4 from PyPI with transformers 4.38.x, without success.

The LoRA profile is standard (no special w1, w2, ... layers, just the regular q/k/v/o projections; I tried w1, w2, w3 before with the same error):

    from peft import LoraConfig

    lora_r = 8
    lora_alpha = 2 * lora_r
    lora_dropout = 0.1
    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        bias="none",
        task_type="CAUSAL_LM",
        lora_dropout=lora_dropout,
    )
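A minimal sketch of how a config like this is typically attached to an NF4 bitsandbytes base model in a QLoRA setup; model_id is a placeholder and the exact training script used here may differ:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import get_peft_model, prepare_model_for_kbit_training

    # Load the base model in 4-bit NF4, as in a standard QLoRA recipe (assumed setup).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
    base = prepare_model_for_kbit_training(base)

    # Attach the LoRA adapters defined by `config` above.
    peft_model = get_peft_model(base, config)
    peft_model.print_trainable_parameters()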

@casper-hansen It is also based on the assumption that this person succeeded in doing it: https://huggingface.co/Nondzu/Mistral-7B-code-16k-qlora. They produced both GPTQ and AWQ models after QLoRA with bitsandbytes.

casper-hansen commented 7 months ago

We don't support LoRA modules. You would have to convert your model to standard weights.

orellavie1212 commented 7 months ago

> We don't support LoRA modules. You would have to convert your model to standard weights.

Let's say the standard weights without the adapter do work, what can I do then? It feels like a dead end. I don't know how it worked for this person: https://huggingface.co/Nondzu/Mistral-7B-code-16k-qlora
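For reference, "convert to standard weights" usually means merging the adapter into an fp16/bf16 copy of the base model rather than into the bitsandbytes 4-bit one. A minimal sketch of that flow, where model_id, adapter_path, and merged_path are placeholders and this is an assumed workflow, not a confirmed fix:

    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    # Load the base model in half precision (NOT in 4-bit), so the merged
    # weights end up as ordinary dense tensors that AWQ can scale.
    base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

    # Attach the trained QLoRA adapter and fold it into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

    # Save standard (non-bnb) weights; this folder can then be passed to
    # AutoAWQForCausalLM.from_pretrained(...) for AWQ quantization.
    merged.save_pretrained(merged_path)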

orellavie1212 commented 7 months ago

@casper-hansen By the way, I tried to fix the code on main today, but hit a problem I'm not sure how to handle. In quantizer.py and scale.py, for example https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L332, you assume that the scales can be multiplied into fc.weight. But when the incoming layer is a Linear4bit, its weight shape is (xxx, 1), while the scales have shape 4096, for example (before the view). I wondered whether the right fix is to torch.repeat the scales to the dimension of fc, or simply to skip Linear4bit layers (the layers coming from b&b quantization) altogether. The same thing happens in many places in the code and in scale.py too, for both weights and biases.

Also, inside some functions, for example https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L66, you assume the right dim to take is -1 and not 0 (batch), but in cases where there are only 2 dims, it could be that -1 is the right dim. I wondered what to do (I could implement both, but I'm not sure which is more correct).
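To make the "skip Linear4bit layers" option concrete, a hypothetical guard one could add before the in-place scaling; the helper name and placement are mine and do not exist in AutoAWQ:

    import bitsandbytes as bnb

    def is_bnb_4bit(module) -> bool:
        """Hypothetical helper: detect bitsandbytes-quantized linears whose
        packed (N, 1) weight cannot be scaled in place by AWQ."""
        return isinstance(module, bnb.nn.Linear4bit)

    # Inside the scaling loop, such layers could then be skipped:
    # if is_bnb_4bit(fc):
    #     continue
    # fc.weight.mul_(scales_view)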

This is the merged model's architecture:

    MixtralForCausalLM(
      (model): MixtralModel(
        (embed_tokens): Embedding(32000, 4096)
        (layers): ModuleList(
          (0-31): 32 x MixtralDecoderLayer(
            (self_attn): MixtralFlashAttention2(
              (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
              (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): MixtralRotaryEmbedding()
            )
            (block_sparse_moe): MixtralSparseMoeBlock(
              (gate): Linear4bit(in_features=4096, out_features=8, bias=False)
              (experts): ModuleList(
                (0-7): 8 x MixtralBlockSparseTop2MLP(
                  (w1): Linear4bit(in_features=4096, out_features=14336, bias=False)
                  (w2): Linear4bit(in_features=14336, out_features=4096, bias=False)
                  (w3): Linear4bit(in_features=4096, out_features=14336, bias=False)
                  (act_fn): SiLU()
                )
              )
            )
            (input_layernorm): MixtralRMSNorm()
            (post_attention_layernorm): MixtralRMSNorm()
          )
        )
        (norm): MixtralRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )

@casper-hansen