Open orellavie1212 opened 7 months ago
We don't support LoRA modules. You would have to convert your model to standard weights.
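For reference, a minimal sketch of what that conversion could look like with PEFT, assuming the adapter was trained with the `peft` library; the model paths and dtype here are placeholders:

```python
# Hypothetical sketch: merge a (Q)LoRA adapter into the base model and save
# plain fp16 weights, so AWQ sees standard nn.Linear layers instead of
# bitsandbytes Linear4bit modules. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # base model the adapter was trained on
adapter_id = "path/to/qlora-adapter"                # local or Hub adapter path

# Load the base model in half precision (NOT in 4-bit) so the merge
# produces ordinary dense weights.
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()

merged.save_pretrained("mixtral-merged-fp16")
AutoTokenizer.from_pretrained(base_id).save_pretrained("mixtral-merged-fp16")
```

The merged folder then contains ordinary fp16 Linear layers, which is the form AutoAWQ expects as input.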
Let's say the standard weights without the adapter are working, what then can I do? I feel like it is a dead end. How this worked for this guy, I don't know: https://huggingface.co/Nondzu/Mistral-7B-code-16k-qlora
@casper-hansen btw, I tried to fix the code on main today, but hit a problem I am not sure how to handle in quantizer.py and scaler.py. For example: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L332

You assume the scales can be multiplied into fc.weight, but when the incoming fc layer is a Linear4bit, its weight shape is (xxx, 1), while the shape of the scales is, for example, 4096 (before the view). I wondered whether the right fix is to torch.repeat the scales up to the dim of fc, or to just skip Linear4bit layers (the 4-bit layers produced by the b&b quantization) altogether. The same thing happens in many places in the code, and in scaler.py too, so I wonder whether to apply that everywhere (for weights and biases, of course).

Also, inside some functions you assume the right dim to take is -1 and not 0 (batch), like here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L66 But when there are only 2 dims, it is not obvious that -1 is the right dim. I wondered what to do (I could implement both, but I'm not sure which one is more correct).
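To make the shape mismatch concrete, here is a small illustrative sketch (plain tensors only, not AutoAWQ code) of why a per-input-channel scale vector cannot be multiplied into a Linear4bit-style packed weight; the shapes follow the 4096-dim Mixtral projections, everything else is hypothetical:

```python
# Illustrative sketch of the broadcast problem described above. A bitsandbytes
# Linear4bit stores its weight as a packed uint8 buffer of shape
# (out_features * in_features / 2, 1), while the scaling code expects a dense
# (out_features, in_features) matrix it can scale in place per input channel.
import torch

in_features, out_features = 4096, 4096
scales = torch.rand(in_features)

dense_w = torch.randn(out_features, in_features)  # what a standard nn.Linear holds
packed_w = torch.zeros(out_features * in_features // 2, 1, dtype=torch.uint8)  # Linear4bit-style buffer

dense_w.mul_(scales.view(1, -1))        # fine: broadcasts across the rows
try:
    packed_w.mul_(scales.view(1, -1))   # fails: neither shape nor dtype is compatible
except RuntimeError as e:
    print("cannot scale a Linear4bit-shaped weight in place:", e)

# A possible (hypothetical) guard inside the scaling loops: skip bitsandbytes
# 4-bit layers, or dequantize them back to nn.Linear first.
# import bitsandbytes as bnb
# if isinstance(fc, bnb.nn.Linear4bit):
#     continue
```

Repeating the scales to match the packed buffer would not give a meaningful result either, because the 4-bit values are block-quantized, which is why skipping the 4-bit layers or dequantizing them first seems like the safer option.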
This is the merged model's layer architecture:

```
MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MixtralDecoderLayer(
        (self_attn): MixtralFlashAttention2(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear4bit(in_features=4096, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBlockSparseTop2MLP(
              (w1): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (w2): Linear4bit(in_features=14336, out_features=4096, bias=False)
              (w3): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (act_fn): SiLU()
            )
          )
        )
        (input_layernorm): MixtralRMSNorm()
        (post_attention_layernorm): MixtralRMSNorm()
      )
    )
    (norm): MixtralRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
```

@casper-hansen
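Given that every projection in this dump is a bitsandbytes Linear4bit, one option besides patching AutoAWQ is to dequantize those layers back to plain nn.Linear before running AWQ. The sketch below is a rough idea only: `dequantize_4bit` and `quant_state` come from bitsandbytes (their exact signatures have shifted between releases), and both helper functions are hypothetical names:

```python
# Hypothetical helper: walk the merged model and replace every bitsandbytes
# Linear4bit with a plain fp16 nn.Linear holding the dequantized weight, so
# AutoAWQ only ever sees standard layers. Assumes the model is on GPU, since
# the bitsandbytes dequantization kernels are CUDA-only.
import torch
import torch.nn as nn
import bitsandbytes as bnb
import bitsandbytes.functional as F


def dequantize_linear4bit(module: bnb.nn.Linear4bit) -> nn.Linear:
    # Params4bit keeps the quantization metadata on .quant_state.
    w = F.dequantize_4bit(module.weight.data, module.weight.quant_state).to(torch.float16)
    new_fc = nn.Linear(module.in_features, module.out_features, bias=module.bias is not None)
    new_fc.weight = nn.Parameter(w, requires_grad=False)
    if module.bias is not None:
        new_fc.bias = nn.Parameter(module.bias.data.to(torch.float16), requires_grad=False)
    return new_fc


def replace_4bit_layers(model: nn.Module) -> nn.Module:
    for name, child in model.named_children():
        if isinstance(child, bnb.nn.Linear4bit):
            setattr(model, name, dequantize_linear4bit(child))
        else:
            replace_4bit_layers(child)
    return model
```

After the replacement the model can be saved with save_pretrained and quantized as an ordinary fp16 checkpoint; it still carries the NF4 rounding error numerically, but structurally it is what AWQ expects.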
Hello, I wanted to quantize the model via AWQ after merging a QLoRA (b&b NF4) Mixtral MoE.
The error is:

It looks like a broadcast problem with the data.

Here's the quantize profile,
based on: https://huggingface.co/casperhansen/mixtral-instruct-awq
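The quantize profile itself isn't reproduced in this thread. For context, the usual AutoAWQ flow (as shown in the AutoAWQ examples and on that model card) looks roughly like the sketch below; the model paths are placeholders for the merged checkpoint:

```python
# Standard AutoAWQ quantization flow (per the AutoAWQ examples); paths are
# placeholders for the merged Mixtral checkpoint and the output directory.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mixtral-merged-fp16"
quant_path = "mixtral-merged-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```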
Here's the req.txt
Tried AutoAWQ at the latest commit, e9f62694a867a7a0b2f5e469fcbd914ce5ae0970, because of transformers 4.39.3. Also tried version 0.2.4 from PyPI with transformers 4.38.x, without success.
The LoRA profile is standard (no special w1, w2, ... layers, just the regular q/k/v/o projections; I tried before with w1, w2, w3 as well, same error).
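For clarity, a "standard" profile in the sense above would look roughly like the following PEFT LoraConfig; this is a hypothetical reconstruction for illustration, not the actual config used here, and the r/alpha/dropout values are placeholders:

```python
# Hypothetical reconstruction of a "standard" QLoRA config targeting only the
# attention projections (q/k/v/o), as described above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # The earlier attempt additionally targeted the expert MLPs:
    # target_modules=[..., "w1", "w2", "w3"],
)
```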
@casper-hansen it is also based on the assumption that this guy succeeded in doing it: https://huggingface.co/Nondzu/Mistral-7B-code-16k-qlora. He got GPTQ and AWQ quants working successfully after QLoRA with b&b.