Closed NamburiSrinath closed 1 year ago
Hi! Does the quantized model file work with .from_pretrained(<dir where .pt file is saved>) ? Or does this require extra code to load the quantized weights into an HF class?
Hi @haileyschoelkopf,
Thanks for your quick response :)
I don't think the HF from_pretrained supports loading of quantized models :(
Refer:
So, this is what I am doing at the moment: I am able to save the model in .bin format, but I failed to get it to load/work with the repo.
import os
import torch
from transformers import LlamaForCausalLM

quantize_layers = {torch.nn.Linear}
model = LlamaForCausalLM.from_pretrained('path_to_baseline_model')
# Dynamic int8 quantization of every nn.Linear module
quantized_model = torch.quantization.quantize_dynamic(model, quantize_layers, dtype=torch.qint8)
print(quantized_model)

quantized_output_dir = "quantized/"
if not os.path.exists(quantized_output_dir):
    os.makedirs(quantized_output_dir)
quantized_model.config.save_pretrained(quantized_output_dir)
torch.save(quantized_model.state_dict(), "quantized/pytorch_model.bin")
print("Model saved")
So, any suggestions/quick fixes I can do on my end to get this working?
This might be a helpful resource - https://discuss.huggingface.co/t/pegasus-model-weights-compression-pruning/6381/11
I see -- if you have any code to load this saved file into your model, you could add that yourself to the HFLM initialization method. Otherwise (or regardless), I think the best option for this would be to implement the ability to wrap an initialized HF model in the LM class, as described in https://github.com/EleutherAI/lm-evaluation-harness/issues/521, and perform quantization on your model before passing it to be wrapped by an LM class.
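Loading it back would then look roughly like this (a sketch, assuming the same quantize_dynamic spec and the paths from your snippet above): rebuild the fp32 model, re-apply the quantization, and load the saved state dict.

import torch
from transformers import LlamaForCausalLM

# Rebuild the fp32 architecture, re-apply the same dynamic quantization,
# then load the quantized state dict that was saved earlier.
model = LlamaForCausalLM.from_pretrained('path_to_baseline_model')
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
state_dict = torch.load("quantized/pytorch_model.bin", map_location="cpu")
quantized_model.load_state_dict(state_dict)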
Hi,
Thanks for the response. Correct me if my understanding is wrong!
Current way: Initialize HF model -> Quantize -> Try to pass it to LM class (fails)
Suggested way: Initialize HF model -> Pass it to LM class (works as it's still fp16/fp32) -> Quantize the model here (int8) -> Evaluate on tasks
Not quite:
Suggested way: Initialize HF model -> Quantize -> Try to pass it to LM class (passing initialized models not yet implemented, needs implementation)
Alternative: pass pretrained=<your model name> to --model_args, and write code to quantize the model after loading in HFLM.__init__()
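To make the alternative concrete, here is a rough sketch (not existing harness code) of a helper that could be called from HFLM's __init__ in lm_eval/models/gpt2.py right after the pretrained model is loaded; the quantization_flag name is just illustrative:

import torch

def maybe_quantize(model, quantization_flag=None):
    # Hypothetical helper: apply dynamic int8 quantization to every nn.Linear
    # in the already-loaded HF model when the flag requests it.
    if quantization_flag == "dynamic_int8":
        model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )
    return model

Inside __init__ the call would then be something like self.gpt2 = maybe_quantize(self.gpt2, quantization_flag).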
Cool, thanks a lot @haileyschoelkopf! I'll try the alternative way and get back to you in a couple of days in case I have any questions, hope that's ok!
Also, should I add the code in lm_eval/models/gpt2.py or lm_eval/models/huggingface.py? I am a bit confused as my model is a Llama-based one.
Hi @haileyschoelkopf, @StellaAthena
Thank you so much for your support! I tried the alternative way, but got stuck in the evaluate function.
Error stack
Running loglikelihood requests
0%|          | 15/12750 [00:00<08:47, 24.16it/s]
Traceback (most recent call last):
File "/lm-evaluation-harness/main.py", line 90, in <module>
main()
File "/lm-evaluation-harness/main.py", line 58, in main
results = evaluator.simple_evaluate(
File "/lm-evaluation-harness/lm_eval/utils.py", line 242, in _wrapper
return fn(*args, **kwargs)
File "l/lm-evaluation-harness/lm_eval/evaluator.py", line 94, in simple_evaluate
results = evaluate(
File "/lm-evaluation-harness/lm_eval/utils.py", line 242, in _wrapper
return fn(*args, **kwargs)
File "/lm-evaluation-harness/lm_eval/evaluator.py", line 288, in evaluate
resps = getattr(lm, reqtype)([req.args for req in reqs])
File "/lm-evaluation-harness/lm_eval/base.py", line 891, in fn
rem_res = getattr(self.lm, attr)(remaining_reqs)
File "/lm-evaluation-harness/lm_eval/base.py", line 201, in loglikelihood
return self._loglikelihood_tokens(new_reqs)
File "/lm-evaluation-harness/lm_eval/base.py", line 359, in _loglikelihood_tokens
self._model_call(batched_inps), dim=-1
File "/lm-evaluation-harness/lm_eval/models/gpt2.py", line 176, in _model_call
return self.gpt2(inps)[0]
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 54, in forward
Y = torch.ops.quantized.linear_dynamic(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: expected scalar type Float but found Half
I made the necessary changes as suggested by @haileyschoelkopf and added a quantization_flag arg in the HFLM init.
Command: nohup python main.py --model hf-causal --model_args pretrained=Llama_base_model_path,dtype=float16,quantization_flag='type_of_quantization' --tasks a,b,c --device cpu --batch_size=16 > output.log
Running on CPU, as PyTorch quantization is not supported on GPU. I am doing dynamic quantization on all Linear layers of Llama; here's the model for reference:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (k_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (v_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (o_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): DynamicQuantizedLinear(in_features=4096, out_features=11008, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (down_proj): DynamicQuantizedLinear(in_features=11008, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (up_proj): DynamicQuantizedLinear(in_features=4096, out_features=11008, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): DynamicQuantizedLinear(in_features=4096, out_features=32000, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
Any help would be greatly appreciated and I can contribute if needed!
Thanks :)
Is your model compatible with AutoGPTQ? We recently added support for loading models with that framework.
If not, maybe you can base your implementation on that one.
I have to verify a few things regarding AutoGPTQ. In my use case, I am quantizing specific layers (so I am not sure if AutoGPTQ has the functionality to select the layers/modules that I want to quantize).
Also, Torch Dynamic Quantization has been around for a while, so when I started my work, I picked that one!
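For context, restricting dynamic quantization to specific layers can be done with plain PyTorch (a sketch, not my exact code): to my understanding, quantize_dynamic's qconfig_spec also accepts a set of submodule names, not just module types, so for example only the attention projections can be targeted.

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained('path_to_baseline_model')

# Collect the names of just the attention projection Linears; passing names in
# qconfig_spec restricts quantization to those submodules only.
target_layers = {
    name
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear) and "self_attn" in name
}
quantized_model = torch.quantization.quantize_dynamic(model, target_layers, dtype=torch.qint8)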
Will let you know if I can get anything useful
Thanks :)
Hi,
Here's what I did - changed the dtype to float32, as PyTorch supports float32 for most quantization cases.
Command: nohup python main.py --model hf-causal --model_args pretrained=Llama_base_model_path,dtype=float32,quantization_flag='type_of_quantization' --tasks a,b,c --device cpu --batch_size=16 > output.log
Code change - as @haileyschoelkopf suggested, I did the quantization inside the HFLM init instead of loading an already-quantized model directly. And it turns out it works fine, at least for now.
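For anyone hitting the same error: the dynamically quantized Linear kernel expects fp32 activations, which is why the float16 run above failed with "expected scalar type Float but found Half". A minimal standalone repro (CPU):

import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8))
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

qmodel(torch.randn(1, 8))         # works: fp32 activations
qmodel(torch.randn(1, 8).half())  # RuntimeError: expected scalar type Float but found Half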
So, closing the thread and will reopen a new one in case something is off!
Thanks so much :)
glad this worked for you!!
Hi,
Thanks for the repository, super helpful.
I have a model which I am quantizing, and I am planning to understand the effect by running a few tasks on this benchmark.
Here's the command
python main.py --model hf-causal --model_args pretrained=model-quantized.pt --tasks a,b --device cpu > output.log
Now, as I quantized the model, I am unable to use the save_pretrained method (ref - Issue1, Issue2). Otherwise I could have a checkpoint folder and use the command shown in the README. So, I am unable to figure out a way to pass the quantized model in the args!
Any help/directions will be much appreciated :)