Closed NamburiSrinath closed 1 year ago
Hi! Does the quantized model file work with .from_pretrained(<dir where .pt file is saved>) ? Or does this require extra code to load the quantized weights into an HF class?
Hi @haileyschoelkopf,
Thanks for your quick response :)
I don't think the HF from_pretrained supports loading of quantized models :(
Refer:
So, this is what I am doing at the moment: I am able to save the model in .bin format, but I failed to get it to load/work with the repo.
import os
import torch
from transformers import LlamaForCausalLM

quantize_layers = {torch.nn.Linear}
model = LlamaForCausalLM.from_pretrained('path_to_baseline_model')
# Dynamic int8 quantization of every nn.Linear module
quantized_model = torch.quantization.quantize_dynamic(model, quantize_layers, dtype=torch.qint8)
print(quantized_model)

quantized_output_dir = "quantized/"
if not os.path.exists(quantized_output_dir):
    os.makedirs(quantized_output_dir)
quantized_model.config.save_pretrained(quantized_output_dir)
torch.save(quantized_model.state_dict(), "quantized/pytorch_model.bin")
print("Model saved")
So, any suggestions/quick fixes I can do on my end to get this working?
This might be a helpful resource - https://discuss.huggingface.co/t/pegasus-model-weights-compression-pruning/6381/11
I see -- if you have any code to load this saved file into your model, you could add that yourself to the HFLM initialization method. Otherwise (or regardless), I think the best option for this would be to implement the ability to wrap an initialized HF model in the LM class, as described in https://github.com/EleutherAI/lm-evaluation-harness/issues/521, and perform quantization on your model before passing it to be wrapped by an LM class.
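Loading it back would then look roughly like this (a sketch, assuming the same quantize_dynamic spec and the paths from your snippet above): rebuild the fp32 model, re-apply the quantization, and load the saved state dict.

import torch
from transformers import LlamaForCausalLM

# Rebuild the fp32 architecture, re-apply the same dynamic quantization,
# then load the quantized state dict that was saved earlier.
model = LlamaForCausalLM.from_pretrained('path_to_baseline_model')
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
state_dict = torch.load("quantized/pytorch_model.bin", map_location="cpu")
quantized_model.load_state_dict(state_dict)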
Hi,
Thanks for the response. Correct me if my understanding is wrong!
Current way: Initialize HF model -> Quantize -> Try to pass it to LM class (fails)
Suggested way: Initialize HF model -> Pass it to LM class (works as it's still fp16/fp32) -> Quantize the model here (int8) -> Evaluate on tasks
Not quite:
Suggested way: Initialize HF model -> Quantize -> Try to pass it to LM class (passing initialized models not yet implemented, needs implementation)
Alternative: pass pretrained=<your model name> to --model_args, and write code to quantize the model after loading in HFLM.__init__()
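To make the alternative concrete, here is a rough sketch (not existing harness code) of a helper that could be called from HFLM's __init__ in lm_eval/models/gpt2.py right after the pretrained model is loaded; the quantization_flag name is just illustrative:

import torch

def maybe_quantize(model, quantization_flag=None):
    # Hypothetical helper: apply dynamic int8 quantization to every nn.Linear
    # in the already-loaded HF model when the flag requests it.
    if quantization_flag == "dynamic_int8":
        model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )
    return model

Inside __init__ the call would then be something like self.gpt2 = maybe_quantize(self.gpt2, quantization_flag).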
Cool, thanks a lot @haileyschoelkopf! I'll try the alternative way and get back to you in a couple of days in case I have any questions, hope that's ok!
Also, should I add the code in lm_eval/models/gpt2.py or lm_eval/models/huggingface.py? I am a bit confused as my model is a Llama-based one.
Hi @haileyschoelkopf, @StellaAthena
Thank you so much for your support! I tried the alternative way, but got stuck in the evaluate function.
Error stack
Running loglikelihood requests
0%|          | 15/12750 [00:00<08:47, 24.16it/s]
Traceback (most recent call last):
File "/lm-evaluation-harness/main.py", line 90, in <module>
main()
File "/lm-evaluation-harness/main.py", line 58, in main
results = evaluator.simple_evaluate(
File "/lm-evaluation-harness/lm_eval/utils.py", line 242, in _wrapper
return fn(*args, **kwargs)
File "l/lm-evaluation-harness/lm_eval/evaluator.py", line 94, in simple_evaluate
results = evaluate(
File "/lm-evaluation-harness/lm_eval/utils.py", line 242, in _wrapper
return fn(*args, **kwargs)
File "/lm-evaluation-harness/lm_eval/evaluator.py", line 288, in evaluate
resps = getattr(lm, reqtype)([req.args for req in reqs])
File "/lm-evaluation-harness/lm_eval/base.py", line 891, in fn
rem_res = getattr(self.lm, attr)(remaining_reqs)
File "/lm-evaluation-harness/lm_eval/base.py", line 201, in loglikelihood
return self._loglikelihood_tokens(new_reqs)
File "/lm-evaluation-harness/lm_eval/base.py", line 359, in _loglikelihood_tokens
self._model_call(batched_inps), dim=-1
File "/lm-evaluation-harness/lm_eval/models/gpt2.py", line 176, in _model_call
return self.gpt2(inps)[0]
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
outputs = self.model(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py", line 54, in forward
Y = torch.ops.quantized.linear_dynamic(
File "/u/s/r/srinath_97/srinath_97/anaconda_env/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: expected scalar type Float but found Half
I made the necessary changes as suggested by @haileyschoelkopf and added a quantization_flag arg in the HFLM init.
Command: nohup python main.py --model hf-causal --model_args pretrained=Llama_base_model_path,dtype=float16,quantization_flag='type_of_quantization' --tasks a,b,c --device cpu --batch_size=16 > output.log
Running on CPU, as PyTorch quantization is not supported on GPU. I am doing dynamic quantization on all Linear layers of Llama; here's the model for reference:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (k_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (v_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (o_proj): DynamicQuantizedLinear(in_features=4096, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): DynamicQuantizedLinear(in_features=4096, out_features=11008, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (down_proj): DynamicQuantizedLinear(in_features=11008, out_features=4096, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (up_proj): DynamicQuantizedLinear(in_features=4096, out_features=11008, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): DynamicQuantizedLinear(in_features=4096, out_features=32000, dtype=torch.qint8, qscheme=torch.per_tensor_affine)
)
Any help would be greatly appreciated and I can contribute if needed!
Thanks :)
Is your model compatible with AutoGPTQ? We recently added support for loading models with that framework.
If not, maybe you can base your implementation on that one.
I have to verify a few things regarding AutoGPTQ. In my use case, I am quantizing specific layers (so I am not sure if AutoGPTQ has the functionality to select the layers/modules that I want to quantize).
Also, Torch Dynamic Quantization has been around for a while, so when I started my work, I picked that one!
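For context, restricting dynamic quantization to specific layers can be done with plain PyTorch (a sketch, not my exact code): to my understanding, quantize_dynamic's qconfig_spec also accepts a set of submodule names, not just module types, so for example only the attention projections can be targeted.

import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained('path_to_baseline_model')

# Collect the names of just the attention projection Linears; passing names in
# qconfig_spec restricts quantization to those submodules only.
target_layers = {
    name
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear) and "self_attn" in name
}
quantized_model = torch.quantization.quantize_dynamic(model, target_layers, dtype=torch.qint8)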
Will let you know if I can get anything useful
Thanks :)
Hi,
Here's what I did - changed the dtype to float32, as PyTorch supports float32 for most quantization cases.
Command: nohup python main.py --model hf-causal --model_args pretrained=Llama_base_model_path,dtype=float32,quantization_flag='type_of_quantization' --tasks a,b,c --device cpu --batch_size=16 > output.log
Code change - as @haileyschoelkopf suggested, I did the quantization inside the HFLM init instead of loading an already-quantized model directly. And it turns out it works fine, at least for now.
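For anyone hitting the same error: the dynamically quantized Linear kernel expects fp32 activations, which is why the float16 run above failed with "expected scalar type Float but found Half". A minimal standalone repro (CPU):

import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 8))
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

qmodel(torch.randn(1, 8))         # works: fp32 activations
qmodel(torch.randn(1, 8).half())  # RuntimeError: expected scalar type Float but found Half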
So, closing the thread and will reopen a new one in case something is off!
Thanks so much :)
glad this worked for you!!
Hi,
Thanks for the repository, super helpful.
I have a model which I am quantizing, and I am planning to understand the effect by running a few tasks on this benchmark.
Here's the command
python main.py --model hf-causal --model_args pretrained=model-quantized.pt --tasks a,b --device cpu > output.log
Now, as I quantized the model, I am unable to use the save_pretrained method (ref - Issue1, Issue2). Otherwise I could have a checkpoint folder and use the command shown in the README. So, I am unable to figure out a way to pass the quantized model in the args!
Any help/directions will be much appreciated :)