A possible "workaround" to try would be adding this `padding_mask` param in `QuantLlamaDecoderLayer`:
```python
def forward(
    self,
    hidden_states: torch.Tensor,
    attention_mask: Optional[torch.Tensor] = None,
    position_ids: Optional[torch.LongTensor] = None,
    past_key_value: Optional[Tuple[torch.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
    padding_mask: Optional[torch.Tensor] = None,  # <- Here
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
```
https://github.com/OpenGVLab/OmniQuant/blob/main/models/int_llama_layer.py#L220
It might work but I can't guarantee that it will lead to correct results :sweat_smile:
However, you can check the file /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py and make sure that `padding_mask` is not used anywhere in the computation. If that is true, then the workaround above might be OK.
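One quick way to do that check is a small script that prints every line of the installed modeling_llama.py mentioning `padding_mask`. This is just a minimal sketch; the path is the one from this thread, so adjust it to your own install:

```python
# Minimal sketch: scan the installed modeling_llama.py for uses of padding_mask.
# The path below is the one mentioned in this thread; change it to match your environment.
path = "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py"

with open(path) as f:
    for lineno, line in enumerate(f, start=1):
        if "padding_mask" in line:
            print(f"{lineno}: {line.rstrip()}")
```

If the only hits are in function signatures (the argument is accepted but never read), accepting and ignoring it in the quantized layer should be safe; if it actually feeds into the attention computation, silently dropping it could change results.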
Hi, I have a problem evaluating the quantized llama-2-7b model, can anyone help?
I quantized the llama-2-7b model with the command below:
and got the following error:
My environment is configured as follows: