Vahe1994 / AQLM

Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.pdf
Apache License 2.0

NaNs in sequence classifier output #106

Open timo-obrecht opened 2 weeks ago

timo-obrecht commented 2 weeks ago

Hello,

I get an enormous number of NaN and inf values in the outputs of quantized models used for sequence classification. This is not the case with non-quantized models, which never output NaNs regardless of how the sequence classification head is initialised. I conclude it must come from the quantization of the classification head.

This is not due to the base model itself outputting NaNs: I tested the quantized version for text generation without any problem.

This behavior is observed for several quantized models.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"
# model_id = "ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf"
# model_id = "ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-1Bit-1x16"
# model_id = "ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16"

quantized_model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device,
    num_labels=2,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
quantized_model.config.pad_token_id = tokenizer.eos_token_id

input_tokens = tokenizer([
    "There is a problem somewhere",
    "This is not related to the num_labels argument",
    "There are nan or inf way to many times to be normal"
    ],
    padding='longest'
    )

output = quantized_model(
    input_ids=torch.tensor(input_tokens['input_ids']).to(device),
    attention_mask=torch.tensor(input_tokens['attention_mask']).to(device)
    )
output.logits
# >>> tensor([[nan, nan],
#             [nan, nan],
#             [nan, nan]], device='cuda:0', dtype=torch.bfloat16,
#            grad_fn=<IndexBackward0>)

For now, replacing the last layer with a non-quantized one solves the issue, but it is probably not an ideal solution.

# Replace the quantized 'score' head with a freshly initialised nn.Linear.
# Note: the new head is randomly initialised, so it still needs to be trained
# (or have weights loaded) before its logits are meaningful.
new_head = torch.nn.Linear(
    quantized_model.score.in_features,
    quantized_model.score.out_features,
    device=device,
    dtype=torch.bfloat16,
)
quantized_model.score = new_head
type(quantized_model.score)
# >>> <class 'torch.nn.modules.linear.Linear'>

aqlm version: 1.1.6
torch: 2.3.1+cu121
nvcc version: 12.1
gcc version: 10.5.0
GPU: NVIDIA L40S

justheuristic commented 1 week ago

Hi!

As of right now, we only support quantized models with an LM head (i.e. AutoModelForCausalLM). For the classifier, my guess is that the model created an empty classification head that did not have any (quantized) weights associated with it. When initialized with empty data, quantized layers have undefined behavior.
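A quick way to see this (a sketch only; the exact class path of AQLM's quantized linear layer mentioned in the comments is an assumption) is to inspect the head right after loading:

print(type(quantized_model.score))
# If this prints an AQLM quantized linear class (e.g. aqlm.QuantizedLinear) rather
# than torch.nn.Linear, the head was converted to a quantized layer but its codes
# were never populated from the checkpoint, which would explain the NaN logits.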

Your workaround of explicitly initializing the classifier head looks like a good (perhaps the best) solution to this problem.

Unfortunately, we currently do not have time to explicitly support non-LM models (e.g. token classification, sequence classification, QA, etc.). If you have the bandwidth to implement such support without breaking LM models, we would welcome it as a pull request.
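
For reference, a rough sketch of the kind of post-load fix such a PR could wrap up (untested; head_name="score" and bias=False follow the Llama/Mistral-style sequence-classification heads in Transformers and are assumptions here):

import torch

def replace_quantized_classifier_head(model, head_name="score"):
    # Swap the (empty) quantized classification head for a plain nn.Linear,
    # matching the device and dtype of the rest of the model.
    old_head = getattr(model, head_name)
    param = next(model.parameters())
    new_head = torch.nn.Linear(
        old_head.in_features,
        old_head.out_features,
        bias=False,
        device=param.device,
        dtype=param.dtype,
    )
    setattr(model, head_name, new_head)
    return model

The replacement head is randomly initialised, so it still has to be fine-tuned (or loaded from a separately trained checkpoint) before the classifier outputs are meaningful.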