IST-DASLab / QUIK

Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
Apache License 2.0

int8FusedDequantizeCUDA Inference Results are Incorrect #14

Open zkf331 opened 5 months ago

zkf331 commented 5 months ago

I am attempting to perform W8A8 quantization using the int8FusedDequantizeCUDA operator, but the inference results are NaN. The code is as follows:

Modifications in qlinear.py:

qint_x = shared_input.qint_x                               # qint_x shape: [M, K]
int_weight = self.int_weight                               # int_weight shape: [N, K]
scale_row = shared_input.meta[None, 0::2].contiguous()     # scale_row shape: [1, M]
zero_row = shared_input.meta[None, 1::2].contiguous()      # zero_row shape: [1, M]
weights_scales = self.weights_scales.transpose(0, 1)       # weights_scales shape: [1, N]
reduced_w = self.reduced_w                                  # reduced_w shape: [1, N]

shift_value = 128.0   # offset applied to the activations during asymmetric int8 quantization
output = quik.asymmetric.int8FusedDequantize(
    qint_x,
    int_weight,
    scale_row,
    weights_scales,
    shift_value,
    zero_row,
    reduced_w,
    fp_result)   # fp_result: FP16 partial result from the full-precision (outlier) path
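
For reference, here is the plain-PyTorch computation I expect the fused call above to be equivalent to. This is only my reading of the kernel's semantics from the argument names and shapes (the reference_w8a8 helper below is mine, not part of QUIK), so please correct me if this is not what the operator computes:

import torch

def reference_w8a8(qint_x, int_weight, scale_row, weights_scales,
                   shift_value, zero_row, reduced_w, fp_result):
    # integer matmul done in fp32 for reference: [M, K] @ [K, N] -> [M, N]
    acc = qint_x.float() @ int_weight.float().t()
    # per-row activation scale [1, M] -> [M, 1], per-column weight scale [1, N]
    out = scale_row.float().t() * acc * weights_scales.float()
    # asymmetric correction: the +128 shift and the per-row zero point both
    # multiply the per-output-channel weight reduction reduced_w [1, N]
    out = out + (scale_row.float().t() * shift_value + zero_row.float().t()) * reduced_w.float()
    # finally add the FP16 partial result from the full-precision path
    return (out + fp_result.float()).to(fp_result.dtype)

If the fused output matches this on small random inputs, the problem is on my side; if not, the NaNs may come from the kernel.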

Is there an issue with the operator itself, or am I using it incorrectly? Could you please provide some suggestions? Thank you very much.