clevercool / ANT-Quantization

Did you use the Fake Quantization in Olive? #8

Closed · MaxwellWjj closed this issue 1 year ago

MaxwellWjj commented 1 year ago

Thanks for open-sourcing the OliVe and ANT frameworks! I have a question about the quantization process: did you use fake quantization? I checked the code and found that, before computation, you dequantize all the inputs and weights back to FP32 precision, similar to this plot (ref: I-BERT, Figure 1):

[image: quantization flow from I-BERT, Figure 1]

The source code is in quant_modules.py:

def _forward(self, data, display=False):
        # Per-tensor (or per-channel) scale maps FP32 data into the quant_grid range.
        scale = self.alpha / torch.max(self.quant_grid)

        if self.is_perchannel:
            data = (data.view(data.shape[0], -1) / scale).view(data.shape)
        else:
            data = data / scale

        if not self.args.no_outlier:
            quant_grid = torch.cat((self.quant_grid, self.outliers), dim=0)
        else:
            quant_grid = self.quant_grid

        # Snap each element to the nearest representable value in quant_grid.
        quant_data = QuantBase.forward(data, quant_grid)
        shape = data.shape

        # Outlier Victim Pair Encoding: zero the adjacent "victim" of each outlier
        # so the pair can be encoded together.
        if not self.args.no_outlier:
            quant_data = quant_data.view(-1)
            mask = quant_data.abs() > 32
            victim_odd = torch.roll(mask, 1, -1)
            victim_odd[::2] = 0
            victim_even = torch.roll(mask & (~victim_odd), -1, -1)
            victim_even[1::2] = 0
            victim = victim_even | victim_odd
            quant_data = quant_data * (~victim)

        quant_data = quant_data.view(shape)
        # Straight-through estimator: forward pass uses quant_data, gradients flow to data.
        tensor = (quant_data - data).detach() + data

        # Dequantize: multiply by the scale again, returning an FP32 tensor.
        if self.is_perchannel:
            tensor = (tensor.view(tensor.shape[0], -1) * scale).view(data.shape)
        else:
            tensor = tensor * scale

        return tensor
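
As an aside, the roll/mask logic above is easier to follow on a tiny example. The trace below is only my illustration of the quoted indexing on made-up values, not code from the repo:

    import torch

    # Toy quantized codes; values with |v| > 32 are treated as outliers.
    q = torch.tensor([3., 40., -1., 2., -50., 7.])
    mask = q.abs() > 32                        # outlier flags: [F, T, F, F, T, F]

    # An outlier at an even index claims the odd slot to its right as its victim.
    victim_odd = torch.roll(mask, 1, -1)
    victim_odd[::2] = 0

    # An outlier at an odd index claims the even slot to its left as its victim.
    victim_even = torch.roll(mask & (~victim_odd), -1, -1)
    victim_even[1::2] = 0

    victim = victim_even | victim_odd          # [T, F, F, F, F, T]
    print(q * (~victim))                       # victims zeroed: [0., 40., -1., 2., -50., 0.]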

My understanding is that the input tensor is in FP32; you divide the tensor by the scale to fit the quant_grid range, then quantize it by looking up the nearest value in quant_grid. But at the end of this quantizer's forward function, you multiply the tensor by the scale again, which means everything is dequantized back to FP32 for the subsequent computation.
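
In other words, the whole forward pass boils down to a quantize-dequantize round trip. Here is a minimal sketch of that pattern (my own simplification with a generic nearest-value lookup, not the repo's exact implementation):

    import torch

    def fake_quantize(data, quant_grid, scale):
        # Scale the FP32 tensor into the range covered by quant_grid.
        scaled = data / scale
        # Snap each element to the nearest representable grid value (lookup).
        idx = (scaled.unsqueeze(-1) - quant_grid).abs().argmin(dim=-1)
        quantized = quant_grid[idx]
        # Dequantize: multiply back by the scale, so the output is FP32 again.
        return quantized * scale

    # Example with a symmetric 4-bit-style integer grid.
    quant_grid = torch.arange(-7., 8.)
    x = torch.randn(2, 3) * 4
    print(fake_quantize(x, quant_grid, scale=4.0 / 7))  # still an FP32 tensor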

If so, I think the software code is inconsistent with the hardware design, because you claim the accelerator operates on Exp4+Int4, not FP32.

[image: accelerator design based on Exp4+Int4]

In other words, take the class Conv2dQuantizer as an example (line 389, quant_modules.py):

    def forward(self, input):
        # Both quantizers return dequantized FP32 tensors (fake quantization).
        weight = self.quant_weight(self.weight, input)
        input = self.quant_input(input, self.weight)
        # So the convolution itself is computed in FP32.
        return self._conv_forward(input, weight)

The weight and input should be consistent with your real values in Table 4:

[image: Table 4 (real values)]

unless the software code is not intended to match the hardware design exactly... But what I found in your code is that everything is dequantized back to FP32 (the original tensor format) before computing.

So, I am wondering whether all of your accuracy results are based on dequantization (fake quantization) rather than real low-precision hardware.

Sakits commented 1 year ago

Hi @MaxwellWjj,

Thanks for your interest in our work! Yes, our results are based on fake quantization. As in Q8BERT and Outlier Suppression, it is common practice to use fake quantization for accuracy evaluation, and we can assure you that the multiplication results after these operations meet the requirements of the ANT and OliVe frameworks.

To run real quantization on a GPU, one could write corresponding CUDA kernels for simulation or even acceleration. If such a kernel is implemented correctly, the model's accuracy should not differ significantly from that obtained with fake quantization.
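
To see why the two paths agree, here is a toy sketch with made-up symmetric per-tensor scales (an illustration, not code from the repo): the fake-quantized FP32 product is just the low-precision product with the scales factored out, so a correct low-precision kernel reproduces the same values up to floating-point rounding.

    import torch

    torch.manual_seed(0)
    # Codes a real kernel would store (e.g., signed 4-bit integers) and their scales.
    q_w = torch.randint(-7, 8, (16, 32)).float()
    q_x = torch.randint(-7, 8, (32, 8)).float()
    s_w, s_x = 0.05, 0.1

    fake = (q_w * s_w) @ (q_x * s_x)   # fake quantization: matmul on dequantized FP32 tensors
    real = (q_w @ q_x) * (s_w * s_x)   # "real" path: low-precision matmul, dequantize the accumulator once

    print(torch.allclose(fake, real, atol=1e-5))  # True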

MaxwellWjj commented 1 year ago

Hi @Sakits,

Thank you very much for your answer! I hope your team can extend the framework in the future to support a real Flint & Abfloat implementation that matches this accuracy, without fake quantization (since, as we know, fake quantization does not match the real "simple" hardware, and we would still need FP32 units to achieve this PTQ performance).

The open-source code helped me a lot. Thanks again :)