Giuseppe5 opened this issue 8 months ago (status: Open)
I've started working on making unit tests for `quantize_model`, and I had a few questions:

`quantize_model` seems to expect a certain model. E.g., in `ptq_evaluate`, `quantize_model` is called after either `preprocess_for_quantization` or `preprocess_for_flexml_quantize` is called, depending on the quantization backend used. Should I make tests with the understanding that the model should be pre-processed accordingly, or just test `quantize_model` as it is with any vanilla model? If they should only ever be called together, should those functions be packaged together?
The function doesn't work for quantizing Transformer models, because it quantizes the input tokens. Once those quantized (now float) tokens reach the embedding layer, it fails because of the non-integer tokens. This may be expected behavior if the function is built for quantizing CNN-based models, but I thought I would raise the issue. It is also a consequence of input quantization being enabled by default in `quantize_model` (quantization is enabled for bias, input, and weight, but not for output), which is contrary to what one of the tutorials says is the standard for layers:

> By default `weight_quant=Int8WeightPerTensorFloat`, while `bias_quant`, `input_quant` and `output_quant` are set to `None`. That means that by default weights are quantized to 8-bit signed integer with a per-tensor floating-point scale factor (a very common type of quantization adopted by e.g. the ONNX standard opset), while quantization of bias, input, and output are disabled. We can easily verify all of this at runtime on an example:

This may be desired/expected behavior for this function, but I wasn't sure.
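For reference, the kind of runtime check the tutorial excerpt refers to looks roughly like this (a minimal sketch on a stock `QuantConv2d`; it assumes the `is_*_quant_enabled` properties available in the Brevitas version used here):

```python
from brevitas.nn import QuantConv2d

# Stock QuantConv2d: weights are quantized by default, while bias/input/output
# quantization is disabled, matching the tutorial excerpt above.
conv = QuantConv2d(3, 16, kernel_size=3, padding=1)
print(conv.is_weight_quant_enabled)   # True  (Int8WeightPerTensorFloat by default)
print(conv.is_bias_quant_enabled)     # False
print(conv.is_input_quant_enabled)    # False
print(conv.is_output_quant_enabled)   # False
```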
I haven't yet found a case where the `weight_bit_width` or `act_bit_width` input variables to `quantize_model` have an impact on the model. I.e., if I set both equal to 4 with symmetric quantization and feed a strictly positive tensor into a QuantizedConv, then the input activation tensor is quantized to 128 values (i.e. int8 symmetric quantization with a strictly positive input), and the weight tensor is quantized to int8 values.
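To make that check concrete, here is a rough sketch of how one can count the quantization levels actually used (the `count_quant_levels` helper and the simulated tensor are purely illustrative, not part of Brevitas):

```python
import torch

def count_quant_levels(dequant_act: torch.Tensor) -> int:
    # Number of distinct values in a (de)quantized tensor = number of levels used.
    return dequant_act.unique().numel()

# Simulate an int8-symmetric-quantized, strictly positive activation: at most 128
# non-negative levels can appear. With 4-bit symmetric quantization it would be at most 8.
scale = 0.1
fake_act = torch.randint(0, 128, (1, 3, 16, 16)).float() * scale
print(count_quant_levels(fake_act))  # <= 128
```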
So yeah, I was wondering if this was all expected behavior. If so, I can add some appropriate documentation! If not, I can start working on "fixes".
> `quantize_model` seems to expect a certain model
The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280

These are not always needed and there are cases where they can be skipped, except perhaps for symbolic tracing, which is required by the FX quantization backend. Having them makes the quantization process easier. Depending on how you were planning to write the tests, maybe you can just apply symbolic tracing to obtain an FX graph and ignore all the others.
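For instance, the "symbolic trace only" route could look like this (a minimal sketch with a toy model, rather than anything from the examples):

```python
import torch
import torch.nn as nn

# Symbolic tracing alone turns an nn.Module into a torch.fx.GraphModule,
# which is the representation the FX quantization backend works on.
model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU())
fx_model = torch.fx.symbolic_trace(model)
assert isinstance(fx_model, torch.fx.GraphModule)
print(fx_model.graph)  # inspect the traced graph
```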
> If they should only ever be called together, should those functions be packaged together?
Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.
> The function doesn't work for quantizing Transformer models
That is expected. We have a separate entrypoint for LLM quantization, and we would like to unify the two at some point. To do that, we might first need tests to ensure we preserve all the correct functionality.
> I haven't yet found a case where `weight_bit_width` or `act_bit_width` input variables for `quantize_model` have an impact on the model.
Could you post an example?
> The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280
>
> These are not always needed and there are cases where they can be skipped, except perhaps for symbolic tracing, which is required by the FX quantization backend. Having them makes the quantization process easier. Depending on how you were planning to write the tests, maybe you can just apply symbolic tracing to obtain an FX graph and ignore all the others.
> If they should only ever be called together, should those functions be packaged together?
>
> Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.
Sounds good! I'll experiment a bit with the pre-processing, but will use symbolic trace as the default for now.
> The function doesn't work for quantizing Transformer models
>
> That is expected. We have a separate entrypoint for LLM quantization, and we would like to unify the two at some point. To do that, we might first need tests to ensure we preserve all the correct functionality.
That makes sense, in that case I'll delay my Transformer-based test until then.
> I haven't yet found a case where `weight_bit_width` or `act_bit_width` input variables for `quantize_model` have an impact on the model.
>
> Could you post an example?
I'm not entirely sure what the issue was: I was getting 8-bit quantization in every case in my larger example. I'm going to dig into it and see what my error was, and post a minimal example.
However, in the meantime, below is an example using the `fx` backend where all tests now pass successfully and as expected for arbitrary `weight_bit_width` and `act_bit_width` values:
```python
import pytest
from copy import deepcopy

import torch
import torch.nn as nn

from brevitas_examples.imagenet_classification.ptq.ptq_common import quantize_model
from brevitas.quant_tensor import QuantTensor

# CONSTANTS
IMAGE_DIM = 16


##################
# EXAMPLE MODELS #
##################
@pytest.fixture
def minimal_model():
    """
    Inputs:
        Implicitly takes in a torch.Tensor, size: (batch_size, 3, height, width).
    Outputs:
        Implicitly returns a torch.Tensor, size: (batch_size, 16, height, width).
    """
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
    )


# Unit tests
def test_quantize_model(minimal_model):
    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6

    prepared_model = torch.fx.symbolic_trace(minimal_model)
    quant_model = quantize_model(
        model=deepcopy(prepared_model),
        backend='fx',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int')

    # Assert it is a GraphModule
    assert isinstance(quant_model, torch.fx.graph_module.GraphModule)

    # Make sure we can feed data through the model
    _ = quant_model(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))

    # Get first layer for testing its quantization.
    # We also test we can feed data through the first layer and quant stub in isolation
    initial_quant = quant_model.get_submodule('input_1_quant')
    first_layer = quant_model.get_submodule('0')
    first_quant_input = initial_quant(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))
    first_layer_output = first_layer(first_quant_input)

    # Assert only weight and bias are quantized by default
    assert first_layer.is_weight_quant_enabled
    assert first_layer.is_bias_quant_enabled
    assert not first_layer.is_input_quant_enabled
    assert not first_layer.is_output_quant_enabled

    # Assert quantization bit widths are as desired
    # Bias
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weight
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width

    # Activation
    # Output of initial quant stub
    assert initial_quant.act_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
    assert isinstance(first_quant_input, QuantTensor)
    assert first_quant_input.bit_width.item() == act_bit_width
    # Output of Conv: output quantization is disabled, so we get a plain torch.Tensor back
    assert first_layer.output_quant._zero_hw_sentinel._buffers['value'].item() == 0
    assert not isinstance(first_layer_output, QuantTensor) and isinstance(first_layer_output, torch.Tensor)
```
So the issue I ran into, where the `weight_bit_width` and `act_bit_width` values didn't seem to be used, occurred when using the `layerwise` backend. Using the same `minimal_model` as above, the following test fails (in particular, the last two assertions):
```python
def test_layerwise_quantize_model(minimal_model):
    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6

    quant_model = quantize_model(
        model=deepcopy(minimal_model),
        backend='layerwise',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int')
    assert isinstance(quant_model, nn.Sequential)

    # Get first layer for testing its quantization.
    first_layer = quant_model.get_submodule('0')

    # Assert quantization bit widths are as desired
    # Biases
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weights
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width
    # Activations
    assert first_layer.input_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
```
I stepped through the model, and saw that the input activation and weight tensor were being quantized to 8 bits, not the desired 6 and 3 respectively. Is this expected behavior?
For layerwise quantization, there are special rules for the first and last layers. In particular, there are flags that specify the activation and weight bit widths of the first/last layers, since they tend to be more susceptible to lower precision.

The way we identify the first/last layers in that function is a bit hard-coded around the ImageNet examples: we check that the first layer has 3 input channels and that the last has 1000 output channels.

If you change the number of input channels in your conv, you should see a difference.
I found the function that does the 3/1000 input/output channel identification:

```python
def layerwise_bit_width_fn(module, base_bit_width, first_last_bit_width):
    if isinstance(module, torch.nn.Conv2d) and module.in_channels == 3:
        return first_last_bit_width
    elif isinstance(module, torch.nn.Linear) and module.out_features == 1000:
        return first_last_bit_width
    else:
        return base_bit_width
```
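As a quick sanity check of that rule, calling the function above directly behaves as described (the bit-width values here are just for illustration; the function is reused exactly as quoted):

```python
import torch
import torch.nn as nn

# A conv with 3 input channels is treated as the "first" layer and gets the
# first/last bit width; any other channel count falls back to the base bit width.
print(layerwise_bit_width_fn(nn.Conv2d(3, 16, kernel_size=3), 6, 8))  # 8: treated as first layer
print(layerwise_bit_width_fn(nn.Conv2d(4, 16, kernel_size=3), 6, 8))  # 6: base bit width
print(layerwise_bit_width_fn(nn.Linear(512, 1000), 6, 8))             # 8: treated as last layer
```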
I can confirm that changing the number of input channels produces the "normal" behavior. Is there any desire to un-hard-code the 3 vs. 1000 channel check? It seems a bit fragile, but I'd imagine one would need to use FX mode to get insight into which layer is actually first or last.

Alternatively, we could log a warning that we're falling back to the default first/last bit widths when one chooses `layerwise` quantization, as the first/last layers being treated differently may be unexpected behavior. On that note, does Brevitas have a logger that it uses? I can see that there's a logger defined in src/brevitas_examples/bnn_pynq/logger.py.
I've opened a PR with some preliminary tests, and I will be adding more (e.g. whatever tests you want!). The ones that are currently failing are cases where I give invalid inputs (e.g. zero-valued or negative-valued bit widths) and `quantize_model` does not throw an error.
I've done a bit of testing, and negative/zero bit widths are considered valid if the model isn't used for anything. E.g.
```python
quant_model = quantize_model(
    model=fx_model,
    backend='fx',
    weight_bit_width=0,  # NOTE: this is considered valid, which may be an issue
    act_bit_width=0,
    bias_bit_width=32,
    weight_quant_granularity='per_tensor',
    act_quant_percentile=99.9,
    act_quant_type='sym',
    scale_factor_type='float_scale',
    quant_format='int',
)
first_conv_layer = quant_model.get_submodule('0')
print(first_conv_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item())
# >> 0.0
```
If one feeds data through the model with zero bit widths, it outputs NaNs. However, if one feeds data through a model with negative bit widths, it still outputs values.

I'm digging into this a bit more because I'm curious about what's happening inside the model, but in either case I would imagine we should add some asserts to make sure all provided bit widths are positive integers. I opened a PR for it. I'm not sure if the integer constraint I added is desired behavior, or if I should add it to other functions as well.
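A minimal sketch of the kind of validation I have in mind (the `_validate_bit_width` helper is purely illustrative, not the actual Brevitas API):

```python
def _validate_bit_width(name: str, value) -> None:
    # Illustrative check: bit widths must be positive integers (bools are rejected too).
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")

_validate_bit_width('weight_bit_width', 3)   # passes silently
try:
    _validate_bit_width('act_bit_width', 0)  # rejected
except ValueError as e:
    print(e)
```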
Here we keep track of which parts of `quantize` in `ptq_common.py` are tested and which are still missing.