Xilinx / brevitas

Brevitas: neural network quantization in PyTorch
https://xilinx.github.io/brevitas/

Adding tests for "quantize" function for CNN PTQ #908

Open Giuseppe5 opened 8 months ago

Giuseppe5 commented 8 months ago

Here we keep track of which parts of quantize in ptq_common.py are tested and which are still missing.

OscarSavolainenDR commented 8 months ago

I've started working on making unit tests for quantize_model, and I had a few questions:

- quantize_model seems to expect a certain model. If the pre-processing functions should only ever be called together, should they be packaged together?
- The function doesn't work for quantizing Transformer models.
- I haven't yet found a case where the weight_bit_width or act_bit_width input variables for quantize_model have an impact on the model.

So I was wondering if this is all expected behavior. If so, I can add some appropriate documentation! If not, I can start working on "fixes".

Giuseppe5 commented 8 months ago

quantize_model seems to expect a certain model

The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280

These are not always needed, and there are cases where they can be skipped, except perhaps for symbolic trace, which is required with the FX quantization backend. Having them makes the quantization process easier. Depending on how you plan to write the tests, you could just apply symbolic trace to obtain an FX graph and ignore all the others.

If they should only ever be called together, should those functions be packaged together?

Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.

The function doesn't work for quantizing Transformer models

That is expected. We have a separate entrypoint for LLM quantization, and we would like to unify the two at some point. To do that, we first need tests to ensure we preserve all the existing functionality.

I haven't yet found a case where weight_bit_width or act_bit_width input variables for quantize_model have an impact on the model.

Could you post an example?

OscarSavolainenDR commented 8 months ago

The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280

These are not always needed, and there are cases where they can be skipped, except perhaps for symbolic trace, which is required with the FX quantization backend. Having them makes the quantization process easier. Depending on how you plan to write the tests, you could just apply symbolic trace to obtain an FX graph and ignore all the others.

If they should only ever be called together, should those functions be packaged together?

Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.

Sounds good! I'll experiment a bit with the pre-processing, but will use symbolic trace as the default for now.
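Concretely, the default pre-processing I have in mind is just the symbolic trace, skipping the other steps for now (a minimal sketch, using the same toy model as in the test below):

import torch
import torch.nn as nn

# Only pre-processing step for now: symbolic trace to obtain the FX graph
# required by the 'fx' quantization backend.
model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU())
prepared_model = torch.fx.symbolic_trace(model)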

The function doesn't work for quantizing Transformer models

That is expected. We have a separate entrypoint for LLM quantization, and we would like to unify the two at some point. To do that, we first need tests to ensure we preserve all the existing functionality.

That makes sense, in that case I'll delay my Transformer-based test until then.

I haven't yet found a case where weight_bit_width or act_bit_width input variables for quantize_model have an impact on the model.

Could you post an example?

I'm not entirely sure what the issue was: I was getting 8-bit quantization in every case in my larger example. I'm going to dig into it and see what my error was, and post a minimal example.

However, in the meantime, below is an example using the fx backend where all tests now pass as expected for arbitrary weight_bit_width and act_bit_width values:


import pytest
from copy import deepcopy
import torch
import torch.nn as nn
from brevitas_examples.imagenet_classification.ptq.ptq_common import quantize_model
from brevitas.quant_tensor import QuantTensor

# CONSTANTS
IMAGE_DIM = 16

##################
# EXAMPLE MODELS #
##################
@pytest.fixture
def minimal_model():
    """
    Inputs:
    Implicitly takes in a torch.Tensor, size: (batch_size, 3, height, width).

    Outputs:
    Implicitly returns a torch.Tensor, size: (batch_size, 16, height, width).

    """
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
    )

# Unit tests
def test_quantize_model(minimal_model):

    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6

    prepared_model = torch.fx.symbolic_trace(minimal_model)
    quant_model = quantize_model(
        model=deepcopy(prepared_model),
        backend='fx',
        weight_bit_width=weight_bit_width, 
        act_bit_width=act_bit_width, 
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int'
    )
    # Assert it is a GraphModule
    assert isinstance(quant_model, torch.fx.graph_module.GraphModule)

    # Make sure we can feed data through the model
    _ = quant_model(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))

    # Get first layer for testing its quantization.
    # We also test we can feed data through the first layer and quant stub in isolation
    initial_quant = quant_model.get_submodule('input_1_quant')
    first_layer = quant_model.get_submodule('0')
    first_quant_input = initial_quant(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))
    first_layer_output = first_layer(first_quant_input)

    # Assert only weight and bias are quantized by default
    assert first_layer.is_weight_quant_enabled
    assert first_layer.is_bias_quant_enabled
    assert not first_layer.is_input_quant_enabled
    assert not first_layer.is_output_quant_enabled

    # Assert quantization bit widths are as desired
    # Bias
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weight
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width
    # Activation
    # Output of initial quant stub
    assert initial_quant.act_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
    assert isinstance(first_quant_input, QuantTensor)
    assert first_quant_input.bit_width.item() == act_bit_width

    # Output of Conv
    assert first_layer.output_quant._zero_hw_sentinel._buffers['value'].item() == 0  # quantization of the conv output is disabled
    assert not isinstance(first_layer_output, QuantTensor) and isinstance(first_layer_output, torch.Tensor)

OscarSavolainenDR commented 8 months ago

So the issue I ran into, where the weight_bit_width and act_bit_width values didn't seem to be used, occurred when using the layerwise backend.

Using the same minimal_model as above, the following test fails (in particular, the last two assertions):

def test_layerwise_quantize_model(minimal_model):

    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6

    quant_model = quantize_model(
        model=deepcopy(minimal_model),
        backend='layerwise',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int'
    )
    assert isinstance(quant_model, nn.Sequential)

    # Get first layer for testing its quantization.
    first_layer = quant_model.get_submodule('0')

    # Assert quantization bit widths are as desired
    # Biases
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weights
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width
    # Activations
    assert first_layer.input_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width

I stepped through the model, and saw that the input activation and weight tensor were being quantized to 8 bits, not the desired 6 and 3 respectively. Is this expected behavior?
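For reference, this is how I inspected the effective bit widths while stepping through the layerwise-quantized model (same attribute paths as the assertions above); both report 8 rather than the requested 3 and 6:

first_layer = quant_model.get_submodule('0')
# Weight bit width: prints 8.0 instead of the requested 3
print(first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item())
# Input activation bit width: prints 8.0 instead of the requested 6
print(first_layer.input_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item())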

Giuseppe5 commented 8 months ago

For layerwise quantization, there are special rules for the first and last layers. In particular, there are flags that specify the activation and weight bit widths of the first/last layers, since they tend to be more sensitive to lower precision.

The way we identify the first/last layers in that function is a bit hard-coded around the ImageNet examples: we check that the first layer has 3 input channels, and that the last layer has 1000 output channels.

If you change the number of input channels in your conv, you should see a difference.

OscarSavolainenDR commented 8 months ago

I found the function that does the 3/1000 input/output channel identification:

    def layerwise_bit_width_fn(module, base_bit_width, first_last_bit_width):
        if isinstance(module, torch.nn.Conv2d) and module.in_channels == 3:
            return first_last_bit_width
        elif isinstance(module, torch.nn.Linear) and module.out_features == 1000:
            return first_last_bit_width
        else:
            return base_bit_width

I can confirm that changing the number of input channels gives the "normal" behavior (see the fixture sketch below). Is there any desire to un-hard-code the 3 vs 1000 channels check? It seems a bit fragile, but I imagine one would need to use FX mode to get insight into which layers are actually first and last.
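For example, a variant of the fixture with 4 input channels instead of 3 (hypothetical name minimal_model_4ch), so the first conv is no longer matched by the in_channels == 3 check and the requested bit widths are applied:

@pytest.fixture
def minimal_model_4ch():
    """Same as minimal_model, but with 4 input channels so the first conv
    is not treated as a 'first layer' by the layerwise bit width rule."""
    return nn.Sequential(
        nn.Conv2d(4, 16, kernel_size=3, padding=1),
        nn.ReLU(),
    )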

Alternatively, we could add a logged warning that we're falling back to the first/last-layer bit widths when layerwise quantization is chosen, since the first/last layers being treated differently may be unexpected behavior (rough sketch below). On that note, does Brevitas have a logger that it uses? I can see that there's a logger defined in src/brevitas_examples/bnn_pynq/logger.py.
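The kind of warning I have in mind, sketched with Python's standard warnings module for illustration (not an existing Brevitas API; assume this sits inside quantize_model, where backend is in scope):

import warnings

# Hypothetical sketch: make the first/last-layer override visible to the user.
if backend == 'layerwise':
    warnings.warn(
        'Layerwise backend: the first and last layers use their own default '
        'bit widths, which may differ from weight_bit_width/act_bit_width.')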

I've opened up a PR with some preliminary tests, and I will be adding more (e.g. whatever tests you want!). The ones currently failing are cases where I pass invalid inputs (e.g. zero-valued or negative-valued bit widths) and quantize_model does not raise an error.

OscarSavolainenDR commented 7 months ago

I've done a bit of testing, and negative/zero bit widths are accepted as valid as long as the model is never actually used. E.g.

quant_model = quantize_model(
    model=fx_model,
    backend='fx',
    weight_bit_width=0,  # NOTE: this is considered valid, which may be an issue
    act_bit_width=0,
    bias_bit_width=32,
    weight_quant_granularity='per_tensor',
    act_quant_percentile=99.9,
    act_quant_type='sym',
    scale_factor_type='float_scale',
    quant_format='int',
)
first_conv_layer = quant_model.get_submodule('0')
print(first_conv_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item())
>> 0.0

If one feeds data through the model with zero bit widths, it outputs NaNs. However, if one feeds data through a model with negative bit widths, it still outputs values.

I'm digging into this a bit more because I'm curious about what's happening inside the model, but in either case I imagine we should add some asserts to make sure all provided bit widths are positive integers (see the sketch below). I opened a PR for it. I'm not sure if the integer constraint I added is desired behavior, or if I should add it to other functions as well.
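A minimal sketch of the kind of check I mean (hypothetical helper, not the exact code in the PR):

def validate_bit_width(value, name):
    """Reject non-positive or non-integer bit widths before quantization starts."""
    if not float(value).is_integer():
        raise ValueError(f'{name} must be an integer, got {value}')
    if value <= 0:
        raise ValueError(f'{name} must be positive, got {value}')

# e.g. at the top of quantize_model:
# validate_bit_width(weight_bit_width, 'weight_bit_width')
# validate_bit_width(act_bit_width, 'act_bit_width')
# validate_bit_width(bias_bit_width, 'bias_bit_width')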