Giuseppe5 opened this issue 8 months ago (status: Open)
I've started working on making unit tests for `quantize_model`, and I had a few questions:

`quantize_model` seems to expect a certain model. E.g., in `ptq_evaluate`, `quantize_model` is called after either `preprocess_for_quantization` or `preprocess_for_flexml_quantize` is called, depending on the quantization backend used. Should I make tests with the understanding that the model should be pre-processed accordingly, or just test `quantize_model` as it is with any vanilla model? If they should only ever be called together, should those functions be packaged together?
The function doesn't work for quantizing Transformer models, because it quantizes the input tokens. Once those quantized (now float) tokens reach the embedding layer, it fails because of the non-integer tokens. This may be expected behavior if the function is built for quantizing CNN-based models, but I thought I would raise the issue. It is also a consequence of input quantization being enabled by default in `quantize_model` (quantization is enabled for bias, input, and weight, but not for output), which is contrary to what one of the tutorials says is the standard for layers:

> By default `weight_quant=Int8WeightPerTensorFloat`, while `bias_quant`, `input_quant` and `output_quant` are set to `None`. That means that by default weights are quantized to 8-bit signed integer with a per-tensor floating-point scale factor (a very common type of quantization adopted by e.g. the ONNX standard opset), while quantization of bias, input, and output are disabled. We can easily verify all of this at runtime on an example:

This may be desired/expected behavior for this function, but I wasn't sure.
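For reference, the kind of runtime check the tutorial excerpt refers to looks roughly like this (a minimal sketch on a stock `QuantConv2d`; it assumes the `is_*_quant_enabled` properties available in the Brevitas version used here):

```python
from brevitas.nn import QuantConv2d

# Stock QuantConv2d: weights are quantized by default, while bias/input/output
# quantization is disabled, matching the tutorial excerpt above.
conv = QuantConv2d(3, 16, kernel_size=3, padding=1)
print(conv.is_weight_quant_enabled)   # True  (Int8WeightPerTensorFloat by default)
print(conv.is_bias_quant_enabled)     # False
print(conv.is_input_quant_enabled)    # False
print(conv.is_output_quant_enabled)   # False
```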
I haven't yet found a case where the `weight_bit_width` or `act_bit_width` input variables to `quantize_model` have an impact on the model. I.e., if I set both equal to 4 with symmetric quantization and feed a strictly positive tensor into a QuantizedConv, then the input activation tensor is quantized to 128 values (i.e. int8 symmetric quantization with a strictly positive input), and the weight tensor is quantized to int8 values.
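To make that check concrete, here is a rough sketch of how one can count the quantization levels actually used (the `count_quant_levels` helper and the simulated tensor are purely illustrative, not part of Brevitas):

```python
import torch

def count_quant_levels(dequant_act: torch.Tensor) -> int:
    # Number of distinct values in a (de)quantized tensor = number of levels used.
    return dequant_act.unique().numel()

# Simulate an int8-symmetric-quantized, strictly positive activation: at most 128
# non-negative levels can appear. With 4-bit symmetric quantization it would be at most 8.
scale = 0.1
fake_act = torch.randint(0, 128, (1, 3, 16, 16)).float() * scale
print(count_quant_levels(fake_act))  # <= 128
```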
So yeah, I was wondering if this was all expected behavior. If so, I can add some appropriate documentation! If not, I can start working on "fixes".
> `quantize_model` seems to expect a certain model
The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280

These are not always needed and there are cases where they can be skipped, except perhaps for symbolic tracing, which is required by the FX quantization backend. Having them makes the quantization process easier. Depending on how you were planning to write the tests, maybe you can just apply symbolic tracing to obtain an FX graph and ignore all the others.
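For instance, the "symbolic trace only" route could look like this (a minimal sketch with a toy model, rather than anything from the examples):

```python
import torch
import torch.nn as nn

# Symbolic tracing alone turns an nn.Module into a torch.fx.GraphModule,
# which is the representation the FX quantization backend works on.
model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU())
fx_model = torch.fx.symbolic_trace(model)
assert isinstance(fx_model, torch.fx.GraphModule)
print(fx_model.graph)  # inspect the traced graph
```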
> If they should only ever be called together, should those functions be packaged together?
Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.
> The function doesn't work for quantizing Transformer models
That is expected. We have a separate entrypoint for LLM quantization, and we would like to unify the two at some point. To do that, we might first need tests to ensure we preserve all the correct functionality.
> I haven't yet found a case where `weight_bit_width` or `act_bit_width` input variables for `quantize_model` have an impact on the model.
Could you post an example?
> The parts of the pre-processing that might be needed are mostly the following: https://github.com/Xilinx/brevitas/blob/master/src/brevitas/graph/quantize.py#L275-L280
>
> These are not always needed and there are cases where they can be skipped, except perhaps for symbolic tracing, which is required by the FX quantization backend. Having them makes the quantization process easier. Depending on how you were planning to write the tests, maybe you can just apply symbolic tracing to obtain an FX graph and ignore all the others.
> If they should only ever be called together, should those functions be packaged together?
>
> Conceptually, they do very different things. They are coupled for the sake of these examples, but there are cases where those transformations should not be applied or are not relevant for the model in question.
Sounds good! I'll experiment a bit with the pre-processing, but will use symbolic trace as the default for now.
> The function doesn't work for quantizing Transformer models
>
> That is expected. We have a separate entrypoint for LLM quantization, and we would like to unify the two at some point. To do that, we might first need tests to ensure we preserve all the correct functionality.
That makes sense, in that case I'll delay my Transformer-based test until then.
> I haven't yet found a case where `weight_bit_width` or `act_bit_width` input variables for `quantize_model` have an impact on the model.
>
> Could you post an example?
I'm not entirely sure what the issue was: I was getting 8-bit quantization in every case in my larger example. I'm going to dig into it and see what my error was, and post a minimal example.
However, in the meantime, below is an example using the `fx` backend where all tests now pass successfully and as expected for arbitrary `weight_bit_width` and `act_bit_width` values:
```python
import pytest
from copy import deepcopy

import torch
import torch.nn as nn

from brevitas_examples.imagenet_classification.ptq.ptq_common import quantize_model
from brevitas.quant_tensor import QuantTensor

# CONSTANTS
IMAGE_DIM = 16


##################
# EXAMPLE MODELS #
##################
@pytest.fixture
def minimal_model():
    """
    Inputs:
        Implicitly takes in a torch.Tensor, size: (batch_size, 3, height, width).
    Outputs:
        Implicitly returns a torch.Tensor, size: (batch_size, 16, height, width).
    """
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
    )


# Unit tests
def test_quantize_model(minimal_model):
    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6

    prepared_model = torch.fx.symbolic_trace(minimal_model)
    quant_model = quantize_model(
        model=deepcopy(prepared_model),
        backend='fx',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int')

    # Assert it is a GraphModule
    assert isinstance(quant_model, torch.fx.graph_module.GraphModule)

    # Make sure we can feed data through the model
    _ = quant_model(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))

    # Get first layer for testing its quantization.
    # We also test we can feed data through the first layer and quant stub in isolation
    initial_quant = quant_model.get_submodule('input_1_quant')
    first_layer = quant_model.get_submodule('0')
    first_quant_input = initial_quant(torch.rand(1, 3, IMAGE_DIM, IMAGE_DIM))
    first_layer_output = first_layer(first_quant_input)

    # Assert only weight and bias are quantized by default
    assert first_layer.is_weight_quant_enabled
    assert first_layer.is_bias_quant_enabled
    assert not first_layer.is_input_quant_enabled
    assert not first_layer.is_output_quant_enabled

    # Assert quantization bit widths are as desired
    # Bias
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weight
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width

    # Activation
    # Output of initial quant stub
    assert initial_quant.act_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
    assert isinstance(first_quant_input, QuantTensor)
    assert first_quant_input.bit_width.item() == act_bit_width
    # Output of Conv: output quantization is disabled, so we get a plain torch.Tensor back
    assert first_layer.output_quant._zero_hw_sentinel._buffers['value'].item() == 0
    assert not isinstance(first_layer_output, QuantTensor) and isinstance(first_layer_output, torch.Tensor)
```
So the issue I ran into, where the `weight_bit_width` and `act_bit_width` values didn't seem to be used, occurred when using the `layerwise` backend. Using the same `minimal_model` as above, the following test fails (in particular, the last two assertions):
```python
def test_layerwise_quantize_model(minimal_model):
    # Tested parameters
    weight_bit_width = 3
    bias_bit_width = 16
    act_bit_width = 6

    quant_model = quantize_model(
        model=deepcopy(minimal_model),
        backend='layerwise',
        weight_bit_width=weight_bit_width,
        act_bit_width=act_bit_width,
        bias_bit_width=bias_bit_width,
        weight_quant_granularity='per_tensor',
        act_quant_percentile=99.9,
        act_quant_type='sym',
        scale_factor_type='float_scale',
        quant_format='int')
    assert isinstance(quant_model, nn.Sequential)

    # Get first layer for testing its quantization.
    first_layer = quant_model.get_submodule('0')

    # Assert quantization bit widths are as desired
    # Biases
    assert first_layer.bias_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == bias_bit_width
    # Weights
    assert first_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == weight_bit_width
    # Activations
    assert first_layer.input_quant.fused_activation_quant_proxy.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item() == act_bit_width
```
I stepped through the model, and saw that the input activation and weight tensor were being quantized to 8 bits, not the desired 6 and 3 respectively. Is this expected behavior?
For layerwise quantization, there are special rules for the first and last layers. In particular, there are flags that specify the activation and weight bit widths of the first/last layers, since they tend to be more susceptible to lower precision.

The way we identify the first/last layers in that function is a bit hard-coded around the ImageNet examples: we check that the first layer has 3 input channels and that the last has 1000 output channels.

If you change the number of input channels in your conv, you should see a difference.
I found the function that does the 3/1000 input/output channel identification:

```python
def layerwise_bit_width_fn(module, base_bit_width, first_last_bit_width):
    if isinstance(module, torch.nn.Conv2d) and module.in_channels == 3:
        return first_last_bit_width
    elif isinstance(module, torch.nn.Linear) and module.out_features == 1000:
        return first_last_bit_width
    else:
        return base_bit_width
```
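As a quick sanity check of that rule, calling the function above directly behaves as described (the bit-width values here are just for illustration; the function is reused exactly as quoted):

```python
import torch
import torch.nn as nn

# A conv with 3 input channels is treated as the "first" layer and gets the
# first/last bit width; any other channel count falls back to the base bit width.
print(layerwise_bit_width_fn(nn.Conv2d(3, 16, kernel_size=3), 6, 8))  # 8: treated as first layer
print(layerwise_bit_width_fn(nn.Conv2d(4, 16, kernel_size=3), 6, 8))  # 6: base bit width
print(layerwise_bit_width_fn(nn.Linear(512, 1000), 6, 8))             # 8: treated as last layer
```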
I can confirm that changing the number of input channels produces the "normal" behavior. Is there any desire to un-hard-code the 3 vs. 1000 channel check? It seems a bit fragile, but I'd imagine one would need to use FX mode to get insight into which layer is actually first or last.

Alternatively, we could log a warning that we're falling back to the default first/last bit widths when one chooses `layerwise` quantization, as the first/last layers being treated differently may be unexpected behavior. On that note, does Brevitas have a logger that it uses? I can see that there's a logger defined in src/brevitas_examples/bnn_pynq/logger.py.
I've opened a PR with some preliminary tests, and I will be adding more (e.g. whatever tests you want!). The ones that are currently failing are cases where I give invalid inputs (e.g. zero-valued or negative-valued bit widths) and `quantize_model` does not throw an error.
I've done a bit of testing, and negative/zero bit widths are considered valid if the model isn't used for anything. E.g.
```python
quant_model = quantize_model(
    model=fx_model,
    backend='fx',
    weight_bit_width=0,  # NOTE: this is considered valid, which may be an issue
    act_bit_width=0,
    bias_bit_width=32,
    weight_quant_granularity='per_tensor',
    act_quant_percentile=99.9,
    act_quant_type='sym',
    scale_factor_type='float_scale',
    quant_format='int',
)
first_conv_layer = quant_model.get_submodule('0')
print(first_conv_layer.weight_quant.tensor_quant.msb_clamp_bit_width_impl.bit_width._buffers['value'].item())
# >> 0.0
```
If one feeds data through the model with zero bit widths, it outputs NaNs. However, if one feeds data through a model with negative bit widths, it still outputs values.

I'm digging into this a bit more because I'm curious about what's happening inside the model, but in either case I would imagine we should add some asserts to make sure all provided bit widths are positive integers. I opened a PR for it. I'm not sure if the integer constraint I added is desired behavior, or if I should add it to other functions as well.
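A minimal sketch of the kind of validation I have in mind (the `_validate_bit_width` helper is purely illustrative, not the actual Brevitas API):

```python
def _validate_bit_width(name: str, value) -> None:
    # Illustrative check: bit widths must be positive integers (bools are rejected too).
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value!r}")

_validate_bit_width('weight_bit_width', 3)   # passes silently
try:
    _validate_bit_width('act_bit_width', 0)  # rejected
except ValueError as e:
    print(e)
```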
Here we keep track of which parts of `quantize` in `ptq_common.py` are tested and which are still missing.