Xilinx / brevitas

Brevitas: neural network quantization in PyTorch
https://xilinx.github.io/brevitas/

AttributeError: 'NoneType' object has no attribute 'bit_width' #235

Closed. Duchstf closed this issue 3 years ago.

Duchstf commented 3 years ago

I am trying to extract bias bit width from a QuantLinear layer. However, I'm encountering this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-8-9b7c8f837ee6> in <module>
      3         print("######-------START LAYER-------#######")
      4         print("Layer's name: ", name)
----> 5         print("Bias bit width: ", layer.quant_bias_bit_width())
      6         print("Quant Weight Scale: ", float(layer.quant_weight_scale()))
      7         print("Fractional bit width: ", np.log2(float(layer.quant_weight_scale())))

~/.local/lib/python3.6/site-packages/brevitas/nn/mixin/parameter.py in quant_bias_bit_width(self)
    187         if self.bias is None:
    188             return None
--> 189         return self._cached_bias.bit_width
    190 

AttributeError: 'NoneType' object has no attribute 'bit_width'

This is how the layer was defined:

self.fc1   = QuantLinear(16, 64, bias=True, weight_bit_width=6, bias_bit_width=6,
                                weight_restrict_scaling_type = 'power_of_two',
                                bias_restrict_scaling_type = 'power_of_two')

Calling quant_weight_bit_width() is still working for me, so I'm not really sure where this problem comes from ...

volcacius commented 3 years ago

Hello,

There are a couple of things going on. In general, bias quantization is more complicated to handle than weight quantization, because different target platforms make different assumptions about where the scale and bit width come from. Because of that, if you look at the default values of QuantLinear, you can see that by default bias_quant keeps the bias in floating point, i.e. bias quantization is disabled. The two keyword arguments you are passing (bias_bit_width and bias_restrict_scaling_type) are not enough to form a quantizer on their own, so they are simply ignored.

To fix this, first set a predefined integer quantizer as the bias quantizer (in your case I'd suggest the one with internal scale and bit width, to avoid dependencies on previous layers), and then pass additional keyword arguments to adjust it to your requirements. Keyword arguments override the corresponding pre-existing values in the injected quantizer, so for example if Int8BiasPerTensorFloatInternalScaling has bit_width = 8 but you pass bias_bit_width=6, the resulting bit_width will be 6.

from brevitas.nn import QuantLinear
from brevitas.inject.defaults import Int8BiasPerTensorFloatInternalScaling

self.fc1 = QuantLinear(16, 64, bias=True,
                       weight_bit_width=6,
                       weight_restrict_scaling_type='power_of_two',
                       bias_quant=Int8BiasPerTensorFloatInternalScaling,
                       bias_bit_width=6,
                       bias_restrict_scaling_type='power_of_two')

The retrieval of scale and bit_width is also complicated by the fact that there are a lot of possible ways to define them. I just pushed a fix to master that should make it straightforward for the scenario above, so simply invoking self.fc1.quant_bias_scale() and self.fc1.quant_bias_bit_width() should work. Previously, you would have had to enable caching of the values and run an inference pass first.
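
In other words, something along the lines of your original loop should now work on master. A minimal sketch, assuming model is an instance of your module:

import torch
from brevitas.nn import QuantLinear

# With the fix on master, no caching or inference pass is needed first
for name, layer in model.named_modules():
    if isinstance(layer, QuantLinear):
        print(name, "bias scale:", float(layer.quant_bias_scale()),
              "bias bit width:", int(layer.quant_bias_bit_width()))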

Let me know if it works for you.

Alessandro

Duchstf commented 3 years ago

@volcacius Thank you so much, it's working now!!

On a related note, as you know I'm still trying to port a Brevitas model into hls4ml. So I'm wondering if you have a recommendation for a simple MLP example (one from which it would be very straightforward to extract the total bit width/fractional bits) to target first. Then I can take baby steps to generalize from there.

For your reference, this is the kind of toy 3-layer MLP model I'm creating for myself (based on your suggestions so far):

from torch.nn import Module

from brevitas.nn import QuantIdentity, QuantReLU, QuantLinear
from brevitas.inject.defaults import Int8BiasPerTensorFloatInternalScaling
from brevitas.core.quant import QuantType

class QDucNet(Module):
    def __init__(self):
        super(QDucNet, self).__init__()
        # Model with <16,64,32,32,5> Behavior
        #self.quant_inp = QuantIdentity(bit_width=8)
        self.fc1   = QuantLinear(16, 64, bias=True, 
                                 weight_bit_width=6,
                                 weight_restrict_scaling_type='power_of_two', 
                                 bias_quant=Int8BiasPerTensorFloatInternalScaling,
                                 bias_bit_width=6,
                                 bias_restrict_scaling_type='power_of_two')

        self.relu1 = QuantReLU(bit_width=8, act_restrict_scaling_type = 'power_of_two')

        self.fc2   = QuantLinear(64, 32, bias=True, 
                                 weight_bit_width=6,
                                 weight_restrict_scaling_type='power_of_two', 
                                 bias_quant=Int8BiasPerTensorFloatInternalScaling,
                                 bias_bit_width=6,
                                 bias_restrict_scaling_type='power_of_two')

        self.relu2 = QuantReLU(bit_width=8, act_restrict_scaling_type = 'power_of_two')

        self.fc3   = QuantLinear(32, 32, bias=True, 
                                 weight_bit_width=6,
                                 weight_restrict_scaling_type='power_of_two', 
                                 bias_quant=Int8BiasPerTensorFloatInternalScaling,
                                 bias_bit_width=6,
                                 bias_restrict_scaling_type='power_of_two')

        self.relu3 = QuantReLU(bit_width=8, act_restrict_scaling_type = 'power_of_two')

        self.fc4   = QuantLinear(32, 5, bias=True, 
                                 weight_bit_width=6,
                                 weight_restrict_scaling_type='power_of_two', 
                                 bias_quant=Int8BiasPerTensorFloatInternalScaling,
                                 bias_bit_width=6,
                                 bias_restrict_scaling_type='power_of_two')

        self.relu4 = QuantReLU(bit_width=8, act_restrict_scaling_type = 'power_of_two')
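
The forward pass (omitted above) would just chain these in order, something like:

    def forward(self, x):
        # Chain the quantized linear layers and activations in order
        x = self.relu1(self.fc1(x))
        x = self.relu2(self.fc2(x))
        x = self.relu3(self.fc3(x))
        x = self.relu4(self.fc4(x))
        return x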

volcacius commented 3 years ago

Hi Duc,

The MLP you have there looks fine. There is only a small correction to make: quantized activation kwargs don't require a prefix, so it's just restrict_scaling_type, not act_restrict_scaling_type. Unfortunately I still don't have a way to intercept kwargs that are mistyped and left unused. Other than that, there are extra steps you could take to make the export easier, but it all depends on what the hls4ml requirements are and how you are implementing the export. If you have an example network + export flow from a different framework (I guess QKeras?) I can tell you how you could achieve the same thing in Brevitas.
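
Concretely, the corrected activation definition would look like this:

from brevitas.nn import QuantReLU

# Note: no 'act_' prefix on activation kwargs
self.relu1 = QuantReLU(bit_width=8, restrict_scaling_type='power_of_two')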

Alessandro

Duchstf commented 3 years ago

Hello,

Thank you for the inputs!

I don't really have a clean example of the QKeras export right now. But QKeras models are typically saved as JSON files (for the architecture) and HDF5 files (for the weights). This makes them (and Keras models in general) very easy to handle, since you can essentially load the model independently of any of Keras's saving/loading mechanisms.

I think PyTorch is a bit more complicated. For one thing, checkpoints usually just store the module's parameters, so you can't really figure out the architecture just by looking at the saved .pth files. Also, to load a PyTorch model you typically have to define the model class yourself.
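
For instance, with the usual state_dict flow (just to illustrate the point; the path is a placeholder):

import torch

# Saving stores only the parameters, not the architecture
torch.save(model.state_dict(), 'model.pth')

# Loading requires the model class (e.g. the QDucNet above) to be defined in Python
model = QDucNet()
model.load_state_dict(torch.load('model.pth'))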

Therefore, my current idea is to do the conversion directly from the Python API, i.e. the user would define and load their own model and pass the whole model into hls4ml.

After that, hls4ml would loop through each module using model.named_modules() and extract the relevant parameters. This is why I have been trying to access the quantization information through the layer's class. The idea is to compute the total bit widths and integer bits to use in each layer's ap_fixed<{},{}> precision config. Here is an example of how it's done in QKeras.
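
To make that concrete, here is a rough sketch of the kind of loop I have in mind, assuming power-of-two scales and the quant_* accessors from above (the exact ap_fixed mapping is something I still need to pin down):

import numpy as np
from brevitas.nn import QuantLinear

def extract_ap_fixed(model):
    """Derive an ap_fixed<total, integer> string per QuantLinear layer."""
    precisions = {}
    for name, module in model.named_modules():
        if isinstance(module, QuantLinear):
            total_bits = int(module.quant_weight_bit_width())
            scale = float(module.quant_weight_scale())
            # With power-of-two scaling, scale = 2**(-frac_bits)
            frac_bits = int(round(-np.log2(scale)))
            # Assuming the HLS convention where the integer width includes the sign bit
            int_bits = total_bits - frac_bits
            precisions[name] = 'ap_fixed<{},{}>'.format(total_bits, int_bits)
    return precisions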

I hope that makes sense, and let me know if you have any suggestions!

Thank you,

Duc.

volcacius commented 3 years ago

Hi Duc,

Yes, that makes sense. We struggled a lot with this too in the past, and for us the solution was to first export the model from PyTorch to ONNX (with custom layers and annotations to account for quantization) and then work from there. That way it's much more similar to how you are describing the QKeras flow. You don't even need to go all the way to ONNX if you don't want to: PyTorch's ONNX export flow is based on the graph representation introduced in PyTorch 1.0 (together with the just-in-time compiler), so you could simply work off the PyTorch graph. This is what happens, for example, when going from PyTorch to TVM, or when visualizing a PyTorch model in TensorBoard.
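
Just to illustrate what working off the PyTorch graph looks like (plain TorchScript tracing, nothing Brevitas-specific; the input shape matches the toy MLP above):

import torch

model.eval()
example_input = torch.randn(1, 16)
# Tracing records the graph representation that the ONNX exporter also builds on
traced = torch.jit.trace(model, example_input)
print(traced.graph)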

What we do to expose quantization information to graph mode is to go through an alternative forward pass with symbolic operators (see for example) that captures the required data in a very straightforward manner, so that it shows up directly in the ONNX/PyTorch graph export. Otherwise, if you exported the QAT graph directly, you would find yourself with a very complicated structure that is hard to understand and not really required once training is over.

You can approach it the way you describe by going through named_modules(), but it won't scale. The first issue is that there isn't really any information there about how modules are connected to each other, so for anything that doesn't have a very simple, linear structure you'll struggle to make sense of the dataflow. The other thing is that functional operators won't be captured; in practice those are typically torch.add or torch.cat in residual topologies.
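
A small example of the second issue, with a hypothetical residual block:

import torch
from torch.nn import Module, Linear

class ResidualBlock(Module):
    def __init__(self):
        super().__init__()
        self.fc = Linear(32, 32)

    def forward(self, x):
        # torch.add is a functional op: it never appears in named_modules(),
        # so a module-by-module walk misses the skip connection entirely
        return torch.add(self.fc(x), x)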

My suggestion is really to go for some variant of the graph export in the long term. I'm happy to support the effort from my side.

Alessandro

Duchstf commented 3 years ago

Thank you! I'll pivot my efforts towards ONNX for hls4ml then!