Xilinx / finn

Dataflow compiler for QNN inference on FPGAs
https://xilinx.github.io/finn
BSD 3-Clause "New" or "Revised" License

Problems with inference in PYNQ-Z1 and emulation stitched IP #855

Open ISPRrj opened 1 year ago

ISPRrj commented 1 year ago

Versions

Commit hash

commit e76f20d1d8d05f2d8ddb52ade0f991915672622b (HEAD -> dev, origin/dev)
Merge: a3b6a7fb 3873325a
Author: auphelia <56755897+auphelia@users.noreply.github.com>
Date:   Tue Jul 11 10:21:26 2023 +0100

    Merge pull request #852 from Xilinx/fix/alveo_build

    Set axilite address range to a minimum of 4K

commit 3873325a31897b8ccbde9a211f90d5184338368e
Author: auphelia <jakobapk@web.de>
Date:   Tue Jul 11 09:44:30 2023 +0100

    [AlveoBuild] Set axilite address range to a minimum of 4K

commit a3b6a7fbbc70224571656242eff57bc452f6753f
Merge: e56e8136 96fc4f57
Author: auphelia <56755897+auphelia@users.noreply.github.com>
Date:   Mon Jul 10 09:14:01 2023 +0100

    Merge pull request #844 from Xilinx/feature/2022_2

    Dev PR to update Docker environment to Ubuntu 22, Python 3.10 and Xilinx tool version

commit 96fc4f57670811fafe1753a63bf0ccfc521da077
Author: auphelia <jakobapk@web.de>
Date:   Fri Jul 7 15:54:13 2023 +0100

    [Deps] Update qonnx version

commit 7924bf7271b41dd808feac0e8c5017222490f553
Author: auphelia <jakobapk@web.de>
Date:   Fri Jul 7 14:31:14 2023 +0100

    [NBs] Update notebooks to only use QONNX export

commit 391cd76ee3edb6e802d9b565a99993c775cc2194
Author: auphelia <jakobapk@web.de>
Date:   Fri Jul 7 12:07:42 2023 +0100

    [deps] Bump clize to 5.0.1 and sigtools to 4.0.1

commit a48b5037871468e8a3e890b4719258c7dd1736e2
Author: auphelia <jakobapk@web.de>
Date:   Thu Jul 6 16:50:29 2023 +0100

    [Tests] Update tests to only use qonnx export

commit 0cd757fbdabea18779f5374842b45a4fd755db10
Author: auphelia <jakobapk@web.de>
Date:   Thu Jul 6 15:50:01 2023 +0100

Quick summary

I am trying to implement the Lenet5 network on the PYNQ-Z1 board. For that purpose I have created the network using brevitas and I have obtained the following accuracy after training (about 55%).

[Image: screenshot of the training accuracy]

I have followed all the steps of the FINN end-to-end flow and even all the intermediate checks (including emulation via PyVerilator).

At first I thought that all the intermediate checks were working correctly, so I deployed to the PYNQ board, where I got only 8% accuracy, well below the 55% obtained with Brevitas.

But the other day I realised that when performing the stitched IP emulation I always get the same output value, regardless of the input value.
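One quick way to confirm this symptom is to run a few distinct inputs through the simulation and check whether the outputs ever differ. The sketch below is a minimal illustration, assuming the outputs have already been collected as NumPy arrays; `outputs_are_constant` is a hypothetical helper and not part of FINN.

```python
import numpy as np


def outputs_are_constant(outputs):
    """Return True if every output array equals the first one."""
    first = np.asarray(outputs[0])
    return all(np.array_equal(first, np.asarray(o)) for o in outputs[1:])


# Hypothetical outputs for three different inputs (e.g. from a simulation run).
stuck = [np.array([2.0]), np.array([2.0]), np.array([2.0])]
healthy = [np.array([2.0]), np.array([0.0]), np.array([4.0])]

print(outputs_are_constant(stuck))    # True
print(outputs_are_constant(healthy))  # False
```

If the check returns True for clearly different inputs, the problem is more likely in how data reaches or leaves the design than in the classifier itself.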

Details

I'm using the following dataset: https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz

Steps to Reproduce


  1. I create the LeNet network in Brevitas. (Note that I use a QuantIdentity with a bit width of 8 at the input, and a bias in every layer except the last one, where I disable it to avoid problems in the later conversion to HLS layers.)
import torch
import torch.nn.functional as F
from torch.nn import Module
import brevitas.nn as qnn

BIT_WIDTH = 2

class QuantWeightActBiasLeNet(Module):
    def __init__(self):
        super(QuantWeightActBiasLeNet, self).__init__()
        self.quant_inp = qnn.QuantIdentity(bit_width=8, return_quant_tensor=True)
        self.conv1 = qnn.QuantConv2d(3, 6, 5, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu1 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.conv2 = qnn.QuantConv2d(6, 16, 5, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu2 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.fc1   = qnn.QuantLinear(16*5*5, 120, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu3 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.fc2   = qnn.QuantLinear(120, 84, bias=True, weight_bit_width=BIT_WIDTH)
        self.relu4 = qnn.QuantReLU(bit_width=BIT_WIDTH, return_quant_tensor=True)
        self.fc3   = qnn.QuantLinear(84, 5, bias=False, weight_bit_width=BIT_WIDTH)

    def forward(self, x):
        out = self.quant_inp(x)
        out = self.relu1(self.conv1(out))
        out = F.max_pool2d(out, 2)
        out = self.relu2(self.conv2(out))
        out = F.max_pool2d(out, 2)
        out = torch.flatten(out,1)
        out = self.relu3(self.fc1(out))
        out = self.relu4(self.fc2(out))
        out = self.fc3(out)

        return out
  2. Network training

  3. Brevitas export

    ready_model_filename = "Lenet_quantized.onnx"
    export_qonnx(model, torch.randn(1, 3, 32, 32), ready_model_filename)
    qonnx_cleanup(ready_model_filename, out_file=ready_model_filename)

  4. Tidy up, pre and post processing.

PREPOST

  5. Lowering and streamlining transformations
    
    model = ModelWrapper("lenet_quantized_pre_post.onnx")
    model = model.transform(MoveScalarLinearPastInvariants())
    model = model.transform(Streamline())
    model = model.transform(LowerConvsToMatMul())
    model = model.transform(MakeMaxPoolNHWC())
    model = model.transform(absorb.AbsorbTransposeIntoMultiThreshold())
    model = model.transform(MakeMaxPoolNHWC())
    model = model.transform(absorb.AbsorbConsecutiveTransposes())
    model = model.transform(Streamline())
    model = model.transform(absorb.AbsorbScalarMulAddIntoTopK())
    model = model.transform(InferDataLayouts())
    model = model.transform(RemoveUnusedTensors())
    model.save("lenet_quantized_streamlined.onnx")


![streamliNED](https://github.com/Xilinx/finn/assets/124183109/b390a47a-6296-4535-ba3a-629ec07f37e6)

6. Conversion to HLS layers

    mem_mode = "decoupled"

    model = model.transform(to_hls.InferBinaryMatrixVectorActivation(mem_mode))
    model = model.transform(to_hls.InferQuantizedMatrixVectorActivation(mem_mode))
    model = model.transform(to_hls.InferLabelSelectLayer())
    model = model.transform(to_hls.InferThresholdingLayer())
    model = model.transform(GiveUniqueNodeNames())
    model = model.transform(to_hls.InferThresholdingLayer())
    model = model.transform(to_hls.InferConvInpGen())
    model = model.transform(to_hls.InferStreamingMaxPool())
    model = model.transform(RemoveCNVtoFCFlatten())
    model = model.transform(absorb.AbsorbConsecutiveTransposes())
    model = model.transform(InferDataLayouts())
    model.save("lenet_hls_layers.onnx")

![HLS](https://github.com/Xilinx/finn/assets/124183109/e2a2cac1-24b9-4d0d-9f4a-e8527d1b43e2)

7. Dataflow partitioning

8. Folding

    model = ModelWrapper("flowers_dataflow_model.onnx")
    fc_layers = model.get_nodes_by_op_type("MatrixVectorActivation")
    # each tuple is (PE, SIMD) for a layer
    folding = [
        (6, 3),
        (2, 6),
        (2, 2),
        (1, 1),
        (1, 1),
    ]
    for fcl, (pe, simd) in zip(fc_layers, folding):
        fcl_inst = getCustomOp(fcl)
        fcl_inst.set_nodeattr("PE", pe)
        fcl_inst.set_nodeattr("SIMD", simd)

    # use same SIMD values for the sliding window operators
    swg_layers = model.get_nodes_by_op_type("ConvolutionInputGenerator")
    for i in range(len(swg_layers)):
        swg_inst = getCustomOp(swg_layers[i])
        simd = folding[i][1]
        swg_inst.set_nodeattr("SIMD", simd)

    model = model.transform(GiveUniqueNodeNames())
    model.save("flowers_lenet_folded.onnx")
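Folding values are only valid if PE divides the number of matrix rows (MH) and SIMD divides the number of matrix columns (MW) of each layer. The sketch below checks the values from this step against shapes derived from the LeNet definition in step 1. It assumes the five MatrixVectorActivation nodes appear in the order conv1, conv2, fc1, fc2, fc3; `check_folding` is a hypothetical helper, not a FINN function.

```python
# (MW, MH) per layer, derived from the LeNet definition in step 1:
# a k x k convolution with C_in inputs and C_out outputs lowers to a
# matrix with MW = k * k * C_in columns and MH = C_out rows.
layer_shapes = [
    (5 * 5 * 3, 6),     # conv1
    (5 * 5 * 6, 16),    # conv2
    (16 * 5 * 5, 120),  # fc1
    (120, 84),          # fc2
    (84, 5),            # fc3
]
folding = [(6, 3), (2, 6), (2, 2), (1, 1), (1, 1)]  # (PE, SIMD)


def check_folding(layer_shapes, folding):
    """Return a list of (layer_index, message) for every violated constraint."""
    problems = []
    for idx, ((mw, mh), (pe, simd)) in enumerate(zip(layer_shapes, folding)):
        if mh % pe != 0:
            problems.append((idx, f"PE={pe} does not divide MH={mh}"))
        if mw % simd != 0:
            problems.append((idx, f"SIMD={simd} does not divide MW={mw}"))
    return problems


print(check_folding(layer_shapes, folding))  # [] -> folding is consistent
```

Under these assumptions the folding above passes, which suggests the accuracy problem lies elsewhere.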


9. Simulation cppsim: works correctly

10. Emulation node by node PyVerilator: works correctly

11. Emulation stitched IP PyVerilator:  **PROBLEM: always get the same output value, regardless of the input value**

12. Deployment on PYNQ: **PROBLEM: 8% inference accuracy**

auphelia commented 1 year ago

Hi @ISPRrj ,

Could you also provide an example input .npy file with corresponding output reference .npy file? FINN expects integer values for all components, is your data set quantized or are you trying to use the first MultiThreshold for quantization?

ISPRrj commented 1 year ago

Hi @auphelia,

inputoutput.zip

In this zip file you can find the input.npy and the corresponding output.npy after applying the brevitas model.

To obtain the output of the brevitas model I have previously applied a normalization (to the input) by dividing by 255 because I have trained the brevitas network with normalized tensors (by doing the ToTensor() transformation). And I have also performed a reshape (1,3,32,32) to the input to have the dimensions expected by brevitas.

When I am performing the inference in FINN instead I am not performing any normalization to the data because this is already applied after the application of the preprocessing transformation.

As for the data set question, I have not performed any quantization. I was trying to use the first MultiThreshold for quantization.

fpjentzsch commented 1 year ago

Hi @ISPRrj,

what do you mean by the "application of the preprocessing transformation"? The division by 255 that normalizes UINT8 inputs to FLOAT [0,1]? Does that mean the primary input datatype (going into the first MultiThreshold) is annotated as UINT8?

I have two suggestions:

  1. Do you transpose the input data before feeding it to the stitched-ip and hardware accelerator? The initial "Transpose" node after HLS conversion will not be handled by the accelerator. FINN's HLS layers all operate on NHWC data layout, while you train on NCHW. Reshaping will not be enough.
  2. There might be something wrong with the way you normalize/quantize the input. Could you try to train directly on UINT8 inputs, so that you do not need the initial MultiThreshold and the input pixels (0-255) can be consumed without normalization by the first ConvolutionInputGenerator?
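The difference between transposing and reshaping can be seen on a tiny tensor. The following NumPy sketch is only an illustration with made-up values: both operations produce the NHWC shape, but only the transpose keeps each pixel's channel values together.

```python
import numpy as np

# Small NCHW tensor: 1 image, 3 channels, 2x2 pixels, with distinct values.
x_nchw = np.arange(1 * 3 * 2 * 2).reshape(1, 3, 2, 2)

x_transposed = x_nchw.transpose(0, 2, 3, 1)  # correct NCHW -> NHWC
x_reshaped = x_nchw.reshape(1, 2, 2, 3)      # same shape, wrong element order

print(x_transposed.shape == x_reshaped.shape)    # True
print(np.array_equal(x_transposed, x_reshaped))  # False
# The pixel at (h=0, w=0) should hold one value per channel.
print(x_transposed[0, 0, 0].tolist())            # [0, 4, 8]
print(x_reshaped[0, 0, 0].tolist())              # [0, 1, 2]
```

With a single channel the two results coincide, so this pitfall only shows up for multi-channel inputs such as RGB images.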

In general (for FLOAT inputs), I will use a QuantIdentity or similar as the input quantization node in Brevitas. Since FINN's MultiThreshold node does not support FLOAT inputs, I remove this node manually from the graph and do the input quantization somewhere else (in software). I know the exact quantization range by reading it from the nodes properties (if it was dynamically determined during training) or by setting a fixed range (using min_val, max_val, and scaling_impl_type = ScalingImplType.CONST).
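Doing the input quantization in software could look roughly like the sketch below. It assumes a uniform quantizer with zero offset whose scale has been read from the exported model; the exact scale and rounding that Brevitas applies depend on the quantizer configuration, so `quantize_input` and the scale of 1/255 are illustrative assumptions rather than the library's behaviour.

```python
import numpy as np


def quantize_input(x, scale, bit_width=8, signed=False):
    """Map float values to integers on a uniform grid with zero offset."""
    if signed:
        qmin, qmax = -(2 ** (bit_width - 1)), 2 ** (bit_width - 1) - 1
    else:
        qmin, qmax = 0, 2 ** bit_width - 1
    q = np.clip(np.round(x / scale), qmin, qmax)
    # Keep float32 as the container type; the values themselves are integers.
    return q.astype(np.float32)


pixels = np.array([0, 1, 128, 255], dtype=np.uint8)
normalized = pixels / 255.0                      # what ToTensor() would produce
q = quantize_input(normalized, scale=1.0 / 255)  # assumed scale, for illustration
print(q.tolist())  # [0.0, 1.0, 128.0, 255.0]
```

With this particular scale the quantized values are simply the original pixel values, which is why feeding raw UINT8 pixels can replace the normalize-then-quantize path.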

ISPRrj commented 1 year ago

Hi @fpjentzsch, thank you very much for your feedback!

> what do you mean by the "application of the preprocessing transformation"? The division by 255 that normalizes UINT8 inputs to FLOAT [0,1]? Does that mean the primary input datatype (going into the first MultiThreshold) is annotated as UINT8?

Yes, I mean the division by 255 that normalizes UINT8 inputs to FLOAT [0,1]. But that does not mean that the input going into the first MultiThreshold is annotated as UINT8. In fact, it is annotated as float32, as you can see in the image below.

[Image: Captura]

> Do you transpose the input data before feeding it to the stitched-ip and hardware accelerator?

Yes, I was aware of that and transposed the data before passing it to the accelerator.

> There might be something wrong with the way you normalize/quantize the input. Could you try to train directly on UINT8 inputs, so that you do not need the initial MultiThreshold and the input pixels (0-255) can be consumed without normalization by the first ConvolutionInputGenerator?

I have tried to train directly on UINT8 inputs, but when I run inference with FINN I get the following error. Is this because I have to manually remove the initial MultiThreshold? If so, how can I do it?

[Images: Captura2, captura3]

> In general (for FLOAT inputs), I will use a QuantIdentity or similar as the input quantization node in Brevitas. Since FINN's MultiThreshold node does not support FLOAT inputs, I remove this node manually from the graph and do the input quantization somewhere else (in software). I know the exact quantization range by reading it from the nodes properties (if it was dynamically determined during training) or by setting a fixed range (using min_val, max_val, and scaling_impl_type = ScalingImplType.CONST).

So in my case, would you replace the QuantReLU with a QuantIdentity?

Thank you very much for your time!

fpjentzsch commented 1 year ago

Hi,

I would not change the QuantRelu within your model, but you may try to set a fixed quantization range for the input quant node like this (maybe adjust it to use UINT8 instead of INT8):

from brevitas.inject.defaults import Int8ActPerTensorFloatMinMaxInit
from brevitas.inject.enum import ScalingImplType

# `config` is a user-defined dict holding the quantization range and bit width
class InputQuantizer(Int8ActPerTensorFloatMinMaxInit):
    min_val = -config["in_quant_range"]
    max_val = config["in_quant_range"]
    scaling_impl_type = ScalingImplType.CONST
    bit_width = config["in_quant_bits"]

# inside the model's __init__:
self.quant_inp = qnn.QuantHardTanh(act_quant=InputQuantizer, return_quant_tensor=True)

then you could remove the initial MultiThreshold manually like this (after converting from QONNX to FINN-ONNX):

first_node = model.graph.node[0]
if first_node.op_type == "MultiThreshold":
    quantized_input_dtype = model.get_tensor_datatype(first_node.output[0])
    # remove nodes up to first Mul (= MT + Add used for input quant)
    new_input_node = model.get_nodes_by_op_type("Mul")[0]
    new_input_tensor = model.get_tensor_valueinfo(new_input_node.input[0])
    old_input_tensor = model.graph.input[0]
    model.graph.input.remove(old_input_tensor)
    model.graph.input.append(new_input_tensor)
    model.graph.value_info.remove(new_input_tensor) # remove redundant value_info
    new_input_index = model.get_node_index(new_input_node)
    del model.graph.node[0:new_input_index]
    # make sure input datatype is set correctly
    model.set_tensor_datatype(model.graph.input[0].name, quantized_input_dtype)
else:
    model.set_tensor_datatype(model.graph.input[0].name, DataType["UINT8"])

In any case, the input data type should be properly annotated (see the last line). I'm a bit confused that you didn't run into other issues down the line while your input is set to FLOAT32...

ISPRrj commented 1 year ago

Hi @fpjentzsch !

I have made the changes and now I get the following error in the inference with finn:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(uint8)) , expected: (tensor(float))

fpjentzsch commented 1 year ago

It looks like the actual tensor datatype is UINT8 now, so ONNXRuntime complains. The tensor datatype should still be float throughout the whole model; FINN uses it as a container datatype to hold integers. The true datatype from FINN's perspective is just annotated in the "quantization: finn_datatype" attribute, which is set using model.set_tensor_datatype(model.graph.input[0].name, DataType["UINT8"]) for example.

What if you cast the input tensor to float datatype before feeding it to the model?
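The cast changes only the container type, not the values. A minimal NumPy sketch with made-up pixel values:

```python
import numpy as np

img_uint8 = np.array([[0, 127, 255]], dtype=np.uint8)

# Cast the container type only; the integer values are unchanged.
img_float = img_uint8.astype(np.float32)

print(img_float.dtype)                                  # float32
print(np.array_equal(img_float, img_uint8))             # True
# Every value is still an integer in the UINT8 range.
print(bool(np.all(img_float == np.round(img_float))))   # True
```

The array can then be passed to the model as float32, while the FINN annotation set through `model.set_tensor_datatype` continues to describe it as UINT8.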

ISPRrj commented 1 year ago

If I cast the input to float before feeding it to the model, inference runs. But now the problem is that, out of the 100 test cases I am running, the FINN and Brevitas predictions match in only 40.
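To see which samples disagree, and not only how many, the two sets of predicted labels can be compared index by index. The sketch below uses made-up labels; `compare_predictions` is a hypothetical helper.

```python
import numpy as np


def compare_predictions(pred_a, pred_b):
    """Return the fraction of matching labels and the mismatching indices."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    agreement = float(np.mean(pred_a == pred_b))
    mismatches = np.flatnonzero(pred_a != pred_b).tolist()
    return agreement, mismatches


# Hypothetical class labels for five test images.
brevitas_labels = [0, 3, 1, 4, 2]
finn_labels = [0, 3, 2, 4, 0]

agreement, bad = compare_predictions(brevitas_labels, finn_labels)
print(agreement, bad)  # 0.6 [2, 4]
```

Inspecting the raw outputs of a few mismatching samples usually narrows the search faster than the aggregate match rate does.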

rassB commented 11 months ago

Hello @ISPRrj @fpjentzsch, any developments regarding this issue? I have reproduced all the steps so far with a quite similar NN that reaches only 10% accuracy on MNIST in hardware (it always infers 0).

fpjentzsch commented 10 months ago

Hi @rassB, did you also run verification/simulation at different build steps to narrow down where the error is introduced? If the problem is related to input/output shaping and quantization, maybe our new tutorial notebook could be a useful resource, as it covers a custom build step to deal with 8-bit RGB inputs.
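Narrowing down where the error is introduced amounts to comparing the output of each intermediate model with the reference for the same input and reporting the first mismatch. The sketch below assumes those per-stage outputs have already been saved as NumPy arrays; the stage names and the helper `first_diverging_stage` are hypothetical labels, not FINN identifiers.

```python
import numpy as np


def first_diverging_stage(reference, stage_outputs, atol=1e-6):
    """Return the name of the first stage whose output differs from the reference.

    stage_outputs is an ordered list of (stage_name, output_array) pairs.
    Returns None if every stage matches.
    """
    for name, out in stage_outputs:
        if not np.allclose(reference, out, atol=atol):
            return name
    return None


reference = np.array([1.0, 0.0, 3.0])
# Hypothetical per-stage outputs for the same input.
stages = [
    ("tidy_up", np.array([1.0, 0.0, 3.0])),
    ("streamlined", np.array([1.0, 0.0, 3.0])),
    ("hls_cppsim", np.array([1.0, 2.0, 3.0])),
    ("stitched_ip_rtlsim", np.array([1.0, 2.0, 3.0])),
]
print(first_diverging_stage(reference, stages))  # hls_cppsim
```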

shakeelakram00 commented 6 months ago

Hi @fpjentzsch @rassB @ISPRrj @maltanar @auphelia , I have been closely following the ongoing discussion and encountered a similar accuracy discrepancy issue with my setup. Specifically, the accuracy of the Brevitas model is 86%, while the Accelerator on ZCU102 shows only 66%. After performing Initial Tidyup Transformations below, the accuracy of ONNX-converted model remains consistent with the brevitas model. However, applying the remaining transformations and building the accelerator results in a drop to 66% accuracy on the board.

Initial Tidyup Transformation:

    bo.export_finn_onnx(brevitas_model, (1, 1, 14, 14), "export.onnx")
    model = ModelWrapper("export.onnx")
    model = model.transform(InferShapes())
    model = model.transform(FoldConstants())
    ...
    output_dict = oxe.execute_onnx(model_t, input_dict)

I am seeking assistance to identify and resolve this issue. Here are some details about my environment:

ZCU102: PYNQ Linux, based on Ubuntu 18.04 (GNU/Linux 4.19.0-xilinx-v2019.1 a)

FINN: v0.9

Xilinx tools: 2022.2

Ubuntu: 22.04.1 LTS

I have been working with the cnv_end2end_example and successfully modified it to build the Accelerator on a different dataset. The brevitas model was trained on a dataset with a shape of 1x1x14x14 and dtype torch.float32.

Following the cnv_end2end_example, the first layer performs the quantization, and the ONNX model includes pre-processing (ToTensor(), i.e. division by 255 to normalize UINT8 inputs to FLOAT [0,1]) and post-processing (TopK=1). After create_dataflow_partition, all nodes of the ONNX model are converted into HLS layers, except the initial Transpose.

Given that the first Transpose was not converted to an HLS layer, and the accelerator works with a dataset of shape 1x14x14x1 and dtype UINT8, I reshaped the dataset accordingly for inference on ZCU102 (1x14x14x1 and dtype np.uint8, i.e. (dataset*255).astype(np.uint8)).
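The conversion described here can be sketched in NumPy as follows. Two details matter: the scaling has to be parenthesized before the cast, and a plain cast truncates, so a product that lands just below an integer may come out one too low unless it is rounded first. The data below is a stand-in built from all 256 pixel values, not the actual dataset from this comment.

```python
import numpy as np

# Stand-in for normalized data: every UINT8 pixel value divided by 255,
# arranged as N x C x H x W = 1 x 1 x 16 x 16 (the thread uses 1 x 1 x 14 x 14).
pixels = np.arange(256, dtype=np.uint8).reshape(1, 1, 16, 16)
dataset = (pixels / 255.0).astype(np.float32)

# Scale first, round to the nearest integer, then cast.
restored = np.round(dataset * 255).astype(np.uint8)

# NCHW -> NHWC for the accelerator input.
nhwc = restored.transpose(0, 2, 3, 1)

print(nhwc.shape)                        # (1, 16, 16, 1)
print(np.array_equal(restored, pixels))  # True
```

Whether rounding or truncation is appropriate depends on how the training data was produced, so the result is worth checking against the original integer images where they are available.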

Runtime_writeable_weights are enabled (set to 1) in the .json file for MVAU of CNV and Linear Layers, following the guidelines in 4_advanced_builder_settings and cnv-w1a1_folding_config.

I would appreciate any assistance in debugging this issue.

@fpjentzsch, you mentioned in your previous reply that reshaping alone might not be sufficient. Could you please provide further guidance, considering my specific setup, to achieve the desired accuracy on the accelerator?

Thank you in advance for your help.

shakeelakram00 commented 5 months ago

Hi @fpjentzsch @rassB @ISPRrj @maltanar @auphelia @heborras @Tobi-Alonso @quetric @mmrahorovic @preusser, I've been diligently verifying each stage of FINN Flow for the above query, and I've run into a perplexing issue that I could use some guidance on.

Initially, during the ONNX execution, I achieved a commendable accuracy of 86% after applying tidy-up transformations, pre and post-processing transformations. However, upon proceeding with the streamline transformations, I encountered a significant drop in accuracy to 68%. This drop persisted when deploying the model onto an FPGA.

To give you a clearer picture, here are the streamline transformations I've implemented:

    model = model.transform(MoveScalarLinearPastInvariants())
    model = model.transform(Streamline())
    model = model.transform(LowerConvsToMatMul())
    model = model.transform(MakeMaxPoolNHWC())
    model = model.transform(Streamline())
    model = model.transform(absorb.AbsorbTransposeIntoMultiThreshold())
    model = model.transform(ConvertBipolarMatMulToXnorPopcount())
    model = model.transform(Streamline())
    model = model.transform(absorb.AbsorbScalarMulAddIntoTopK())
    model = model.transform(InferDataLayouts())
    model = model.transform(RemoveUnusedTensors())

I also tried finn.builder.build_dataflow, and it showed the same issue: accuracy drops as soon as the streamline transformations are applied.

Only when I remove the LowerConvsToMatMul transformation do I get the same 86% accuracy. I know that this transformation is needed, because convolutions have to be converted to MatMul nodes before the model can be converted to HLS-compatible nodes. Other than this, the only difference I see is in the finn_datatype of multithreshold_1 and multithreshold_2: it is Binary with LowerConvsToMatMul (68% accuracy) and Bipolar without it (86% accuracy).

I'm at a loss as to why this transformation is causing such a significant accuracy drop. Is it due to the even kernel size (6x6) I am using in QuantConv2d? Any insights or suggestions you could offer would be greatly appreciated.
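For background on the Binary/Bipolar difference: a dot product of bipolar vectors (values in {-1, +1}) can be computed from their binary encodings ({0, 1}) with XNOR and popcount, but only if the result is mapped back with 2 * popcount - N. The sketch below checks that identity on random vectors. It illustrates why the interpretation of the datatype annotation matters and does not diagnose this particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16

# Random bipolar vectors with values in {-1, +1}.
a_bipolar = rng.choice([-1, 1], size=n)
w_bipolar = rng.choice([-1, 1], size=n)

# Binary encoding of the same vectors: -1 -> 0, +1 -> 1.
a_binary = (a_bipolar + 1) // 2
w_binary = (w_bipolar + 1) // 2

dot_bipolar = int(a_bipolar @ w_bipolar)

# XNOR is 1 where the bits agree; popcount counts those positions.
popcount = int(np.sum(a_binary == w_binary))
dot_from_xnor = 2 * popcount - n

print(dot_bipolar == dot_from_xnor)  # True
```

Treating the 0/1 codes as ordinary numbers without that correction generally gives a different result, so thresholds computed for one interpretation do not carry over to the other.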

Thank you for your time and assistance.

joannapng commented 4 months ago

I had the same issue and I think the reason is the bias in the convolution layers. During the export to qonnx format, the bias quant initializer was not exported so the ExtractConvBias transformation that happens during the conversion from qonnx to finn-onnx failed to add an "Add" node in front of the "Conv" node. Check the logs to see if you encounter a "Could not extract bias from node" warning and remove the biases from the network if so.
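Checking the logs for that warning can be automated with a few lines of plain Python. The log excerpt below is hypothetical and real log lines may be worded differently around the message; only the searched phrase is taken from the comment above.

```python
def find_bias_warnings(log_lines, needle="Could not extract bias from node"):
    """Return (line_number, text) for every log line containing the warning."""
    return [
        (lineno, line.rstrip("\n"))
        for lineno, line in enumerate(log_lines, start=1)
        if needle in line
    ]


# Hypothetical log excerpt for illustration.
sample_log = [
    "Running step: step_qonnx_to_finn\n",
    "UserWarning: Could not extract bias from node Conv_0\n",
    "Running step: step_tidy_up\n",
]
print(find_bias_warnings(sample_log))
# [(2, 'UserWarning: Could not extract bias from node Conv_0')]
```

For a real build, the same function can be given an open file object, since iterating over a file yields its lines.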