Xilinx / finn-examples

Dataflow QNN inference accelerator examples on FPGAs
BSD 3-Clause "New" or "Revised" License

Residual structure cannot be converted to hls #67

Open TATynise opened 1 year ago

TATynise commented 1 year ago

Hi, I encountered "cycle-free graph violated: partition depends on itself" while running a custom network through FINN. I have tried adjusting the streamlining and convert-to-hls steps according to the ResNet-50 finn-example, but it still fails.

This is the residual part of the network:

[image: residual part of the network]

Referring to "cnv_end2end_example", after streamlining the residual part looks like this:

[image: residual part after streamlining]

Referring to the "streamline nonlinear" step in the ResNet-50 finn-example, the result is as follows:

[image: residual part after the "streamline nonlinear" step]

Then, after converting to HLS, the graph looks like this:

[image: graph after convert-to-hls]

When I finally run `parent_model = model.transform(CreateDataflowPartition())`, it fails because the residual part was not converted successfully. I have tried many approaches, but nothing works; I hope you can provide some guidance.
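One way to narrow a failure like this down is to list which nodes are still regular ONNX ops after the convert-to-hls step. Below is a hedged, stand-alone sketch: plain `(op_type, domain)` tuples stand in for ONNX `NodeProto` objects, and the `fpgadataflow` domain string is an assumption about FINN's custom-op naming; with a real model you would iterate `model.graph.node` the same way.

```python
# Stand-alone sketch: find nodes that did NOT convert to fpgadataflow layers.
# (op_type, domain) tuples stand in for ONNX NodeProto objects; the domain
# string below is an assumption about FINN's custom-op domain naming.
nodes = [
    ("FMPadding_Batch",           "finn.custom_op.fpgadataflow"),
    ("ConvolutionInputGenerator", "finn.custom_op.fpgadataflow"),
    ("Mul",                       ""),  # leftover standard ONNX node
    ("Add",                       ""),  # leftover standard ONNX node
]

leftovers = [op for op, domain in nodes if "fpgadataflow" not in domain]
print("not streamlined/converted:", leftovers)  # → ['Mul', 'Add']
```

Any op that shows up in `leftovers` still needs to be streamlined away or converted before partitioning.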

Thanks.

mmrahorovic commented 10 months ago

Hi @TATynise,

Thanks for your question!

Residual networks are indeed a bit tricky, since they require a streamlining process that is more involved than for linear networks. It looks like the streamlining didn't 'fully streamline' the graph -- meaning you have a few floating-point operators left in your network. In the final image you showed, you can see that the Mul and Add nodes (which are regular ONNX nodes) are mixed with the so-called fpgadataflow nodes (FMPadding_Batch, ConvolutionInputGenerator). The CreateDataflowPartition transform will partition your model into smaller sub-models, where each sub-model consists exclusively of either standard ONNX nodes or fpgadataflow-type nodes (i.e. nodes that will in the end run on the FPGA). Since your network is residual and mixes many of these regular ONNX nodes with fpgadataflow-type nodes, the partitioning becomes more complicated and breaks somewhere along the way.
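To make the "partition depends on itself" failure concrete, here is a toy, stdlib-only model of the situation (hypothetical node names, not FINN internals): one lane of a residual fork still contains a regular ONNX Mul, so grouping nodes by type yields two partitions that each depend on the other.

```python
# Toy graph: node -> (kind, list of producer nodes). One residual lane still
# has a float Mul (regular ONNX node); the rest are fpgadataflow nodes.
graph = {
    "conv0": ("fpgadataflow", []),                 # fork point
    "mul":   ("onnx",         ["conv0"]),          # leftover float op
    "conv1": ("fpgadataflow", ["mul"]),
    "add":   ("onnx",         ["conv0", "conv1"]), # residual join
}

kind = {name: k for name, (k, _) in graph.items()}

# Edges between the two would-be partitions:
edges = {(kind[src], kind[dst])
         for dst, (_, srcs) in graph.items()
         for src in srcs
         if kind[src] != kind[dst]}

# Each partition feeds the other -> a dependency cycle, hence
# "cycle-free graph violated: partition depends on itself".
assert ("fpgadataflow", "onnx") in edges and ("onnx", "fpgadataflow") in edges
```

Once the leftover float ops are streamlined away, each residual lane is purely fpgadataflow and the cycle disappears.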

To resolve this, I would first suggest revisiting the streamlining of your network, since I presume your goal is to run the full network on the FPGA rather than only part of it. One trick that makes this easier is to add uniform quantizers at the end of both residual lanes in your custom network (before exporting it with Brevitas). In the third image you showed, this would result in a MultiThreshold node at the end of both lanes. These MultiThreshold nodes are essentially what allow us to streamline away floating-point operators by moving them around the graph and absorbing them into the MultiThreshold thresholds. By then calling transforms such as AbsorbAddIntoMultiThreshold and AbsorbMulIntoMultiThreshold, those floating-point operators are absorbed into the thresholds of the subsequent MultiThreshold node.

This would remove the floating-point operators you showed in the screenshots and bring you one step closer to full FPGA execution. Hope this helps, and please let us know if you run into further issues!
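For intuition on why the absorption works, here is a scalar toy version (not FINN code; FINN's MultiThreshold is the tensor equivalent): the output of a MultiThreshold depends only on threshold comparisons, so a preceding Mul/Add can be folded into the thresholds, assuming a positive scale.

```python
def multithreshold(x, thresholds):
    # Scalar toy of FINN's MultiThreshold: count thresholds met or exceeded.
    return sum(1 for t in thresholds if x >= t)

a, b = 0.5, 1.25           # leftover float Mul and Add (a > 0 assumed)
T = [0.0, 1.0, 2.0, 3.0]   # example thresholds

# a*x + b >= t  <=>  x >= (t - b) / a, so fold the Mul/Add into T
# (the scalar analogue of AbsorbMul/AddIntoMultiThreshold):
T_absorbed = [(t - b) / a for t in T]

for x in [-3.0, -0.1, 0.0, 0.7, 1.9, 5.0]:
    assert multithreshold(a * x + b, T) == multithreshold(x, T_absorbed)
```

After this folding, the float Mul and Add disappear from the graph and only the (updated) thresholds remain.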