Xilinx / finn

Dataflow compiler for QNN inference on FPGAs
https://xilinx.github.io/finn
BSD 3-Clause "New" or "Revised" License
748 stars 239 forks source link

Design built using RTL backend produces incorrect results #1130

Closed LinusJungemann closed 4 months ago

LinusJungemann commented 4 months ago

Quick summary

When using the RTL backend for synthesizing a simple 2 layer MLP, the rtlsim fails the verification step and the accelerator produces incorrect results.

Details

Steps to Reproduce

  1. Clone the FINN repository
  2. Checkout the dev branch, with commit hash: f649cdad8e08602ac7d6c353b4c12a3e25694c31
  3. Use provided model and configuration: minimal.tar.gz
  4. Start the docker container with the command: ./run-docker.sh build_custom minimal (Noctua 2 setup in Paderborn, so instead of docker, singularity is used, but this should not have an impact on the result.)

build_RTLVARIANT.py and build_HLSVARIANT.py can be used to select rtl or hls. The selected file needs to be renamed to build.py and placed in the minimal build folder. The difference between both build.py files is, that the rtl variant uses the provided specialize_layers.json.

Expected behavior

Both backends produce the same results.

Actual behavior

The build using the hls backend produces correct results (+ all verification steps succeed), build using rtl backend produces accelerator with incorrect results (+ rtlsim step fails with incorrect results)

fpjentzsch commented 4 months ago

I quickly ran your code.

The offending layer seems to be the second MVAU (i.e., it works with all layers but this one switched to RTL). Btw, even your HLS variant ends up with the Thresholding and FIFOs in RTL mode. The second MVAU has the following config: image

Unfortunately, the full output context of the STITCHED_IP_RTLSIM verification is not helpful here because all activation tensors within the "GenericPartition_0" (the actual FINN IP which is surrounded by a MultiThreshold before it (why is this not mapped to the back-end?) and a Mul + Add after it) are reported as [0, ..., 0]. Is this a known issue, @auphelia?

auphelia commented 4 months ago

Yes, the stitched IP mode can only see the input and the output to the dataflow partition. @hannahxy13 has opened a PR that adds node-by-node rtlsim as a verification step for the builder, in this PR: https://github.com/Xilinx/finn/pull/1087 It is possible to either rebuild the MVAUs in isolation or enabling the verification flow from the PR above could give more insights.

auphelia commented 4 months ago

This patch fixes your issue: https://github.com/Xilinx/finn/pull/1135

LinusJungemann commented 4 months ago

Yes, it works now. Thanks!