Design built using RTL backend produces incorrect results

LinusJungemann commented 4 months ago

Quick summary

When using the RTL backend for synthesizing a simple 2 layer MLP, the rtlsim fails the verification step and the accelerator produces incorrect results.

Details

Steps to Reproduce

Clone the FINN repository
Checkout the dev branch, with commit hash: f649cdad8e08602ac7d6c353b4c12a3e25694c31
Use provided model and configuration: minimal.tar.gz
Start the docker container with the command: ./run-docker.sh build_custom minimal (Noctua 2 setup in Paderborn, so instead of docker, singularity is used, but this should not have an impact on the result.)

build_RTLVARIANT.py and build_HLSVARIANT.py can be used to select rtl or hls. The selected file needs to be renamed to build.py and placed in the minimal build folder. The difference between both build.py files is, that the rtl variant uses the provided specialize_layers.json.

Expected behavior

Both backends produce the same results.

Actual behavior

The build using the hls backend produces correct results (+ all verification steps succeed), build using rtl backend produces accelerator with incorrect results (+ rtlsim step fails with incorrect results)

fpjentzsch commented 4 months ago

I quickly ran your code.

The offending layer seems to be the second MVAU (i.e., it works with all layers but this one switched to RTL). Btw, even your HLS variant ends up with the Thresholding and FIFOs in RTL mode. The second MVAU has the following config:

Unfortunately, the full output context of the STITCHED_IP_RTLSIM verification is not helpful here because all activation tensors within the "GenericPartition_0" (the actual FINN IP which is surrounded by a MultiThreshold before it (why is this not mapped to the back-end?) and a Mul + Add after it) are reported as [0, ..., 0]. Is this a known issue, @auphelia?

auphelia commented 4 months ago

Yes, the stitched IP mode can only see the input and the output to the dataflow partition. @hannahxy13 has opened a PR that adds node-by-node rtlsim as a verification step for the builder, in this PR: https://github.com/Xilinx/finn/pull/1087 It is possible to either rebuild the MVAUs in isolation or enabling the verification flow from the PR above could give more insights.

auphelia commented 4 months ago

This patch fixes your issue: https://github.com/Xilinx/finn/pull/1135

LinusJungemann commented 4 months ago

Yes, it works now. Thanks!

Xilinx / finn