ferrandi / PandA-bambu

PandA-bambu public repository
GNU General Public License v3.0

Timing issue on Xilinx backend and sdc scheduling issue #343

Open sheldonz7 opened 2 weeks ago

sheldonz7 commented 2 weeks ago

Dear Bambu team, when using RTL designs generated by Bambu, I consistently get very tight timing after Vivado implementation (post-route timing report), much worse than what Bambu estimated during its backend run. Setting a cprf (clock period resource fraction) of less than 1 helps with the Bambu estimate but not with the implementation result.

With a clock period constraint of 10 ns, this is one sample result I get:
cprf = 1: Bambu estimate 9.836 ns, Vivado implementation report 10.469 ns
cprf = 0.7: Bambu estimate 5.275 ns, Vivado implementation report 9.604 ns

Is this expected, or are there ways to bring it down? I tried using pipelined floating-point units, which does not seem to help.

I tried to switch to speculative SDC scheduling by including the -s option, but I receive errors during the run, as provided in this file: stderr.txt

For these experiments, this is the command I use:

bambu --top-fname=k2mm --print-dot --compiler=I386_CLANG13 -O2 --debug 4 --verbosity 4 --device=xcku060-3ffva1156-VVD --clock-period=10 --disable-function-proxy --generate-interface=INFER -s ../k2mm.c

This is the C code I used: k2mm.c.txt k2mm.h.txt

fabrizioferrandi commented 1 week ago

A first quick comment: --debug 4 does not work very well. Please use --debug-classes instead. This option raises the debug level for specific classes/steps. An example is --debug-classes=parametric_list_based,cdfc_module_binding, where the debug level is raised to maximum verbosity for the list-based scheduling and the module-binding steps.
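For instance, applied to the command above (just a sketch, keeping all the other options as in the original invocation):

bambu --top-fname=k2mm --print-dot --compiler=I386_CLANG13 -O2 --debug-classes=parametric_list_based,cdfc_module_binding --verbosity 4 --device=xcku060-3ffva1156-VVD --clock-period=10 --disable-function-proxy --generate-interface=INFER -s ../k2mm.c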

fabrizioferrandi commented 1 week ago

The Bambu default is with resource sharing (-C='*'). Sharing resources may create timing issues. One solution is to register the function inputs: adding --registered-inputs=yes fixed the timing issue. Another is to increase the number of resources. In your case, the critical path goes through the function float_adde8m23b_127nih, so you could pass -C=float_adde8m23b_127nih=2 to use two FP adders instead of one. The first solution increases the latency, while the second increases the area usage.
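For example, starting from your command line (sketches only; all other options left unchanged):

bambu --top-fname=k2mm --compiler=I386_CLANG13 -O2 --device=xcku060-3ffva1156-VVD --clock-period=10 --disable-function-proxy --generate-interface=INFER --registered-inputs=yes ../k2mm.c
bambu --top-fname=k2mm --compiler=I386_CLANG13 -O2 --device=xcku060-3ffva1156-VVD --clock-period=10 --disable-function-proxy --generate-interface=INFER -C=float_adde8m23b_127nih=2 ../k2mm.c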

fabrizioferrandi commented 1 week ago

Regarding the -s option, it matters only if you have if statements, since it activates the code-motion/speculation SDC-based scheduling.
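As a hypothetical illustration (not taken from your k2mm code), this is the kind of construct that pass acts on:

    /* Hypothetical example: with -s, the multiplication in the taken
       branch can be speculated (moved above the condition), so it is no
       longer serialized behind the branch decision. */
    float cond_kernel(float a, float b, int sel) {
        float r;
        if (sel)
            r = a * b;  /* candidate for speculation/code motion */
        else
            r = a + b;
        return r;
    }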

sheldonz7 commented 1 week ago

Hi, thank you for the clarification! Regarding resource sharing, I'm already using --disable-function-proxy, which should permit more than one functional unit for operations like fadd, right? Are you saying that -C enforces the number of specific functional units generated in the design, rather than relying on module binding (which, I believe, generates the minimum number of functional units necessary)?

Ansaya commented 1 week ago

Hi, --disable-function-proxy and -C are two separate options not to be confused.

Function proxies are used to share a hardware function between multiple callers. As an example, say you have three functions A, B, and C, where A and B both call C: with function proxies enabled, the tool will generate a single instance of C and use it for both A and B, while using --disable-function-proxy will result in two dedicated instances of C, one for A and one for B.
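A minimal (hypothetical) C sketch of that call structure:

    /* Hypothetical sketch: A and B both call C. With function proxies
       enabled, a single hardware instance of C is shared by A and B;
       with --disable-function-proxy, each caller gets its own instance of C. */
    static float C(float x) { return x * x; }
    float A(float x) { return C(x) + 1.0f; }
    float B(float x) { return C(x) - 1.0f; }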

Setting the number of functional units for a given operation with -C means that you allow the module binding to use that many functional units instead of a single one.