It is worth noting that the BEL timing numbers are still total nonsense, and the clusterer is not timing-based.
Enabling timing-based placement causes an error:
```
vtr-verilog-to-routing/libs/EXTERNAL/libtatum/libtatum/tatum/timing_paths.cpp:45 find_critical_paths: Assertion 'slack.valid()' failed.
```
Fixing this is the next step.
Interesting, the slack error does not occur with the counter test. I wonder if it is something specific to LUT-RAM usage.
Counter test values:
Real route timing + placer in timing mode:
```
Worst setup slack: {CLBLL_R_X17Y110_SLICE_X26Y110_C_FDRE/C --> CLBLL_R_X17Y116_SLICE_X26Y116_D_FDRE/D} = 5.942
Worst hold slack: {CLBLL_R_X17Y110_SLICE_X26Y110_A_FDRE/C --> CLBLL_R_X17Y110_SLICE_X26Y110_A_FDRE/D} = 0.587
```
Real route timing + placer in bounding box mode:
```
Worst setup slack: {CLBLL_R_X17Y114_SLICE_X26Y114_C_FDRE/C --> CLBLL_R_X17Y120_SLICE_X26Y120_D_FDRE/D} = 6.026
Worst hold slack: {CLBLL_R_X17Y114_SLICE_X26Y114_A_FDRE/C --> CLBLL_R_X17Y114_SLICE_X26Y114_A_FDRE/D} = 0.587
```
So there is effectively no change between the bounding box and timing-based placer. But that might be because the counter test restricts placement through the use of the counter carry chain. The real question is: why does timing-based placement work on the counter test, but not the dram 64x1d test?
I believe I've determined what is causing the timing-based placement to fail: the delay model is sometimes returning "inf" on valid routes. Presumably the "inf", when combined with the timing model, results in a NaN and an invalid slack. I've already filed an issue with VTR about the delay matrix problem, but I'll add details.
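The suspected failure mode is easy to reproduce in isolation. A minimal sketch (in Python for brevity; `placer_delay` is a hypothetical name, not VPR's actual code) of both the NaN and the workaround:

```python
import math

def placer_delay(raw_delay):
    """Guard for the delay model: clamp non-finite delays to zero.

    Hypothetical helper illustrating the workaround described above;
    not VPR's actual code.
    """
    if math.isinf(raw_delay) or math.isnan(raw_delay):
        return 0.0
    return raw_delay

# Why an inf delay poisons slack: subtracting two infinite
# arrival/required times yields NaN, which would fail a check
# like tatum's slack.valid().
print(float("inf") - float("inf"))  # nan
print(placer_delay(float("inf")))   # 0.0
```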
After modifying the placer to always return 0 delays when the delay is inf, the new timing results are:
dram_test_64x1d, using "Real" intrinsic delays and some RC delay. Internal capacitance not modeled, and placer in timing mode (i.e. the default --place_algorithm):
```
Worst setup slack: {CLBLL_R_X15Y135_SLICE_X22Y135_D5_FDRE/C --> CLBLL_R_X15Y135_SLICE_X22Y135_D5_FDRE/D} = -3.164
Worst hold slack: {CLBLL_R_X17Y127_SLICE_X26Y127_A_FDRE/C --> CLBLL_R_X19Y126_SLICE_X28Y126_B_FDRE/D} = 0.220
```
This is actually an improvement. Next step will be to try to fill the delay matrix using a nearest neighbor approach, rather than just setting the value to 0.
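For reference, a nearest-neighbor fill could look like the following sketch (plain Python, not VPR's actual delay matrix code; ties are broken by scan order, and the naive full scan is fine for illustration):

```python
import math

def fill_with_nearest(matrix):
    """Replace inf entries in a delay matrix with the nearest finite
    neighbor by Manhattan distance.

    Sketch of the proposed fix described above; VPR's real delay
    matrix code may differ. Ties are resolved by row-major scan order.
    """
    rows, cols = len(matrix), len(matrix[0])
    out = [row[:] for row in matrix]
    for r in range(rows):
        for c in range(cols):
            if not math.isinf(matrix[r][c]):
                continue
            best, best_dist = 0.0, None
            # Naive O(rows*cols) scan per hole; good enough for a sketch.
            for rr in range(rows):
                for cc in range(cols):
                    if math.isinf(matrix[rr][cc]):
                        continue
                    d = abs(rr - r) + abs(cc - c)
                    if best_dist is None or d < best_dist:
                        best, best_dist = matrix[rr][cc], d
            out[r][c] = best
    return out

inf = float("inf")
print(fill_with_nearest([[1.0, inf], [2.0, 3.0]]))  # [[1.0, 1.0], [2.0, 3.0]]
```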
I did a quick sanity check, and the BEL timing values for the LUT and FF are faster than reality. I wonder if making the timing values worse (i.e. more correct) might make the router/placer try harder?
I've added the CARRY4 blackbox timing, and now the dram_64x1d test correctly reports failing timing!
```
Final critical path: 11.5676 ns, Fmax: 86.4485 MHz
Setup Worst Negative Slack (sWNS): -1.56757 ns
Setup Total Negative Slack (sTNS): -22.7989 ns
```
For reference Vivado reports:
```
WNS(ns) TNS(ns)
------- -------
-3.929  -200.760
```
I'm going to next fix https://github.com/SymbiFlow/symbiflow-arch-defs/issues/578#issuecomment-493496093 so we can directly compare VPR's and Vivado's timing traces. Luckily the timing trace information between VPR and Vivado is very similar, except we don't name the BELs the same. With a small amount of effort I hope we can get comparable timing traces.
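To make the traces comparable, the names have to be normalized first. A hypothetical helper (the regex is my assumption, derived from the VPR-style names in the slack reports above) that splits such a name into tile/site/BEL/pin:

```python
import re

# Hypothetical helper for lining up VPR and Vivado timing traces: split a
# VPR-style name like "CLBLL_R_X17Y110_SLICE_X26Y110_C_FDRE/C" into the
# tile, site, BEL, and pin so it can be matched against Vivado's naming.
# The pattern is an assumption based on the names seen in the traces above.
NAME_RE = re.compile(
    r"^(?P<tile>.+_X\d+Y\d+)_(?P<site>SLICE_X\d+Y\d+)_(?P<bel>.+)/(?P<pin>\w+)$"
)

def split_vpr_name(name):
    """Return a dict of name components, or None if the name doesn't match."""
    m = NAME_RE.match(name)
    return m.groupdict() if m else None

parts = split_vpr_name("CLBLL_R_X17Y110_SLICE_X26Y110_C_FDRE/C")
print(parts)  # {'tile': 'CLBLL_R_X17Y110', 'site': 'SLICE_X26Y110', 'bel': 'C_FDRE', 'pin': 'C'}
```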
If the VPR and Vivado traces agree on the net that is failing slack timing, then the next step is to determine what changes are required to enable VPR to close timing. Interestingly, the routing in this case stopped immediately after clearing routing congestion:
```
Confirming router algorithm: TIMING_DRIVEN.
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter Time pres BBs Heap Re-Rtd Re-Rtd Overused RR Nodes Wirelength CPD sTNS sWNS hTNS hWNS Est Succ
     (sec) fac Updt push Nets Conns (ns) (ns) (ns) (ns) (ns) Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 4091: 17 timing startpoints were not constrained during timing analysis
Warning 4092: 17 timing endpoints were not constrained during timing analysis
1 5.1 0.0 0 3.7e+07 686 2421 715 ( 0.046%) 70519 ( 0.7%) 10.514 -3.692 -0.514 0.000 0.000 N/A
2 5.9 0.5 97 3.9e+07 449 1598 256 ( 0.017%) 71144 ( 0.7%) 10.439 -3.356 -0.439 0.000 0.000 N/A
3 4.1 0.6 61 2.5e+07 231 892 141 ( 0.009%) 72252 ( 0.7%) 10.641 -5.333 -0.641 0.000 0.000 N/A
4 5.2 0.8 42 3.1e+07 275 1188 121 ( 0.008%) 72903 ( 0.7%) 10.586 -5.130 -0.586 0.000 0.000 N/A
5 4.5 1.1 42 2.8e+07 239 1116 11 ( 0.001%) 73664 ( 0.7%) 10.615 -9.131 -0.615 0.000 0.000 N/A
6 4.4 1.4 32 2.7e+07 224 1073 4 ( 0.000%) 73858 ( 0.7%) 10.915 -23.76 -0.915 0.000 0.000 N/A
7 3.5 1.9 26 2.1e+07 220 1034 1 ( 0.000%) 73643 ( 0.7%) 11.290 -22.70 -1.290 0.000 0.000 N/A
8 3.4 2.4 18 2.1e+07 214 1006 0 ( 0.000%) 74052 ( 0.7%) 11.568 -22.80 -1.568 0.000 0.000 N/A
```
It's unclear why the router didn't attempt to close timing.
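One plausible explanation is that a PathFinder-style negotiated-congestion router conventionally treats "no overused RR nodes" as its exit condition, independent of timing. A toy sketch of that loop (my assumption about the behavior, not VPR's actual code):

```python
def route(iterations):
    """Toy PathFinder-style routing loop.

    Each entry of `iterations` is (overused_nodes, sTNS_ns). The loop
    exits as soon as the routing is legal (zero overused nodes), even
    if setup timing is still failing -- matching the log above, where
    the router stops at iteration 8 with sTNS = -22.80 ns.
    """
    for overused, stns in iterations:
        if overused == 0:
            return ("legal", stns)  # success: congestion cleared, timing ignored
        # ...otherwise raise present-congestion costs and re-route...
    return ("failed", None)

print(route([(715, -3.692), (256, -3.356), (4, -23.76), (0, -22.80)]))
```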
To confirm that VPR is not being given an impossible task, I double checked that the same post-synthesis output was given to Vivado and Vivado closed timing. Vivado result:
```
-------------------------------------------------------------------
required time                         14.983
arrival time                         -14.845
-------------------------------------------------------------------
slack                                  0.138
```
Whereas VPR was not:
```
-------------------------------------------------------------------------------------------------------------------------------------------
data required time                    10.368
data arrival time                    -19.746
-------------------------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                      -9.378
```
In this case, VPR's delay model is underestimating the slack violation, but the correct critical path was being tracked (in routing at least), so VPR was working with a reasonably accurate model of the system.
Next step is to file an issue with VTR devs exploring why VPR was unable to close timing.
VPR now closes timing when running with --astar_fac 0, but takes an excessive amount of time (per https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/592). The problem was identified: the A* lookahead function was returning poor estimates, and as a result the router finds suboptimal paths. Setting --astar_fac 0 turns off the heuristic, resulting in the router finding optimal paths, but taking a long time to do so.
One possible solution to this excessive runtime is to improve the quality of the look ahead heuristic.
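To illustrate why lookahead quality matters: in a weighted A* search, a node's priority is its cost-so-far plus astar_fac times the lookahead estimate. With astar_fac 0 the search degenerates to Dijkstra and always finds the optimal path (at the cost of many more expansions); with a nonzero factor, a poor estimate can steer the search onto a suboptimal path. A generic sketch, not VPR's router:

```python
import heapq

def weighted_astar(graph, start, goal, h, astar_fac):
    """Weighted A* on a directed graph.

    graph: node -> list of (neighbor, edge_cost)
    h: node -> heuristic estimate of remaining cost (the "lookahead")
    astar_fac: weight on the heuristic; 0 reduces to Dijkstra.
    Returns the cost of the path found to `goal`, or None.
    """
    dist = {start: 0.0}
    heap = [(astar_fac * h[start], start)]
    while heap:
        _, node = heapq.heappop(heap)
        if node == goal:
            return dist[node]
        for nxt, cost in graph.get(node, []):
            g = dist[node] + cost
            if g < dist.get(nxt, float("inf")):
                dist[nxt] = g
                heapq.heappush(heap, (g + astar_fac * h[nxt], nxt))
    return None

# Two routes from S to G: S->A->G costs 2, S->G directly costs 5.
graph = {"S": [("A", 1.0), ("G", 5.0)], "A": [("G", 1.0)]}
# A poor lookahead that wildly overestimates the cost through A.
h = {"S": 0.0, "A": 10.0, "G": 0.0}
print(weighted_astar(graph, "S", "G", h, 0))    # 2.0 (Dijkstra, optimal)
print(weighted_astar(graph, "S", "G", h, 1.2))  # 5.0 (misled by the estimate)
```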
Lookahead debugging is still continuing, with limited success.
Note: Now that route timing is present on master, VPR runtime has increased as a side effect of the poor lookahead:
Do we want to disable timing import for now?
I saw Travis build times have more than doubled and sometimes time out (50 minute limit). Although maybe we can add -j2, trying in #815.
With https://github.com/SymbiFlow/vtr-verilog-to-routing/pull/83 and https://github.com/SymbiFlow/symbiflow-arch-defs/pull/802 I've been able to reach a runtime/quality tradeoff that is starting to look reasonable:
| A* fac | Fmax (MHz) | Pack (sec) | Place (sec) | Route (sec) |
|---|---|---|---|---|
| 0 | 98.1606 | 135.61 | 2529.12 | 342.71 |
| 1.2 | 91.6605 | 138.38 | 720.81 | 350.37 |
| 1.6 | 72.1894 | 136.93 | 324 | 378.17 |
| 2 | 77.6808 | 131.92 | 239.23 | 349.4 |
I am currently using the same A* factor for both placement and routing; next week I will decouple them. If it turns out the placement delay matrix quality is not the sensitive factor, we can run the delay matrix generation with a higher A* factor.
Across pack/place/route, VPR is recomputing the router lookahead, which is taking ~100 seconds each time. The place time variation is all in the time to build the placer delay matrix.
For curiosity's sake, I ran both murax and picosoc with A* fac = 0 to see how well VPR could do. Picosoc is running into a router infinite loop, but murax completed. Timing was not closed, but as a sanity check I gave the same synthesis to Vivado and only did place and route, and VPR did pretty well:
Vivado:
Setup : 169 Failing Endpoints, Worst Slack -6.754ns, Total Violation -241.995ns
VPR:
Setup : 438 Failing Endpoints, Worst Slack -8.767ns, Total Violation -1156.880ns
While VPR is returning worse results, the gap is not that big. This instead points to a problem with the synthesis step, as even Vivado isn't closing timing.
For reference, if Vivado is used for synthesis and place and route, the design closes timing:
Setup : 0 Failing Endpoints, Worst Slack 2.409ns, Total Violation 0.000ns
Comparison of resource utilization:
Vivado synth:
```
+----------+------+---------------------+
| Ref Name | Used | Functional Category |
+----------+------+---------------------+
| FDRE     |  814 | Flop & Latch        |
| LUT6     |  413 | LUT                 |
| LUT5     |  199 | LUT                 |
| FDCE     |  198 | Flop & Latch        |
| LUT3     |  157 | LUT                 |
| LUT4     |  109 | LUT                 |
| LUT1     |   83 | LUT                 |
| LUT2     |   71 | LUT                 |
| CARRY4   |   50 | CarryLogic          |
| RAMD32   |   24 | Distributed Memory  |
| IBUF     |   18 | IO                  |
| OBUF     |   17 | IO                  |
| FDPE     |   12 | Flop & Latch        |
| RAMS32   |    8 | Distributed Memory  |
| RAMB18E1 |    6 | Block Memory        |
| MUXF7    |    6 | MuxFx               |
| BUFG     |    1 | Clock               |
+----------+------+---------------------+
```
Yosys synth:
```
+----------+------+---------------------+
| Ref Name | Used | Functional Category |
+----------+------+---------------------+
| LUT6     | 1912 | LUT                 |
| LUT5     | 1912 | LUT                 |
| FDRE     |  870 | Flop & Latch        |
| FDCE     |  198 | Flop & Latch        |
| RAMD64E  |  160 | Distributed Memory  |
| CARRY4   |   49 | CarryLogic          |
| MUXF7    |   20 | MuxFx               |
| IBUF     |   18 | IO                  |
| OBUF     |   17 | IO                  |
| FDPE     |   13 | Flop & Latch        |
| MUXF8    |    9 | MuxFx               |
| RAMB18E1 |    4 | Block Memory        |
| BUFHCE   |    1 | Clock               |
| BUFGCTRL |    1 | Clock               |
+----------+------+---------------------+
```
@litghost Can you run with Yosys for synthesis in both cases? The tcl that you need is found here; it is also included below:
```tcl
create_project -force -name top -part xc7a35t-csg324-1
read_xdc top.xdc
read_edif top.edif
link_design -top top -part xc7a35t-csg324-1
report_timing_summary -file top_timing_synth.rpt
report_utilization -hierarchical -file top_utilization_hierarchical_synth.rpt
report_utilization -file top_utilization_synth.rpt
opt_design
place_design
report_utilization -hierarchical -file top_utilization_hierarchical_place.rpt
report_utilization -file top_utilization_place.rpt
report_io -file top_io.rpt
report_control_sets -verbose -file top_control_sets.rpt
report_clock_utilization -file top_clock_utilization.rpt
route_design
phys_opt_design
report_timing_summary -no_header -no_detailed_paths
write_checkpoint -force top_route.dcp
report_route_status -file top_route_status.rpt
report_drc -file top_drc.rpt
report_timing_summary -datasheet -max_paths 10 -file top_timing.rpt
report_power -file top_power.rpt
set_property BITSTREAM.CONFIG.SPI_BUSWIDTH 4 [current_design]
write_bitstream -force top.bit
write_cfgmem -force -format bin -interface spix4 -size 16 -loadbit "up 0x0 top.bit" -file top.bin
quit
```
Sorry I jumped the gun - it looks like you have already done that.
@eddiehung - Could you look at what Yosys is doing here?
@eddiehung @mithro With the yosyshq/xiag branch merged, Yosys now uses the following primitives:
```
$lut               1361
CARRY4_VPR           57
CE_VCC              433
DPRAM64             160
DRAM_2_OUTPUT_STUB   80
FDCE_ZINI           198
FDPE_ZINI            13
FDRE_ZINI           870
MUXF6                93
MUXF7                 2
RAMB18E1_VPR          4
SR_GND              870
```
which looks much closer to the Vivado synth. I'll run Vivado over this synthesis to check timing closure.
Looks like the new yosys synthesis does a lot better in Vivado place and route:
Vivado reported usage
```
Report Cell Usage:
+------+---------+------+
|      |Cell     |Count |
+------+---------+------+
|1     |BUFG     |     1|
|2     |CARRY4   |    57|
|3     |LUT1     |    32|
|4     |LUT2     |   285|
|5     |LUT3     |   259|
|6     |LUT4     |   272|
|7     |LUT5     |   327|
|8     |LUT6     |    93|
|9     |MUXF7    |     2|
|10    |RAM64X1D |    80|
|11    |RAMB18E1 |     4|
|12    |FDCE     |   198|
|13    |FDPE     |    13|
|14    |FDRE     |   870|
|15    |IBUF     |    18|
|16    |OBUF     |    17|
+------+---------+------+
```
Timing summary:
```
---------------------------------------------------------------------------------------------------
From Clock:  clk
  To Clock:  clk
Setup :      0 Failing Endpoints, Worst Slack  1.434ns, Total Violation  0.000ns
Hold  :      0 Failing Endpoints, Worst Slack  0.071ns, Total Violation  0.000ns
PW    :      0 Failing Endpoints, Worst Slack  3.750ns, Total Violation  0.000ns
---------------------------------------------------------------------------------------------------
```
So overall, the Yosys synthesis is still a little worse than the Vivado synthesis, but by a much smaller amount. Some of that difference is likely due to the change in RAM behavior. For example, Yosys doesn't infer any RAM32 primitives, and Vivado chose to implement the register file in BRAM instead of LUT-RAM.
With the new synthesis output, VPR is still failing timing, even with A* factor = 0:
```
---------------------------------------------------------------------------------------------------
From Clock:  io_mainClk
  To Clock:  io_mainClk
Setup :    342 Failing Endpoints, Worst Slack -4.497ns, Total Violation -540.910ns
Hold  :      0 Failing Endpoints, Worst Slack  0.095ns, Total Violation  0.000ns
PW    :      0 Failing Endpoints, Worst Slack  3.750ns, Total Violation  0.000ns
---------------------------------------------------------------------------------------------------
```
Next week I'll explore where the gap is coming from. @HackerFoo suggested adding a congestion-free placement analysis pass that might shed light on how the placer is doing, independent of the router.
@litghost are you able to get the same Vivado table for the Yosys synthesis?
> @litghost are you able to get the same Vivado table for the Yosys synthesis?
What do you mean? I believe you are talking about https://github.com/SymbiFlow/symbiflow-arch-defs/issues/716#issuecomment-501400873 ?
I've found that increasing the bb factor (i.e. expanding the bounding box) was the missing piece.
https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/658#issue-457755763 shows the current results, but it is possible that, with very slow parameters, VPR might close timing on murax!
As of https://github.com/SymbiFlow/symbiflow-arch-defs/pull/986, all xc7 targets are closing at 50 MHz.
As we add better timing information, this bug is for tracking timing quality from VPR on xc7.
Initial result on dram_test_64x1d:
Invalid timing values (used as of 6cd461cc420c9a00a560e489d1f9e4d655b2f9c4 and before):
Zero delay routing:
"Real" intrinsic delays and some RC delay. Internal capacitance not modeled, and placer in bounding box mode (e.g. --place_algorithm bounding_box).
Next step is to see if enabling the timing-based placer helps.