f4pga / f4pga-arch-defs

FOSS architecture definitions of FPGA hardware useful for doing PnR device generation.
https://f4pga.org
ISC License

[XC7] Timing quality tracking bug #716

Closed: litghost closed this issue 5 years ago

litghost commented 5 years ago

As we add better timing information, this bug tracks the timing quality reported by VPR on xc7.

Initial results on dram_test_64x1d:

Invalid timing values (used as of 6cd461cc420c9a00a560e489d1f9e4d655b2f9c4 and before):

Worst setup slack: {CLBLL_R_X15Y112_SLICE_X22Y112_C_FDRE/C --> CLBLL_R_X15Y111_SLICE_X23Y111_D5_FDRE/CE} = -3.794
Worst hold slack: {CLBLL_R_X21Y110_SLICE_X32Y110_C_FDRE/C --> CLBLM_L_X20Y110_SLICE_X30Y110_RAM64X1D_CD/DP/WADR1} = 0.068

Zero delay routing:

Worst setup slack: {CLBLL_L_X14Y112_SLICE_X20Y112_C_FDRE/C --> CLBLL_L_X16Y109_SLICE_X25Y109_D5_FDRE/CE} = -3.352
Worst hold slack: {CLBLL_R_X17Y133_SLICE_X26Y133_D_FDRE/C --> CLBLL_R_X17Y132_SLICE_X27Y132_D_FDRE/D} = 0.181

"Real" intrinsic delays and some RC delay. Internal capacitance not modeled, and placer in bounding box mode (e.g. --place_algorithm bounding_box)

Worst setup slack: {CLBLL_R_X17Y120_SLICE_X27Y120_C_FDRE/C --> CLBLL_L_X16Y136_SLICE_X24Y136_D5_FDRE/CE} = -3.404
Worst hold slack: {CLBLL_R_X17Y119_SLICE_X27Y119_D_FDRE/C --> CLBLM_L_X22Y119_SLICE_X34Y119_RAM64X1D_CD/DP/WADR3} = 0.139

The next step is to see whether enabling the timing-based placer helps.

litghost commented 5 years ago

It is worth noting that the BEL timing numbers are still total nonsense, and the clusterer is not timing-based.

litghost commented 5 years ago

Enabling timing-based placement causes an error:

vtr-verilog-to-routing/libs/EXTERNAL/libtatum/libtatum/tatum/timing_paths.cpp:45 find_critical_paths: Assertion 'slack.valid()' failed.

Fixing this is the next step.

litghost commented 5 years ago

Interestingly, the slack error does not occur with the counter test. I wonder if it is something specific to LUT-RAM usage.

litghost commented 5 years ago

Counter test values:

Real route timing + placer in timing mode:

Worst setup slack: {CLBLL_R_X17Y110_SLICE_X26Y110_C_FDRE/C --> CLBLL_R_X17Y116_SLICE_X26Y116_D_FDRE/D} = 5.942
Worst hold slack: {CLBLL_R_X17Y110_SLICE_X26Y110_A_FDRE/C --> CLBLL_R_X17Y110_SLICE_X26Y110_A_FDRE/D} = 0.587

Real route timing + placer in bounding box mode:

Worst setup slack: {CLBLL_R_X17Y114_SLICE_X26Y114_C_FDRE/C --> CLBLL_R_X17Y120_SLICE_X26Y120_D_FDRE/D} = 6.026
Worst hold slack: {CLBLL_R_X17Y114_SLICE_X26Y114_A_FDRE/C --> CLBLL_R_X17Y114_SLICE_X26Y114_A_FDRE/D} = 0.587

So there is effectively no change between the bounding box and timing-based placers. That might be because the counter test restricts placement through its use of the carry chain. The real question is why timing-based placement works on the counter test but not on the dram_64x1d test.

litghost commented 5 years ago

I believe I've determined what is causing the timing-based placement to fail: the delay model is sometimes returning "inf" on valid routes. Presumably the "inf", when combined with the timing model, results in a NaN and an invalid slack. I've already filed an issue with VTR about the delay matrix problem, but I'll add details.
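
A toy illustration (plain IEEE floating-point behaviour in Python, not VPR code) of how a single inf delay estimate can poison the slack arithmetic:

inf = float("inf")
arrival = 2.0 + inf      # an arrival time that has absorbed an inf delay estimate
print(10.0 - arrival)    # -inf: no longer a usable slack
print(inf - inf)         # nan: combining two such quantities gives an invalid slack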

litghost commented 5 years ago

After modifying the placer to return a delay of 0 whenever the delay model returns inf, the new timing results are:

dram_test_64x1d, using "Real" intrinsic delays and some RC delay; internal capacitance not modeled, and the placer in timing mode (i.e. the default --place_algorithm):

Worst setup slack: {CLBLL_R_X15Y135_SLICE_X22Y135_D5_FDRE/C --> CLBLL_R_X15Y135_SLICE_X22Y135_D5_FDRE/D} = -3.164
Worst hold slack: {CLBLL_R_X17Y127_SLICE_X26Y127_A_FDRE/C --> CLBLL_R_X19Y126_SLICE_X28Y126_B_FDRE/D} = 0.220

This is actually an improvement. The next step will be to fill the delay matrix using a nearest-neighbor approach rather than just setting the value to 0.
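
As a rough sketch of the nearest-neighbor idea (assuming a 2-D NumPy delta-delay matrix; the function name and data layout are hypothetical, not VPR's actual code):

import numpy as np

def fill_inf_with_nearest(delta_delay):
    """Replace non-finite entries of a (dx, dy)-indexed delay matrix with the
    value of the nearest finite entry (Manhattan distance).
    Assumes at least one finite entry exists."""
    filled = delta_delay.copy()
    finite = np.argwhere(np.isfinite(delta_delay))
    for x, y in np.argwhere(~np.isfinite(delta_delay)):
        dist = np.abs(finite[:, 0] - x) + np.abs(finite[:, 1] - y)
        nx, ny = finite[dist.argmin()]
        filled[x, y] = delta_delay[nx, ny]
    return filled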

litghost commented 5 years ago

I did a quick sanity check, and the BEL timing values for the LUT and FF are faster than reality. I wonder if making the timing values worse (i.e. more correct) might make the router/placer try harder.

litghost commented 5 years ago

I've added the CARRY4 blackbox timing, and now the dram_64x1d test correctly reports failing timing!

Final critical path: 11.5676 ns, Fmax: 86.4485 MHz
Setup Worst Negative Slack (sWNS): -1.56757 ns
Setup Total Negative Slack (sTNS): -22.7989 ns
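
A quick consistency check on those numbers (my reading, not stated in the report): Fmax is just 1 / critical path, i.e. 1 / 11.5676 ns ≈ 86.45 MHz, and critical path minus |sWNS| is 11.5676 - 1.5676 ≈ 10.0 ns, which suggests the design is being held to a 10 ns (100 MHz) clock constraint that it is failing.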

For reference Vivado reports:

    WNS(ns)      TNS(ns)
    -------      -------
     -3.929     -200.760

Next I'm going to fix https://github.com/SymbiFlow/symbiflow-arch-defs/issues/578#issuecomment-493496093 so we can directly compare VPR's and Vivado's timing traces. Luckily the timing trace information from VPR and Vivado is very similar, except that we don't name the BELs the same way. With a small amount of effort I hope we can get comparable timing traces.

If the VPR and Vivado traces agree on the net that is failing timing, the next step is to determine what changes are required for VPR to close timing. Interestingly, the router in this case stopped immediately after clearing routing congestion:

Confirming router algorithm: TIMING_DRIVEN.
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 4091: 17 timing startpoints were not constrained during timing analysis
Warning 4092: 17 timing endpoints were not constrained during timing analysis
   1    5.1     0.0    0 3.7e+07     686    2421     715 ( 0.046%)   70519 ( 0.7%)   10.514     -3.692     -0.514      0.000      0.000      N/A
   2    5.9     0.5   97 3.9e+07     449    1598     256 ( 0.017%)   71144 ( 0.7%)   10.439     -3.356     -0.439      0.000      0.000      N/A
   3    4.1     0.6   61 2.5e+07     231     892     141 ( 0.009%)   72252 ( 0.7%)   10.641     -5.333     -0.641      0.000      0.000      N/A
   4    5.2     0.8   42 3.1e+07     275    1188     121 ( 0.008%)   72903 ( 0.7%)   10.586     -5.130     -0.586      0.000      0.000      N/A
   5    4.5     1.1   42 2.8e+07     239    1116      11 ( 0.001%)   73664 ( 0.7%)   10.615     -9.131     -0.615      0.000      0.000      N/A
   6    4.4     1.4   32 2.7e+07     224    1073       4 ( 0.000%)   73858 ( 0.7%)   10.915     -23.76     -0.915      0.000      0.000      N/A
   7    3.5     1.9   26 2.1e+07     220    1034       1 ( 0.000%)   73643 ( 0.7%)   11.290     -22.70     -1.290      0.000      0.000      N/A
   8    3.4     2.4   18 2.1e+07     214    1006       0 ( 0.000%)   74052 ( 0.7%)   11.568     -22.80     -1.568      0.000      0.000      N/A

It's unclear why the router didn't attempt to close timing.

litghost commented 5 years ago

To confirm that VPR is not being given an impossible task, I double-checked by giving the same post-synthesis output to Vivado, and Vivado closed timing. Vivado result:

  -------------------------------------------------------------------
                         required time                         14.983    
                         arrival time                         -14.845    
  -------------------------------------------------------------------
                         slack                                  0.138  

Whereas VPR did not:

-------------------------------------------------------------------------------------------------------------------------------------------
data required time                                                                                                                   10.368
data arrival time                                                                                                                   -19.746
-------------------------------------------------------------------------------------------------------------------------------------------
slack (VIOLATED)                                                                                                                     -9.378
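
In both reports the slack is simply the required time minus the arrival time (the minus sign on the arrival line is just the report's presentation convention): 14.983 - 14.845 = 0.138 ns for Vivado versus 10.368 - 19.746 = -9.378 ns for VPR.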

In this case, VPR's delay model is underestimating the slack violation, but the correct critical path was being tracked (in routing at least), so VPR was working with a reasonably accurate model of the system.

The next step is to file an issue with the VTR developers to explore why VPR was unable to close timing.

litghost commented 5 years ago

VPR now closes timing when running with --astar_fac 0, but it takes an excessive amount of time (per https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/592). The problem was identified as the A* lookahead function returning poor estimates, which causes the router to find suboptimal paths. Setting --astar_fac 0 turns off the heuristic, so the router finds optimal paths, but it takes a long time to do so.

One possible solution to this excessive runtime is to improve the quality of the lookahead heuristic.
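
For reference, a minimal sketch of how the astar_fac knob steers the search (a generic A*-style formulation over a routing-resource graph, not VPR's actual code; all names here are hypothetical):

import heapq
import itertools

def route_net(neighbours, delay, lookahead, source, sink, astar_fac):
    """A*-style search: node priority = known cost so far + astar_fac * lookahead
    estimate of the remaining cost. With astar_fac = 0 this degenerates to
    Dijkstra and returns a minimum-delay path (slow but optimal); a larger
    astar_fac trusts the estimate more, so a poor lookahead steers the search
    toward worse paths."""
    best = {source: 0.0}
    tie = itertools.count()  # tie-breaker so the heap never compares node objects
    heap = [(astar_fac * lookahead(source, sink), 0.0, next(tie), source, [source])]
    while heap:
        _, cost, _, node, path = heapq.heappop(heap)
        if node == sink:
            return cost, path
        for nxt in neighbours(node):
            new_cost = cost + delay(node, nxt)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                priority = new_cost + astar_fac * lookahead(nxt, sink)
                heapq.heappush(heap, (priority, new_cost, next(tie), nxt, path + [nxt]))
    return None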

litghost commented 5 years ago

Lookahead debugging is still ongoing, with limited success so far.

litghost commented 5 years ago

Note: now that route timing is present on master, VPR runtime has increased as a side effect of the poor lookahead (screenshot of runtime comparison, 2019-06-05).

mithro commented 5 years ago

Do we want to disable timing import for now?

elms commented 5 years ago

I saw Travis build times have more than doubled and sometimes time out (50 minute limit). Maybe we can add -j2; trying that in #815.

litghost commented 5 years ago

With https://github.com/SymbiFlow/vtr-verilog-to-routing/pull/83 and https://github.com/SymbiFlow/symbiflow-arch-defs/pull/802 I've been able to reach a runtime/quality tradeoff that is starting to look reasonable:

A* fac   Fmax (MHz)   Pack (sec)   Place (sec)   Route (sec)
0           98.1606       135.61       2529.12        342.71
1.2         91.6605       138.38        720.81        350.37
1.6         72.1894       136.93        324.00        378.17
2           77.6808       131.92        239.23        349.40

I am currently using the same A* factor for both placement and routing; next week I will decouple them. If it turns out that the placement delay matrix quality is not the sensitive factor, we can run the delay matrix generation with a higher A* factor.

Across pack/place/route, VPR recomputes the router lookahead, which takes ~100 seconds each time. The variation in place time is entirely in the time to build the placer delay matrix.

litghost commented 5 years ago

For curiosity's sake, I ran both murax and picosoc with A* fac = 0 to see how well VPR could do. Picosoc ran into a router infinite loop, but murax completed. Timing was not closed, but as a sanity check I gave the same synthesis output to Vivado, ran only place and route, and VPR did pretty well by comparison:

Vivado:

Setup :          169  Failing Endpoints,  Worst Slack       -6.754ns,  Total Violation     -241.995ns

(screenshot: Vivado timing summary, 2019-06-11)

VPR:

Setup :          438  Failing Endpoints,  Worst Slack       -8.767ns,  Total Violation    -1156.880ns

(screenshot: VPR timing summary, 2019-06-11)

While VPR is returning worse results, the gap is not that big. This points instead to a problem with the synthesis step, as even Vivado isn't closing timing.

litghost commented 5 years ago

For reference, if Vivado is used for both synthesis and place and route, the design closes timing:

Setup :            0  Failing Endpoints,  Worst Slack        2.409ns,  Total Violation        0.000ns

litghost commented 5 years ago

Comparison of resource utilization:

Vivado synth:

+----------+------+---------------------+
| Ref Name | Used | Functional Category |
+----------+------+---------------------+
| FDRE     |  814 |        Flop & Latch |
| LUT6     |  413 |                 LUT |
| LUT5     |  199 |                 LUT |
| FDCE     |  198 |        Flop & Latch |
| LUT3     |  157 |                 LUT |
| LUT4     |  109 |                 LUT |
| LUT1     |   83 |                 LUT |
| LUT2     |   71 |                 LUT |
| CARRY4   |   50 |          CarryLogic |
| RAMD32   |   24 |  Distributed Memory |
| IBUF     |   18 |                  IO |
| OBUF     |   17 |                  IO |
| FDPE     |   12 |        Flop & Latch |
| RAMS32   |    8 |  Distributed Memory |
| RAMB18E1 |    6 |        Block Memory |
| MUXF7    |    6 |               MuxFx |
| BUFG     |    1 |               Clock |
+----------+------+---------------------+

Yosys synth:

+----------+------+---------------------+
| Ref Name | Used | Functional Category |
+----------+------+---------------------+
| LUT6     | 1912 |                 LUT |
| LUT5     | 1912 |                 LUT |
| FDRE     |  870 |        Flop & Latch |
| FDCE     |  198 |        Flop & Latch |
| RAMD64E  |  160 |  Distributed Memory |
| CARRY4   |   49 |          CarryLogic |
| MUXF7    |   20 |               MuxFx |
| IBUF     |   18 |                  IO |
| OBUF     |   17 |                  IO |
| FDPE     |   13 |        Flop & Latch |
| MUXF8    |    9 |               MuxFx |
| RAMB18E1 |    4 |        Block Memory |
| BUFHCE   |    1 |               Clock |
| BUFGCTRL |    1 |               Clock |
+----------+------+---------------------+

mithro commented 5 years ago

@litghost Can you run with Yosys for synthesis in both cases? The tcl that you need is found here -- also included below:

create_project -force -name top -part xc7a35t-csg324-1
read_xdc top.xdc
read_edif top.edif
link_design -top top -part xc7a35t-csg324-1
report_timing_summary -file top_timing_synth.rpt
report_utilization -hierarchical -file top_utilization_hierarchical_synth.rpt
report_utilization -file top_utilization_synth.rpt
opt_design
place_design
report_utilization -hierarchical -file top_utilization_hierarchical_place.rpt
report_utilization -file top_utilization_place.rpt
report_io -file top_io.rpt
report_control_sets -verbose -file top_control_sets.rpt
report_clock_utilization -file top_clock_utilization.rpt
route_design
phys_opt_design
report_timing_summary -no_header -no_detailed_paths
write_checkpoint -force top_route.dcp
report_route_status -file top_route_status.rpt
report_drc -file top_drc.rpt
report_timing_summary -datasheet -max_paths 10 -file top_timing.rpt
report_power -file top_power.rpt
set_property BITSTREAM.CONFIG.SPI_BUSWIDTH 4 [current_design]
write_bitstream -force top.bit 
write_cfgmem -force -format bin -interface spix4 -size 16 -loadbit "up 0x0 top.bit" -file top.bin
quit

mithro commented 5 years ago

Sorry I jumped the gun - it looks like you have already done that.

mithro commented 5 years ago

@eddiehung - Could you look at what Yosys is doing here?

litghost commented 5 years ago

@eddiehung @mithro With the yosyshq/xiag branch merged, Yosys now uses the following primitives:

     $lut                         1361
     CARRY4_VPR                     57
     CE_VCC                        433
     DPRAM64                       160
     DRAM_2_OUTPUT_STUB             80
     FDCE_ZINI                     198
     FDPE_ZINI                      13
     FDRE_ZINI                     870
     MUXF6                          93
     MUXF7                           2
     RAMB18E1_VPR                    4
     SR_GND                        870

which looks much closer to the Vivado synth. I'll run Vivado over this synthesis to check timing closure.

litghost commented 5 years ago

Looks like the new Yosys synthesis does a lot better in Vivado place and route:

Vivado reported usage

Report Cell Usage: 
+------+---------+------+
|      |Cell     |Count |
+------+---------+------+
|1     |BUFG     |     1|
|2     |CARRY4   |    57|
|3     |LUT1     |    32|
|4     |LUT2     |   285|
|5     |LUT3     |   259|
|6     |LUT4     |   272|
|7     |LUT5     |   327|
|8     |LUT6     |    93|
|9     |MUXF7    |     2|
|10    |RAM64X1D |    80|
|11    |RAMB18E1 |     4|
|12    |FDCE     |   198|
|13    |FDPE     |    13|
|14    |FDRE     |   870|
|15    |IBUF     |    18|
|16    |OBUF     |    17|
+------+---------+------+

Timing summary:

---------------------------------------------------------------------------------------------------
From Clock:  clk
  To Clock:  clk

Setup :            0  Failing Endpoints,  Worst Slack        1.434ns,  Total Violation        0.000ns
Hold  :            0  Failing Endpoints,  Worst Slack        0.071ns,  Total Violation        0.000ns
PW    :            0  Failing Endpoints,  Worst Slack        3.750ns,  Total Violation        0.000ns
---------------------------------------------------------------------------------------------------

litghost commented 5 years ago

So overall, the Yosys synthesis is still a little worse than the Vivado synthesis, but by a much smaller amount. Some of that difference is likely due to the change in RAM behavior; for example, Yosys doesn't infer any RAM32 primitives, and Vivado chose to implement the register file in BRAM instead of LUT-RAM.

litghost commented 5 years ago

With the new synthesis output, VPR is still failing timing, even with A* factor = 0:

---------------------------------------------------------------------------------------------------
From Clock:  io_mainClk
  To Clock:  io_mainClk

Setup :          342  Failing Endpoints,  Worst Slack       -4.497ns,  Total Violation     -540.910ns
Hold  :            0  Failing Endpoints,  Worst Slack        0.095ns,  Total Violation        0.000ns
PW    :            0  Failing Endpoints,  Worst Slack        3.750ns,  Total Violation        0.000ns
---------------------------------------------------------------------------------------------------

Next week I'll explore where the gap is coming from. @HackerFoo suggested adding a congestion-free placement analysis pass, which might shed light on how the placer is doing independently of the router.
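
A hypothetical sketch of what such a pass could compute (names and data layout are made up, not an existing VPR feature): estimate each register-to-register arc from logic delay plus the placement delta-delay model, ignoring routing congestion, and report the worst arc and its slack.

def worst_placement_arc(reg_to_reg_arcs, placement, delta_delay, clock_period):
    """reg_to_reg_arcs: iterable of (src_cell, dst_cell, logic_delay) tuples
    placement:   dict cell -> (x, y) grid location
    delta_delay: function (dx, dy) -> estimated routing delay
    Assumes at least one arc."""
    worst = None
    for src, dst, logic_delay in reg_to_reg_arcs:
        (sx, sy), (dx, dy) = placement[src], placement[dst]
        est = logic_delay + delta_delay(abs(sx - dx), abs(sy - dy))
        if worst is None or est > worst[0]:
            worst = (est, src, dst)
    return worst, clock_period - worst[0]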

mithro commented 5 years ago

@litghost are you able to get the same Vivado table for the Yosys synthesis?

litghost commented 5 years ago

@litghost are you able to get the same Vivado table for the Yosys synthesis?

What do you mean? I believe you are talking about https://github.com/SymbiFlow/symbiflow-arch-defs/issues/716#issuecomment-501400873 ?

litghost commented 5 years ago

I've found that increasing the bb factor (i.e. expanding the routing bounding box) was the missing link.

https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/658#issue-457755763 shows the current results, but it is possible that, with very slow parameters, VPR might close timing on murax!
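
For intuition, a minimal illustration of what expanding the bounding box does (illustrative only; the function and grid handling are made up, not VPR's implementation): the router normally restricts each net to the bounding box of its terminals, grown by the bb factor on every side, so a larger factor gives it more room to detour at the cost of a larger search space.

def expanded_bounding_box(terminals, bb_factor, grid_w, grid_h):
    """terminals: iterable of (x, y) pin locations for one net.
    Returns (xmin, xmax, ymin, ymax) clamped to the device grid."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(min(xs) - bb_factor, 0), min(max(xs) + bb_factor, grid_w - 1),
            max(min(ys) - bb_factor, 0), min(max(ys) + bb_factor, grid_h - 1))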

litghost commented 5 years ago

As of https://github.com/SymbiFlow/symbiflow-arch-defs/pull/986, all xc7 targets are closing timing at 50 MHz.