Open acomodi opened 4 years ago
@litghost @mithro @HackerFoo FYI
I have resumed the work on getting the lookahead in a mergeable state with upstream. First I successfully reproduced the results I have obtained a couple of weeks ago, than I have moved to use also the latest yosys master+wip and got improvements on placer run-time and quality of results as well.
By further analyzing the router behaviour with debugging and profiler enabled I still see a non-trivial amount of connections that take way too long compared to others as well as several backtracking step due to incorrect lookahead costs retrieval.
I need to collect more data to have a clearer view of what is happening.
One behaviour that I have seen, which should not happen IMO, is that the placer produces a solution with a ~1.5 ns lower than the final CPD obtained after the routing step.
By analyzing the routing debug information when running some of the symbiflow-arch-defs tests, I could see that the Worst connections
are relative to four kinds of categories:
Lookahead map:
Worst conn was RR node: 0 type: SOURCE location: (1,1) class: 0 capacity: 1 to RR node: 1101656 type: SINK location: (76,59) class: 235 capacity: 1 (crit: 0.980000) took 0.059810
Connection Box Lookahead map:
Worst conn was RR node: 0 type: SOURCE location: (1,1) class: 0 capacity: 1 to RR node: 1855792 type: SINK location: (61,106) class: 8 capacity: 1 (crit: 0.980000) took 0.028980
Connections that have a sink pin set to DMUX
Connections which involve a BRAM pin
Clock input connections:
Worst conn was RR node: 1396500 type: SOURCE location: (114,78) class: 53 capacity: 1 to RR node: 1855791 type: SINK location: (61,106) class: 7 capacity: 1 (crit: 0.980000) took 0.121228
One thing that I need to do to have a clearer view of what is happening is to have a version of VPR that implements both the connection box lookahead and the new lookahead map. This because I have tested the whole flow (packing, placement and routing) with two different VtR versions. In order to better compare the results, it is needed to have same .place and .net files in input to the routing step.
After some debugging of the router behaviour, I have applied some fixes to the new lookahead map code. The fixes I have added are:
add_paths
remove the site_pin_delay
from the pathsite_pin_delay
value that, prior to the fixes, could return an infinite valueI have run the baselitex test, once with the current master+wip and the arch-defs-master (baseline) and the second time with the new lookahead VPR code (https://github.com/acomodi/vtr-verilog-to-routing/tree/new-lookahead-map-fixes) and the archdefs master with some fixes to have it running with the new lookahead map (https://github.com/antmicro/symbiflow-arch-defs/tree/new-lookahead-map-integration) (changes).
Both the tests were run with the latest yosys master+wip which introduced a worsening in CPD and runtime as well.
Baseline:
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter Time pres BBs Heap Re-Rtd Re-Rtd Overused RR Nodes Wirelength CPD sTNS sWNS hTNS hWNS Est Succ
(sec) fac Updt push Nets Conns (ns) (ns) (ns) (ns) (ns) Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 75: 29 timing startpoints were not constrained during timing analysis
Warning 76: 3105 timing endpoints were not constrained during timing analysis
1 29.1 0.0 0 4.0e+08 15283 56484 26759 ( 0.914%) 688970 (10.7%) 19.300 -378.1 -2.634 0.000 0.000 N/A
2 8.9 4.0 0 9.3e+07 10892 41502 8515 ( 0.291%) 832288 (13.0%) 18.360 -127.1 -1.694 0.000 0.000 N/A
3 8.3 5.2 13 7.2e+07 6175 22296 4796 ( 0.164%) 879231 (13.7%) 18.293 -138.5 -1.627 0.000 0.000 N/A
4 10.0 6.8 50 7.8e+07 3674 15065 2616 ( 0.089%) 912145 (14.2%) 18.327 -136.1 -1.661 0.000 0.000 N/A
5 9.6 8.8 53 7.2e+07 2086 9442 1264 ( 0.043%) 941215 (14.7%) 18.480 -165.4 -1.814 0.000 0.000 N/A
6 8.4 11.4 80 6.1e+07 1102 5125 651 ( 0.022%) 956935 (14.9%) 18.496 -167.2 -1.830 0.000 0.000 N/A
7 10.8 14.9 75 7.4e+07 657 2749 368 ( 0.013%) 968462 (15.1%) 18.454 -163.2 -1.788 0.000 0.000 N/A
8 10.9 19.3 48 7.8e+07 391 1723 229 ( 0.008%) 975072 (15.2%) 18.581 -194.8 -1.915 0.000 0.000 N/A
9 13.1 25.1 45 8.7e+07 243 983 145 ( 0.005%) 979506 (15.2%) 18.676 -222.5 -2.010 0.000 0.000 N/A
10 15.8 32.6 38 9.6e+07 179 587 100 ( 0.003%) 983145 (15.3%) 18.676 -238.8 -2.010 0.000 0.000 18
11 21.8 42.4 28 1.3e+08 131 539 65 ( 0.002%) 986148 (15.4%) 18.676 -238.8 -2.010 0.000 0.000 20
12 19.0 55.1 23 1.1e+08 90 469 54 ( 0.002%) 988512 (15.4%) 18.676 -239.7 -2.010 0.000 0.000 20
13 21.6 71.7 16 1.3e+08 76 235 43 ( 0.001%) 990044 (15.4%) 18.676 -239.9 -2.010 0.000 0.000 22
14 15.0 93.2 19 8.7e+07 58 179 31 ( 0.001%) 990801 (15.4%) 18.676 -239.9 -2.010 0.000 0.000 23
15 9.9 121.1 9 5.8e+07 43 126 21 ( 0.001%) 991256 (15.4%) 18.676 -239.9 -2.010 0.000 0.000 24
16 4.6 157.5 10 2.9e+07 37 70 15 ( 0.001%) 991920 (15.4%) 18.676 -239.9 -2.010 0.000 0.000 24
17 5.0 204.7 11 3.0e+07 24 53 16 ( 0.001%) 992253 (15.4%) 18.676 -239.9 -2.010 0.000 0.000 25
18 5.3 266.2 5 3.2e+07 26 65 15 ( 0.001%) 992536 (15.4%) 18.676 -239.9 -2.010 0.000 0.000 26
19 5.5 346.0 4 3.2e+07 21 55 12 ( 0.000%) 992766 (15.5%) 18.676 -239.9 -2.010 0.000 0.000 28
20 7.2 449.8 4 4.0e+07 20 47 13 ( 0.000%) 993150 (15.5%) 18.676 -239.9 -2.010 0.000 0.000 29
21 3.1 584.8 4 1.8e+07 15 86 6 ( 0.000%) 993749 (15.5%) 18.676 -239.9 -2.010 0.000 0.000 32
22 1.7 760.2 5 9954971 10 13 4 ( 0.000%) 994033 (15.5%) 18.676 -239.9 -2.010 0.000 0.000 30
23 1.7 988.3 1 1.0e+07 8 15 2 ( 0.000%) 994270 (15.5%) 18.676 -240.5 -2.010 0.000 0.000 29
24 0.8 1284.7 2 4480699 3 5 2 ( 0.000%) 994370 (15.5%) 18.676 -240.5 -2.010 0.000 0.000 28
25 2.8 1670.2 0 1.5e+07 4 14 1 ( 0.000%) 994397 (15.5%) 18.676 -240.5 -2.010 0.000 0.000 27
26 0.1 2171.2 1 2193 1 1 0 ( 0.000%) 994363 (15.5%) 18.676 -240.5 -2.010 0.000 0.000 27
Restoring best routing
Critical path: 18.6762 ns
...
# Routing took 252.15 seconds (max_rss 3466.0 MiB, delta_rss +0.0 MiB)
Router run-time: 252.15 seconds CPD: 18.6762 ns
Changes:
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter Time pres BBs Heap Re-Rtd Re-Rtd Overused RR Nodes Wirelength CPD sTNS sWNS hTNS hWNS Est Succ
(sec) fac Updt push Nets Conns (ns) (ns) (ns) (ns) (ns) Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 305: 29 timing startpoints were not constrained during timing analysis
Warning 306: 3105 timing endpoints were not constrained during timing analysis
1 47.8 0.0 0 6.5e+08 15283 56484 27050 ( 0.924%) 682596 (10.6%) 20.496 -1235. -3.830 0.000 0.000 N/A
2 14.7 4.0 2 1.4e+08 10819 40675 7724 ( 0.264%) 819197 (12.8%) 19.085 -197.3 -2.419 0.000 0.000 N/A
3 13.5 5.2 14 1.1e+08 5926 21728 4322 ( 0.148%) 859622 (13.4%) 19.064 -209.6 -2.398 0.000 0.000 N/A
4 13.4 6.8 32 1.0e+08 3416 15002 2288 ( 0.078%) 892266 (13.9%) 19.095 -208.7 -2.429 0.000 0.000 N/A
5 11.2 8.8 52 8.1e+07 1848 8928 1023 ( 0.035%) 915860 (14.3%) 19.124 -216.9 -2.458 0.000 0.000 N/A
6 6.5 11.4 46 4.7e+07 941 4213 445 ( 0.015%) 930869 (14.5%) 19.275 -264.7 -2.609 0.000 0.000 N/A
7 4.2 14.9 58 3.3e+07 419 2688 176 ( 0.006%) 938294 (14.6%) 19.275 -271.1 -2.609 0.000 0.000 N/A
8 5.2 19.3 26 3.4e+07 199 825 94 ( 0.003%) 942985 (14.7%) 19.275 -270.8 -2.609 0.000 0.000 N/A
9 5.1 25.1 25 3.3e+07 108 384 59 ( 0.002%) 944694 (14.7%) 19.275 -270.7 -2.609 0.000 0.000 N/A
10 6.4 32.6 10 4.0e+07 80 251 35 ( 0.001%) 946547 (14.7%) 19.275 -270.9 -2.609 0.000 0.000 14
11 6.0 42.4 10 3.4e+07 46 105 24 ( 0.001%) 947326 (14.7%) 19.275 -270.9 -2.609 0.000 0.000 16
12 7.4 55.1 5 3.9e+07 36 92 16 ( 0.001%) 948416 (14.8%) 19.275 -270.8 -2.609 0.000 0.000 16
13 4.1 71.7 8 2.2e+07 25 67 14 ( 0.000%) 949500 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 18
14 7.1 93.2 7 3.6e+07 20 48 10 ( 0.000%) 949967 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 19
15 5.7 121.1 4 2.8e+07 18 43 7 ( 0.000%) 950189 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 20
16 2.2 157.5 2 1.1e+07 10 29 6 ( 0.000%) 950498 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 20
17 6.6 204.7 1 2.7e+07 13 38 5 ( 0.000%) 950935 (14.8%) 19.275 -271.3 -2.609 0.000 0.000 21
18 2.5 266.2 3 9575677 8 66 3 ( 0.000%) 951352 (14.8%) 19.275 -271.3 -2.609 0.000 0.000 22
19 0.3 346.0 3 1583485 4 11 4 ( 0.000%) 951448 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 22
20 5.0 449.8 0 1.7e+07 6 27 3 ( 0.000%) 951613 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 23
21 2.3 584.8 2 8493268 5 9 2 ( 0.000%) 951715 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 24
22 2.1 760.2 3 7624080 4 11 2 ( 0.000%) 951657 (14.8%) 19.275 -271.2 -2.609 0.000 0.000 24
23 1.7 988.3 0 7082008 4 5 1 ( 0.000%) 951808 (14.8%) 19.275 -271.4 -2.609 0.000 0.000 25
24 0.0 1284.7 0 1612 1 1 0 ( 0.000%) 951806 (14.8%) 19.275 -271.4 -2.609 0.000 0.000 24
Restoring best routing
Critical path: 19.2752 ns
...
# Routing took 183.15 seconds (max_rss 3602.4 MiB, delta_rss +0.0 MiB)
Router run-time: 183.15 seconds CPD: 19.2752 ns
To sum up, the tests performed show a reduce in run-time equal to 27.36% and an increase in CPD equal to 3.10% w.r.t. the baseline.
GitHubVerilog to Routing -- Open Source CAD Flow for FPGA Research - acomodi/vtr-verilog-to-routing
GitHubFOSS architecture definitions of FPGA hardware useful for doing PnR device generation. - antmicro/symbiflow-arch-defs
To sum up, the tests performed show a reduce in run-time equal to 27.36% and an increase in CPD equal to 3.10% w.r.t. the baseline.
I think this is close enough that we really should work on getting the lookahead changes on upstream. I expect that the reduction is runtime could be converted into the same runtime with same or better CPD by lowing the astar factor.
Overall, great work. Let's start to get your changes in upstream VPR.
To sum up, the tests performed show a reduce in run-time equal to 27.36% and an increase in CPD equal to 3.10% w.r.t. the baseline.
To be clear, the new lookahead is based on an extension of the upstream map lookahead, comparing against the downstream connection box lookahead?
To be clear, the new lookahead is based on an extension of the upstream map lookahead, comparing against the downstream connection box lookahead?
Indeed. Actually the new lookahead uses code from both upstream map and connection box, where the connection boxes are not used anymore. I'm general:
From upstream
From downstream connection boxes:
Okay, so the sampling method and Dijkstra expansion (for place delay) should be both go upstream, yes? The penalty cost stuff is something we need to handle separately.
For my own clarity, all the changes you have that are not upstream only affect the generation of the lookahead files, not in their use?
Okay, so the sampling method and Dijkstra expansion (for place delay) should be both go upstream, yes?
Djikstra expansion is actually related to the lookahead generation, when expanding the sample nodes to find the costs. The upstream lookahead uses a different way of expanding nodes which is less performant than the connection box expansion.
The problem I am facing is that, running the regression tests with the current new lookahead generation (sampling + expanding nodes) is that they take way too long to compute the lookahead itself.
I am thinking that it would be best, in order to not interfere with the upstream lookahead map, to set another lookahead option which may be called extended_map
which makes use of Dusty's sampling method (which produces more accurate samples, at least for 7-Series).
The penalty cost stuff is something we need to handle separately
To be clear, the penalty cost stuff is related to assigning a cost to the lookahead map entry based on their distance from the origin node, here
For my own clarity, all the changes you have that are not upstream only affect the generation of the lookahead files, not in their use?
The only change in the use of the lookahead file is an addition to get the CHAN -> IPIN delay cost as well as the OPIN -> CHAN. Apart from that, the ways the lookahead is used do not change.
PR upstream open here: https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1449
Additional note: the fixes resulted in an increase of the lookahead.bin
size from 15M to 55M for the 50T device.
Update: I am trying to get the upstream lookahead and the downstream new lookahead as similar as possible. I have updated https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1449 to get towards the goal of having less differences between the two approaches. I am also testing the PR with the quick titan qor to get an understanding on how the changes to the upstream lookahead affect QoR. For what I am seeing by preliminary results, CPD is lower, but run-time to compute the lookahead and routing runtime has increased.
I have encountered one issue though that regards QoR and runtime with the extended lookahead and symbiflow targets. The problem is that I am not able to reproduce the results reported in the above comment. By building everything using those commits I do get worse results, and I am trying to figure out what is causing this behaviour. I think that I still need to get in the debugging phase of the new lookahead, as it is clearly still unstable.
Results differences
I was able to identify the issue related to the differences in the results obtained recently with the one reported in the above comment
The difference is due to the place delay matrix calculation. For some reason, while running the test that got me the results of the new lookahead, the delay matrix that was used was the one generated with the astar
method instead of dijkstra
. It results, after several experiments, that, with the "new" lookahead, the delay matrix generated with dijkstra performs worse.
The same applies for the connection box lookahead. I have noticed that, even though the dijkstra expansion for the delay matrix is already in the master+wip, it is not currently enabled in archdefs, therefore, the delay matrix uses the original astar method (which makes use of the lookahead to get the delay results).
For what I have seen. the dijkstra delay matrix generation, instead, uses no lookahead to get the delays, and I think it makes sense as otherwise we would fall back to use the astar method to calculate the delays.
Other issues:
During debugging, I also found a bug with the new lookahead code. Instead of the connection box ids, the new lookahead distinguishes between CHANX and CHANY nodes to get the cost entry for a specific node -> SINK connection.
The lookahead matrix is indexed as follows:
cost_map[seg_index][chan_type][deltax][deltay]
The problem was that CHAN type was always the same. This was resulting by the fact that I have erroneously got the rr_type from the target node instead of the source node. This bug got undetected as the node type was wrong both when generating the lookahead and when querying it to get the cost entry.
By fixing this issue and having a disctinction between CHANX and CHANY of the source node, the router performs worse than the case in which CHANX and CHANY were merged in a unique entry.
The other thing I debugged was the aster method to get the delay matrix huge runtime. When calculating the delay matrix for the 50T, there are some unroutable connections that appear when trying to connect sources and sinks of the CLBs to all the other nodes.
For instance, I get the following unroutable connection:
Cannot route from BLK-TL-CLBLL_L[0].CLBLL_LL_A[0] (RR node: 31398 type: SOURCE location: (11,2) class: 31 capacity: 1) to BLK-TL-HCLK_IOI3[0].HCLK_IOI_IDELAYCTRL_REFCLK[0] (RR node: 2927656 type: SINK location: (114,133) class: 5 capacity: 1) -- no possible path
Clearly, there CLBs cannot be connected to the REFCLK of the IDELAYCTRL as that pin is within the clocking network and can be connected to only a clock tile.
In the case of using the connection box lookahead, the excpected cost returned is equal to infinite. This happens here https://github.com/SymbiFlow/vtr-verilog-to-routing/blob/8980e46218542888fac879961b13aa7b0fba8432/vpr/src/route/connection_box_lookahead_map.cpp#L417-L431
I suppose that the for loop never gets executed and the lookahead results to be infinite, with the router exiting and returning an unroutable connection
and getting to the next delay.
In the case of the new lookahead, instead, the expected cost does not result to be infinite and the cost entry appears to be valid. Therefore the router tries to find a path, taking a huge amount of time before understanding that the connection is unroutable.
Under these newly discovered data, I am unsure how to proceed. Regarding the CHANX/CHANY distinction, I think that collapsing the cost entries in one entry only is beneficial, but the new lookahead greatly depends on the placement delay matrix.
@litghost I think that, with the new information I got and summarized above, there are still issues to be solved with the lookahead if we want to stop using the connection boxes.
Currently, I am unsure of how to deal with the place_delay_matrix issue.
@litghost I think that, with the new information I got and summarized above, there are still issues to be solved with the lookahead if we want to stop using the connection boxes.
Currently, I am unsure of how to deal with the place_delay_matrix issue.
The existing baseline usage of the connection box lookahead should be using Dijkstra for computing the place_delay_matrix, so we should leave that variable fixed.
Ok, we should then enable that in archdefs, as it is not currently set. From what I have seen, this will likely result in a worsening of QoR and runtime
Ok, we should then enable that in archdefs, as it is not currently set. From what I have seen, this will likely result in a worsening of QoR and runtime
It already is enabled, see https://github.com/SymbiFlow/symbiflow-arch-defs/blob/f0e7b4212544e1d55da776fb7a2ff79117e01454/xc/common/cmake/arch_define.cmake#L64
GitHubFOSS architecture definitions of FPGA hardware useful for doing PnR device generation. - SymbiFlow/symbiflow-arch-defs
Actually I think we might have missed to upgrade also the device_define parameters, hence the behaviour I have seen: https://github.com/SymbiFlow/symbiflow-arch-defs/blob/f0e7b4212544e1d55da776fb7a2ff79117e01454/xc/common/cmake/device_define.cmake#L357
GitHubFOSS architecture definitions of FPGA hardware useful for doing PnR device generation. - SymbiFlow/symbiflow-arch-defs
Good catch, let me review the CI logs.
Actually I think we might have missed to upgrade also the device_define parameters, hence the behaviour I have seen: https://github.com/SymbiFlow/symbiflow-arch-defs/blob/f0e7b4212544e1d55da776fb7a2ff79117e01454/xc/common/cmake/device_define.cmake#L357
GitHubSymbiFlow/symbiflow-arch-defsFOSS architecture definitions of FPGA hardware useful for doing PnR device generation. - SymbiFlow/symbiflow-arch-defs
You are right, but this was an oversight. I believe the cache args were overlooked when we incorporated the Dijkstra expansion upstream :(
GitHubFOSS architecture definitions of FPGA hardware useful for doing PnR device generation. - SymbiFlow/symbiflow-arch-defs
I am currently running tests and trying to get better results out of the extended lookahead map. So far, the tests I have performed are based on a master+wip VTR version which has been rebased on top of the current VTR upstream master and, on which, I have applied the extended lookahead as well.
Even though the VTR version is the same for both changes
and baseline
tests (which generates same place delay matrix and same packing output), the placer seem to produce different results when building the two set of tests.
To overcome this issue, and given that the packed netlist is the same in both the test sets, I could use the placer output of the changes
test set for the baseline
test set.
Below there is a summary of these tests:
Run Time (s) | Baseline | Changes | Ratio |
---|---|---|---|
murax | 7.55 | 13.75 | 1.68x |
picosoc | 18.18 | 30.68 | 1.68x |
mini litex | 22.67 | 35.37 | 1.56x |
mini litex ddr | 71.86 | 117.52 | 1.63x |
litex linux | 333.26 | 293.20 | 0.87x |
Critical Path (ns) | Baseline | Changes | Ratio |
---|---|---|---|
murax | 13.3948 | 13.3404 | 0.995x |
picosoc | 15.8749 | 15.8356 | 0.997x |
mini litex | 13.3472 | 13.421 | 1.005x |
mini litex ddr | 18.3625 | 18.2212 | 0.992 |
litex linux | 19.1891 | 18.4803 | 0.963x |
The run-time hit is still present, with the exception of the baselitex design, while the CPD is mostly similar in both the tests set.
Regarding the VTR regression tests, even by fixing the CHANX/CHANY collapse issue, the some of the tests end up failing the QoR checks. I am currently trying to get an implementation that keeps the old upstream lookahead generation and usage, while having the two solutions (lookahead map and extended lookahead map) as similar as possible.
Related issues/PRs
There have been efforts to push the lookahead used in this fork upstream. All the related work/information can be found in the following links: https://github.com/verilog-to-routing/vtr-verilog-to-routing/issues/1325 https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1351 https://github.com/verilog-to-routing/vtr-verilog-to-routing/pull/1367
Current status
The integration of the connection box based lookahead and the upstream lookahead has been partially done. To reproduce the results obtained so far, the following versions of the tools are needed:
The VTR branch is still under development as it is not yet 100% functional. In fact, there still are two assertions that do fail and have been temporarily disabled to continue in the P&R flow. They are probably due to some incorrect lookahead values for some delta_x/y and some segment types and/or channels.
The results obtained by using the above commits are the following for the base litex design: