The-OpenROAD-Project / OpenLane

OpenLane is an automated RTL to GDSII flow based on several components including OpenROAD, Yosys, Magic, Netgen and custom methodology scripts for design exploration and optimization.
https://openlane.readthedocs.io/
Apache License 2.0

NesterovSolve stuck, eats all available RAM, gets killed #1201

Closed Syndace closed 7 months ago

Syndace commented 2 years ago

Description

Hi. I'm in the awkward position of encountering a bug that I can only reproduce with IP that I'm not allowed to share. I'll try to describe everything in as much detail as possible, in the hope that someone might have an idea even without being able to reproduce the issue locally.

First, for context: the research project I'm working on attempts to build a fairly simple RISC-V core with a few extensions and some SRAM memory. However, we are not using the Skywater PDK, but a proprietary PDK that we have added to OpenLane locally. The general PDK configuration seems fine; the flow completes successfully on the project without the SRAM.

The part causing trouble is the SRAM: the SRAM cell was generated by the fab providing the PDK and delivered to us as lib/lef/gds files, behavioral models, etc., as usual. Possibly important here: the SRAM cell is not small and takes up almost one square millimeter on the chip.

What happens is that as soon as at least one of those SRAM cells is instantiated by the design, the initial global placement step gets stuck at some iteration of NesterovSolve, slowly eats away all of the available memory, and then either gets killed by the OS or just keeps hanging. When I say "all of the available memory" I really mean it: I added 500 GB of swap out of curiosity/to be absolutely sure and let it run for a day; it ate the full 500 GB plus the physical RAM I have. To track down the problem, I created a minimal project that consists of nothing but one of those memory cells (the following is from the synthesis statistics report):

Number of cells:                  1
     Name_Of_Memory_Cell      1

and even this project gets stuck during the initial global placement (step 5 of the flow):

[NesterovSolve] Iter: 1 overflow: 0 HPWL: 550190766

When hardening the full project with the RISC-V core and SRAM, I'm able to get past the initial global placement by setting PL_BASIC_PLACEMENT. With that, the flow gets stuck in the same fashion a few steps later, at step 10, "Running Placement Resizer Design Optimizations". The last log line I get in that case is:

[INFO RSZ-0058] Using max wire length 2489um.
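For reference, PL_BASIC_PLACEMENT is just a flow variable set in the design's config.tcl. A minimal sketch (the variable name is from the OpenLane documentation; the file layout is illustrative):

```tcl
# config.tcl: fall back to basic placement instead of the
# Nesterov-based global placer for this design.
set ::env(PL_BASIC_PLACEMENT) 1
```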

I have tried to reproduce the issue with memory generated by OpenRAM and the Skywater PDK, but wasn't able to. Notably, I wasn't able to generate a RAM cell of a size comparable to the proprietary one, so there's still a chance that the size plays a role.

The lib/lef/gds of the SRAM cell are included in the build using EXTRA_LIBS, EXTRA_LEFS and EXTRA_GDS_FILES.
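Concretely, the relevant part of our config.tcl looks roughly like this (paths and the cell name "my_sram" are placeholders, not the real IP):

```tcl
# Hand the flow the macro's liberty, LEF, and GDS views.
# "my_sram" stands in for the proprietary cell name.
set ::env(EXTRA_LIBS)      "$::env(DESIGN_DIR)/macros/lib/my_sram.lib"
set ::env(EXTRA_LEFS)      "$::env(DESIGN_DIR)/macros/lef/my_sram.lef"
set ::env(EXTRA_GDS_FILES) "$::env(DESIGN_DIR)/macros/gds/my_sram.gds"
```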

For completeness: I've also tried to build the project with the proprietary SRAM cell using the Skywater PDK. That doesn't make much sense, since it mixes PDKs with different layers, DRC rules, etc., but the flow got to the global placement anyway and got stuck there in the same fashion.

I hope this information gives at least a hint of what the cause might be; whatever it is, it is common to at least the initial global placement and the placement resizer design optimizations. I'll gladly try anything you can think of, run debug builds, or the like.

Environment

Kernel: Linux v5.10.0-15-amd64
Distribution: debian 11
Python: v3.9.2 (OK)
Container Engine: docker v20.10.5+dfsg1 (OK)
OpenLane Git Version: 83b61459abcaef3b0ddcf022e1a67e71c88f6007
pip: INSTALLED
pip:venv: INSTALLED
---
PDK Version Verification Status: OK
---
Git Log (Last 3 Commits)

83b6145 2022-06-26T18:19:11+02:00 Fix #1163 (#1164) - Mohamed Gaber -  (HEAD -> master, tag: 2022.06.27_01.36.21, origin/master, origin/HEAD)
a633b1f 2022-06-23T16:53:24+02:00 Rewrite pin spacing algorithm (#1160) - Mohamed Gaber -  (tag: 2022.06.24_01.37.58)
ebad315 2022-06-21T19:56:21+02:00 Fix Antenna Checkers, Magic Script Enhancements (#1154) - Mohamed Gaber -  (tag: 2022.06.22_01.42.47)

Other sections don't apply.

donn commented 2 years ago

@maliberty Thoughts?

maliberty commented 2 years ago

I guess we'll have to play twenty questions.

Syndace commented 2 years ago

Thanks!

I'll answer these based on the minimal design that consists of only a single RAM cell. About the design configuration: the following optional configuration options are set; everything else uses default values:

The flow is not otherwise modified.

> What is your utilization?

logs/placement/5-global.log reports:

[INFO GPL-0019] Util(%): 6.40

> What fraction of the area is taken by RAMs?

Since the design consists of only a single RAM cell, all of it.

> Have you tried manual macro placement?

No, thanks for the hint! Will report back once I've tried it.
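If I understand the documentation correctly, that would look something like this (instance name and coordinates are made up for illustration):

```tcl
# config.tcl: point the flow at a manual macro placement file.
set ::env(MACRO_PLACEMENT_CFG) "$::env(DESIGN_DIR)/macro_placement.cfg"
```

where macro_placement.cfg lists one macro per line as "instance_name x y orientation", e.g. "sram_i 50 50 N".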

> What do you mean by "initial global placement"?

Right, sorry, [NesterovSolve] comes after [InitialPlace] in the global placement, not as part of it.

> tspyrou can setup support under NDA if that is an option for you.

That's great, I'll find out whether that's possible for us.

maliberty commented 2 years ago

It's more of a developer tool, but perhaps you can try adding the following to your Tcl script before global_placement:

gpl::global_placement_debug

Then run that step in the GUI. It should show you how placement is proceeding with periodic pauses. Perhaps you can glean an idea as to what is going wrong.
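In full, that means something like the following in an interactive OpenROAD session (a sketch; gpl::global_placement_debug is an internal developer command, so its exact behavior may vary between OpenROAD versions):

```tcl
# Enable the global placer's debug view, then run placement
# and watch its progress in the GUI.
gpl::global_placement_debug
global_placement
```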

vijayank88 commented 2 years ago

@Syndace Can you share designs/name/runs/RUNxxx/config.tcl?

donn commented 2 years ago

> @Syndace Can you share designs/name/runs/RUNxxx/config.tcl?

I think given the proprietary nature of the IP and PDK, they clearly cannot. We'll have to navigate this problem in an unorthodox manner.

Syndace commented 2 years ago

First of all, thanks everyone for seriously helping with this one, I would've understood if your motivation to help with closed-source stuff was rather low.

Manual macro placement doesn't do the trick. The flow still gets stuck at "Running Placement Resizer Design Optimizations" with manually placed RAM cells.

runs/RUNxxx/config.tcl contains cell names, layer names, the name and dimensions of the core site, etc., so I'd rather not share it right now.

I haven't tried gpl::global_placement_debug in the GUI yet.

However, I made some progress. Instead of referencing the lib file of the SRAM cell via EXTRA_LIBS, I referenced the behavioral Verilog that comes with the SRAM cell using VERILOG_FILES_BLACKBOX. Doing that, in addition to manual macro placement, lets the flow complete on both the one-cell project and the full project including the RISC-V core and extensions. Now I'm unsure whether that's actually a good solution/workaround, since the lib file contains a good amount of timing, delay, capacitance, and leakage information that won't be available to the flow this way, and I wonder how things like STA work without it.
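For the record, the workaround amounts to something like this in config.tcl (paths and the cell name are placeholders):

```tcl
# Workaround: blackbox the SRAM with its behavioral model instead of
# handing the flow the liberty view via EXTRA_LIBS.
set ::env(VERILOG_FILES_BLACKBOX) "$::env(DESIGN_DIR)/macros/verilog/my_sram.v"
# LEF/GDS are still needed for placement and for streaming out the GDS.
set ::env(EXTRA_LEFS)             "$::env(DESIGN_DIR)/macros/lef/my_sram.lef"
set ::env(EXTRA_GDS_FILES)        "$::env(DESIGN_DIR)/macros/gds/my_sram.gds"
```

As noted above, without the .lib the macro's timing arcs are invisible to the tools, which is exactly the STA concern.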

maliberty commented 2 years ago

It doesn't sound like a correct solution, but it's hard to say without seeing the results.

Syndace commented 7 months ago

It seems this was solved in the meantime; I can no longer reproduce the issue :)