The-OpenROAD-Project / OpenROAD

OpenROAD's unified application implementing an RTL-to-GDS Flow. Documentation at https://openroad.readthedocs.io/en/latest/
https://theopenroadproject.org/
BSD 3-Clause "New" or "Revised" License

mpl2 recurses itself to a seg fault on BoomFrontend #6083

jeffng-or opened this issue 1 week ago

jeffng-or commented 1 week ago

Describe the bug

The macro placer runs for 14h before seg faulting on BoomFrontend, which is a sub-module of BoomTile. Note that the segfault isn't seen in the full BoomTile run, which runs for about 1h.

I've re-run the job in GDB and mpl2 is infinitely recursing itself into oblivion. Here's a snippet of the stack trace:


Thread 1 "openroad" received signal SIGSEGV, Segmentation fault.
0x00007ffff3294c5c in __pthread_create_2_1 (newthread=0x556604b97050, attr=0x0, start_routine=0x7ffff36dc240, arg=0x5566029a42b0) at ./nptl/pthread_create.c:621
621 ./nptl/pthread_create.c: No such file or directory.
(gdb) where
#0  0x00007ffff3294c5c in __pthread_create_2_1 (newthread=0x556604b97050, 
    attr=0x0, start_routine=0x7ffff36dc240, arg=0x5566029a42b0)
    at ./nptl/pthread_create.c:621
#1  0x00007ffff36dc329 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) ()
   from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x0000555558597310 in par::KWayFMRefine::InitializeGainBucketsKWay(std::vector<std::shared_ptr<par::PriorityQueue>, std::allocator<std::shared_ptr<par::PriorityQueue> > >&, std::shared_ptr<par::Hypergraph> const&, std::vector<int, std::allocator<int> > const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<float, std::allocator<float> > const&, std::vector<int, std::allocator<int> > const&) const ()
#3  0x00005555585988ff in par::KWayFMRefine::Pass(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > >&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >&, std::vector<float, std::allocator<float> >&, std::vector<int, std::allocator<int> >&, std::vector<bool, std::allocator<bool> >&) ()
#4  0x0000555558583528 in par::Refiner::Refine(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<int, std::allocator<int> >&) ()
#5  0x00005555585775aa in par::MultilevelPartitioner::InitialPartition(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >&, int&) const ()
#6  0x000055555857a144 in par::MultilevelPartitioner::SingleLevelPartition(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&, std::vector<std::vector<float, std::allocator<float> >, std::allocator<std::vector<float, std::allocator<float> > > > const&) const ()
#7  0x000055555857a7d4 in par::MultilevelPartitioner::Partition(std::shared_ptr<par::Hypergraph> const&, std::vector<std::vector<float, std::allocator<float> >, ...
#8  0x000055555854860f in par::TritonPart::MultiLevelPartition() ()
#9  0x000055555854a339 in par::TritonPart::PartitionKWaySimpleMode(unsigned int, float, unsigned int, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) ()
#10 0x000055555853f6e6 in par::PartitionMgr::PartitionKWaySimpleMode(unsigned int, float, unsigned int, std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) ()
#11 0x00005555581ba37c in mpl2::ClusteringEngine::breakLargeFlatCluster(mpl2::Cluster*) ()
#12 0x00005555581ba62b in mpl2::ClusteringEngine::breakLargeFlatCluster(mpl2::Cluster*) ()
#13 0x00005555581ba63a in mpl2::ClusteringEngine::breakLargeFlatCluster(mpl2::Cluster*) ()
...
#11113 0x00005555581bba5b in mpl2::ClusteringEngine::updateSubTree(mpl2::Cluster*) ()
#11114 0x00005555581c45b5 in mpl2::ClusteringEngine::multilevelAutocluster(mpl2::Cluster*) ()
#11115 0x00005555581c4ee7 in mpl2::ClusteringEngine::run() ()
#11116 0x0000555558145a48 in mpl2::HierRTLMP::runMultilevelAutoclustering() ()

The full-ish stack trace can be found at: https://drive.google.com/file/d/10MMydy8f761RPeXXE5FKgIVFDAtlWwCn/view?usp=sharing

The tarball can be found at: https://drive.google.com/file/d/1PH8jZAREhRn4NIVryR7pes3sKNGSIBqs/view?usp=sharing

Expected Behavior

A successful mpl2 run without a seg fault, completing in less than 1h

Environment

commit defc349ec719f45e115b85e317e95db86769e439 (HEAD -> master, origin/master, origin/HEAD)
Merge: e30f8fc8 8c3afb17
Author: Matt Liberty <mliberty@precisioninno.com>
Date:   Mon Oct 28 18:28:51 2024 -0700

    Merge pull request #2520 from Pinata-Consulting/makefile-do-floorplan-fix-2

    makefile: fix one more do-floorplan gaffe

To Reproduce

  1. unpack the tarball (link in the description)
  2. source your ORFS env.sh
  3. execute run-me-BoomFrontend-asap7-base.sh

Relevant log output

No response

Screenshots

No response

Additional Context

No response

AcKoucher commented 1 week ago

@jeffng-or Apparently it's not mpl2 itself that is blowing up. During clustering, we call par (TritonPart) to partition big flat clusters i.e., big clusters made of only leaf macros/std cells. Based on your log, the segfault is happening inside par.

jeffng-or commented 1 week ago

> @jeffng-or Apparently it's not mpl2 itself that is blowing up. During clustering, we call par (TritonPart) to partition big flat clusters i.e., big clusters made of only leaf macros/std cells. Based on your log, the segfault is happening inside par.

Sure, makes sense. The key point is that breakLargeFlatCluster recurses down 11100 frames (I think I cut the stack trace file off one level too soon, so my bad on that). Recursing down effectively without bound will eventually cause a failure somewhere, and it happens to be in par.
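
To make the mechanism concrete, here is a minimal standalone sketch (hypothetical names, not the actual mpl2 code) of why this never terminates: the recursion stops only when a cluster drops below a size threshold, but when the partitioner puts every cell into one child, that child is exactly as large as its parent, so every frame sees the same oversized cluster.

```cpp
#include <cstddef>

// Hypothetical model of the runaway recursion described above: `size` is the
// number of std cells in the flat cluster being broken.
void breakLargeFlatCluster(std::size_t size, std::size_t max_size)
{
  if (size <= max_size) {
    return;  // intended stopping condition
  }

  // Degenerate 2-way split as seen in the log: 31578 cells vs. 0 cells.
  std::size_t child0 = size;  // everything stays in one child
  std::size_t child1 = 0;     // the other child is empty

  // child0 == size, so this call faces the exact same problem; the recursion
  // only ends when the stack (or a thread resource) is exhausted.
  breakLargeFlatCluster(child0, max_size);
  breakLargeFlatCluster(child1, max_size);
}
```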

maliberty commented 1 week ago

@AcKoucher the end of the stack is in par but most of the stack is in mpl2. I think the problem is the recursion in breakLargeFlatCluster. How many parts are we trying to break this cluster down into? I suspect something is off in the cluster size.
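
As a hedged illustration of making that target explicit (not how mpl2 currently computes it): the number of parts could be derived directly from the cluster size and the size ceiling, so a single k-way call replaces repeated bisection.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative only: how many parts a single k-way partitioning call would
// need so that each part can fit under the size ceiling.
std::size_t computeNumParts(std::size_t num_std_cells, std::size_t max_size)
{
  if (max_size == 0) {
    return 1;
  }
  // Ceiling division, e.g. 31578 cells with a (hypothetical) 10000-cell
  // ceiling would ask the partitioner for 4 parts in one call.
  return std::max<std::size_t>(1, (num_std_cells + max_size - 1) / max_size);
}
```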

AcKoucher commented 1 week ago

@maliberty I see. I'll investigate.

maliberty commented 1 week ago

If that much splitting is necessary then you can write it non-recursively.
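
A minimal sketch of what a non-recursive version could look like, assuming a hypothetical Cluster type and a stubbed-out 2-way split in place of the real TritonPart call: oversized clusters sit on a work list, and a split that makes no progress is dropped instead of re-queued, so the worst case is a bounded loop rather than unbounded stack growth.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-in for mpl2's Cluster; only the leaf list matters here.
struct Cluster {
  std::vector<int> leaf_ids;
};

// Placeholder for the real partitioner call (TritonPart); it simply halves
// the leaf list so the example is self-contained.
static std::pair<Cluster, Cluster> splitInTwo(const Cluster& parent)
{
  const std::size_t half = parent.leaf_ids.size() / 2;
  Cluster child0;
  child0.leaf_ids.assign(parent.leaf_ids.begin(),
                         parent.leaf_ids.begin() + half);
  Cluster child1;
  child1.leaf_ids.assign(parent.leaf_ids.begin() + half,
                         parent.leaf_ids.end());
  return {std::move(child0), std::move(child1)};
}

// Iterative rewrite of the recursive break: process oversized clusters from a
// work list, and reject degenerate splits instead of re-queuing them.
void breakLargeFlatClusters(Cluster root, std::size_t max_size)
{
  std::deque<Cluster> work;
  work.push_back(std::move(root));

  while (!work.empty()) {
    Cluster parent = std::move(work.front());
    work.pop_front();

    if (parent.leaf_ids.size() <= max_size) {
      continue;  // already small enough
    }

    auto [child0, child1] = splitInTwo(parent);

    // Progress guard: if a child did not shrink, splitting it again can
    // never terminate, so keep the parent as-is.
    if (child0.leaf_ids.size() >= parent.leaf_ids.size()
        || child1.leaf_ids.size() >= parent.leaf_ids.size()) {
      continue;
    }

    work.push_back(std::move(child0));
    work.push_back(std::move(child1));
  }
}
```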

AcKoucher commented 6 days ago

Apparently TritonPart is doing a terrible job when trying to partition (ftq)_glue_logic:

[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0: Num Macros: 1 Num Std Cells: 31578
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_1: Num Macros: 0 Num Std Cells: 380
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_1: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0_0_0 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_0: Num Macros: 0 Num Std Cells: 0
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_1: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Breaking flat cluster (ftq)_glue_logic_0_0_0_0_0_0_0_1 with TritonPart
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_1_0: Num Macros: 1 Num Std Cells: 31198
[DEBUG MPL-multilevel_autoclustering] Setting Cluster Metrics for (ftq)_glue_logic_0_0_0_0_0_0_0_1_1: Num Macros: 0 Num Std Cells: 0
[sinks into oblivion ...]

AcKoucher commented 6 days ago

@maliberty I'm not sure how to proceed here. Should mpl2 reject the result and take care of splitting the cluster if the partitions generated by TritonPart are not good?
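
For what such a rejection could look like, a sketch against an assumed TritonPart output format of one part id per vertex (this is not an existing mpl2 check): count how many vertices each part received and treat any empty part as a failed partition before building child clusters from it.

```cpp
#include <vector>

// Illustrative check on a k-way partitioning result given as one part id per
// vertex: the partition is considered degenerate if any part ended up empty.
bool isDegeneratePartition(const std::vector<int>& part_of_vertex,
                           int num_parts)
{
  if (num_parts <= 0) {
    return true;
  }
  std::vector<int> count(num_parts, 0);
  for (const int part : part_of_vertex) {
    if (part < 0 || part >= num_parts) {
      return true;  // out-of-range id: treat as a failed result
    }
    ++count[part];
  }
  for (const int c : count) {
    if (c == 0) {
      return true;  // at least one empty part, e.g. the 31578/0 split above
    }
  }
  return false;
}
```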

maliberty commented 6 days ago

I think TP should be fixed.