LLNL / SAMRAI

Structured Adaptive Mesh Refinement Application Infrastructure - a scalable C++ framework for block-structured AMR application development
https://computing.llnl.gov/projects/samrai

Bug using non-uniform workload to rebalance coarsest level with CascadePartitioner #250

Closed nselliott closed 11 months ago

nselliott commented 11 months ago

Reported by a user via email:

I am trying to use the CascadePartitioner load balancer with a non-uniform workload to rebalance the coarsest level. However, when I do so I get a segmentation fault when the CascadePartitioner tries to create a connector from workload_to_reference (see CascadePartitioner.cpp, line 339).

This can be reproduced by modifying line 499 of test/applications/LinAdv/main.cpp from

double dt_new = time_integrator->advanceHierarchy(dt_now);

to

double dt_new = time_integrator->advanceHierarchy(dt_now, true);

and running with the input file source/test/applications/LinAdv/test_inputs/test_nonuniform.2d.input (or the 3d one).

nselliott commented 11 months ago

I have reproduced this and I think there is a straightforward fix. We have a gap in testing these two features together: CascadePartitioner's non-uniform load balance and rebalancing Level 0 after its initialization.

We cannot support the non-uniform load balance for the initial decomposition of Level 0, since the feature requires workload data that cannot exist before Level 0 is fully initialized, but we should support it at any time after that, including before an application begins to advance.
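
The constraint described above can be sketched as a small decision helper. This is only an illustration: the enum, the function name, and both flags are hypothetical and not SAMRAI API, and the fallback to uniform balancing when workload data is unavailable is an assumption that the thread does not state.

```cpp
#include <cassert>

// Hypothetical names for illustration only; none of this is SAMRAI API.
enum class BalanceMode { Uniform, NonUniform };

// Non-uniform balancing needs per-cell workload data, which can only exist
// once Level 0 has been fully initialized. Assumption: when the data cannot
// exist yet, fall back to uniform balancing.
BalanceMode chooseBalanceMode(bool level_zero_initialized,
                              bool workload_requested)
{
   if (workload_requested && level_zero_initialized) {
      return BalanceMode::NonUniform;
   }
   return BalanceMode::Uniform;
}
```

Under these assumptions, the initial decomposition of Level 0 always maps to BalanceMode::Uniform, while any later rebalance that requests workload data maps to BalanceMode::NonUniform.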

nicolasaunai commented 11 months ago

We're seeing that too in our project, trying to balance L0 after initialization using the number of particles per mesh cell as the workload variable.
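
To illustrate what balancing by a per-cell workload means, here is a minimal one-dimensional sketch that assigns contiguous runs of cells to ranks so that the summed weights (for example, particle counts) are roughly equal. It is not how CascadePartitioner works internally; the function partitionByWorkload and its parameters are hypothetical, and weights are assumed to be non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of SAMRAI. Returns the owning rank of each
// cell. Cells stay in order, so every rank receives a contiguous run.
// Assumes all weights are non-negative.
std::vector<int> partitionByWorkload(const std::vector<double>& cell_weights,
                                     int num_ranks)
{
   assert(num_ranks > 0);
   const double total =
      std::accumulate(cell_weights.begin(), cell_weights.end(), 0.0);
   std::vector<int> owner(cell_weights.size(), 0);
   if (total <= 0.0) {
      return owner; // no workload at all: leave everything on rank 0
   }
   double prefix = 0.0; // summed weight of all cells before cell i
   for (std::size_t i = 0; i < cell_weights.size(); ++i) {
      // Place the cell according to the midpoint of its weight interval
      // within the cumulative workload.
      const double midpoint = prefix + 0.5 * cell_weights[i];
      int rank = static_cast<int>(midpoint / total * num_ranks);
      if (rank >= num_ranks) {
         rank = num_ranks - 1; // guard against round-off at the upper end
      }
      owner[i] = rank;
      prefix += cell_weights[i];
   }
   return owner;
}
```

With weights {4, 1, 1, 1, 1} and two ranks, the heavy first cell goes to rank 0 and the four light cells go to rank 1, so each rank carries a workload of 4, whereas splitting by cell count alone would give an uneven split.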

jesusbonilla commented 11 months ago

I have tried to patch CascadePartitioner; the diff below shows the changes. It seems to do the right thing, but I am not sure whether all of the changes make sense. Thanks again!

diff --git a/source/SAMRAI/mesh/CascadePartitioner.cpp b/source/SAMRAI/mesh/CascadePartitioner.cpp
index 338f555e2..095b570c7 100644
--- a/source/SAMRAI/mesh/CascadePartitioner.cpp
+++ b/source/SAMRAI/mesh/CascadePartitioner.cpp
@@ -384,20 +384,26 @@ CascadePartitioner::loadBalanceBoxLevel(
       std::shared_ptr<hier::PatchLevel> current_level(
          hierarchy->getPatchLevel(level_number));

+      const int ref_level_number = level_number > 0 ? level_number - 1 : level_number;
+
       const hier::Connector& current_to_reference = 
-         current_level->getBoxLevel()->findConnector(
+         current_level->getBoxLevel()->findConnectorWithTranspose(
             workload_to_reference->getHead(),
-            hierarchy->getRequiredConnectorWidth(level_number, level_number-1),
+            hierarchy->getRequiredConnectorWidth(level_number, ref_level_number),
+            hierarchy->getRequiredConnectorWidth(ref_level_number, level_number),
             hier::CONNECTOR_CREATE,
             true);

       const hier::Connector& reference_to_current =
-         workload_to_reference->getHead().findConnector(
+         workload_to_reference->getHead().findConnectorWithTranspose(
             *current_level->getBoxLevel(),
-            hierarchy->getRequiredConnectorWidth(level_number-1, level_number),
+            hierarchy->getRequiredConnectorWidth(ref_level_number, level_number),
+            hierarchy->getRequiredConnectorWidth(level_number, ref_level_number),
             hier::CONNECTOR_CREATE,
             true);

+      // current_to_reference.setTranspose(&reference_to_current, false);
+      // reference_to_current.setTranspose(&current_to_reference, false);
       /*
        * All of the above Connector work was so that we can call these
        * bridge operations to connect the current and workload levels.
diff --git a/source/SAMRAI/mesh/GriddingAlgorithm.cpp b/source/SAMRAI/mesh/GriddingAlgorithm.cpp
index 97d6b1c55..278bc97ea 100644
--- a/source/SAMRAI/mesh/GriddingAlgorithm.cpp
+++ b/source/SAMRAI/mesh/GriddingAlgorithm.cpp
@@ -468,7 +468,7 @@ GriddingAlgorithm::makeCoarsestLevel(

    d_load_balancer0->loadBalanceBoxLevel(
       *new_box_level,
-      0,
+      &new_to_domain,
       d_hierarchy,
       ln,
       smallest_patch,
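
The ref_level_number line in the diff above can be read in isolation as a clamp: on Level 0 the expression level_number-1 evaluates to -1, which is not a valid level index, so the patch reuses Level 0 itself as the reference level. A stand-alone sketch of that expression follows; the function name referenceLevelNumber is hypothetical and does not appear in SAMRAI.

```cpp
#include <cassert>

// Hypothetical stand-alone helper mirroring the ternary added in the patch.
// For level_number > 0 the reference level is the next coarser level;
// for Level 0 there is no coarser level, so Level 0 is used as the reference.
int referenceLevelNumber(int level_number)
{
   assert(level_number >= 0);
   return level_number > 0 ? level_number - 1 : level_number;
}
```

For example, referenceLevelNumber(0) returns 0 and referenceLevelNumber(3) returns 2, so a lookup keyed on the reference level never receives -1.
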

nselliott commented 11 months ago

#252 implemented the fix.