SCOREC / core

parallel finite element unstructured meshes
Other
181 stars 63 forks source link

Breaking changes with PCU #433

Open flagdanger opened 4 months ago

flagdanger commented 4 months ago

Breaking changes with PCU

jacobmerson commented 4 months ago

@cwsmith This draft pull request is our starting point for having breaking changes of core with the new PCU as discussed in Pull request #388 . The biggest thing that I don't like is the way parama group works because it still requires doing the comm switch operations.

However, my concern is that breaking that particular code will be a much more substantial thing that we want to force ourselves or the users deal with.

We are certainly open to suggestions on if there is a way out of that comm switch without totally blowing things up.

cwsmith commented 4 months ago

@jacobmerson @flagdanger Can we meet next week to briefly discuss? I see the call to switch in test/ptnParma.cc but would like to get the bigger picture view of the issue before digging into this more. I'll send an email.

cwsmith commented 3 months ago

There is a PCU usage example in the pumi users guide we should update: https://scorec.rpi.edu/pumi/PUMI.pdf (see section 12.2 page 72) The source for that users guide is here: https://github.com/SCOREC/pumi-docs/tree/master/userGuide

A pumi intro doc is here: https://github.com/SCOREC/pumi-intro we use this for the finite element programming doc.

A maintainers guide for pumi is here: https://github.com/SCOREC/core/wiki/Maintainer's-Guide it describes the process to create a release.

jacobmerson commented 3 months ago

@cwsmith Thanks for the review! This is a bit of a painful one as we had to touch so many files. Flynn is currently working on running the DeltaWing case. And once we have that sorted we will update this pull request.

@flagdanger please take a look at Cameron's review. Once you have a chance to look it over we can discuss the best approach for the areas that require more than simple changes.

flagdanger commented 3 months ago

@jacobmerson @cwsmith The DeltaWing case is running, and I'm fixing the small changes, Cameron, you've mentioned now.

flagdanger commented 3 months ago

@cwsmith Do you have a model to use to partition the deltaWing_500kMetric.smb mesh, so we can run your adaptPumi case on multiple processors?

cwsmith commented 3 months ago

@flagdanger Try passing '.null' as the argument; we don't have the CAD model for this mesh.

flagdanger commented 2 months ago

@cwsmith @jacobmerson Made changes based on Cameron's comments, I believe this is ready to be looked at again.

cwsmith commented 2 months ago

/runtests

cwsmith commented 2 months ago

/runtests

github-actions[bot] commented 2 months ago

Test Result: failure (details)

cwsmith commented 2 months ago

@flagdanger Would you please take a look at the the CI build failure with simmetrix enabled:

https://github.com/SCOREC/core/actions/runs/10118195635/job/27984611512#step:3:405

cd /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds && /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpicc -DHAVE_SIMMETRIX -DOMPI_SKIP_MPICXX -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu/reel -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/gmi -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/lion -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/can -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mth -g -Werror -Wall -Wextra -Wno-strict-overflow  -O3 -DNDEBUG -MD -MT mds/CMakeFiles/mds.dir/mds_apf.c.o -MF CMakeFiles/mds.dir/mds_apf.c.o.d -o CMakeFiles/mds.dir/mds_apf.c.o -c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mds/mds_apf.c
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc: In function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU)':
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1108:31: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1108 |     m->init(getSerendipity(), PCUObj);
      |                               ^~~~~~
      |                               |
      |                               pcu::PCU
In file included from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh2.h:14,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.h:6,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1:
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1110:33: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1110 |     m->init(getLagrange(order), PCUObj);
      |                                 ^~~~~~
      |                                 |
      |                                 pcu::PCU
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~

The environment and cmake commands used for the CI build on a SCOREC RHEL9 system are here:

https://github.com/SCOREC/core/blob/master/.github/workflows/simmetrix_enabled_pr_comment_trigger_self_hosted.yml#L34-L59

flagdanger commented 2 months ago

@flagdanger Would you please take a look at the the CI build failure with simmetrix enabled:

https://github.com/SCOREC/core/actions/runs/10118195635/job/27984611512#step:3:405

cd /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds && /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpicc -DHAVE_SIMMETRIX -DOMPI_SKIP_MPICXX -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu/reel -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/gmi -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/lion -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/can -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mth -g -Werror -Wall -Wextra -Wno-strict-overflow  -O3 -DNDEBUG -MD -MT mds/CMakeFiles/mds.dir/mds_apf.c.o -MF CMakeFiles/mds.dir/mds_apf.c.o.d -o CMakeFiles/mds.dir/mds_apf.c.o -c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mds/mds_apf.c
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc: In function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU)':
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1108:31: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1108 |     m->init(getSerendipity(), PCUObj);
      |                               ^~~~~~
      |                               |
      |                               pcu::PCU
In file included from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh2.h:14,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.h:6,
                 from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1:
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1110:33: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
 1110 |     m->init(getLagrange(order), PCUObj);
      |                                 ^~~~~~
      |                                 |
      |                                 pcu::PCU
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note:   initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
  111 |     void init(FieldShape* s, pcu::PCU *PCUObj);
      |                              ~~~~~~~~~~^~~~~~

The environment and cmake commands used for the CI build on a SCOREC RHEL9 system are here:

https://github.com/SCOREC/core/blob/master/.github/workflows/simmetrix_enabled_pr_comment_trigger_self_hosted.yml#L34-L59

Yes I'll look now, thanks.

flagdanger commented 2 months ago

@cwsmith Fixed the things you mentioned, and can run tests again.

cwsmith commented 2 months ago

/runtests

flagdanger commented 2 months ago

@cwsmith Can you run again, it failed with this error: This request was automatically failed because there were no enabled runners online to process the request for more than 1 days

cwsmith commented 2 months ago

/runtests

github-actions[bot] commented 2 months ago

Test Result: failure (details)

cwsmith commented 2 months ago

@flagdanger Thanks for checking in on this.

It looks like there build errors in the phasta code:

/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/ph.cc: In function 'apf::Mesh2* ph::loadMesh(gmi_model*&, const char*, pcu::PCU*)':
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/ph.cc:169:27: error: too few arguments to function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU*)'
  169 |     mesh = apf::createMesh(sim_mesh);

https://github.com/SCOREC/core/actions/runs/10251693152/job/28360310000#step:3:793

Please let me know if you need help setting up a build that reproduces this. I can meet on webex etc.

flagdanger commented 2 months ago

@cwsmith Thank you, could you run them again? And yes, if this fails again relating to simmetrix, then I will want to be able to reproduce these errors on my machine.

cwsmith commented 2 months ago

/runtests

github-actions[bot] commented 2 months ago

Test Result: failure (details)

cwsmith commented 2 months ago

It looks like the current errors are promoted warnings that are unrelated to this PR:

https://github.com/SCOREC/core/actions/runs/10269910071/job/28416451838#step:3:805

In member function 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)',
    inlined from 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:42:30,
    inlined from 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:39:8,
    inlined from 'void ph::getLinks(apf::Mesh*, int, Links&, BCs&)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:111:23:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:41:9: error: array subscript 'ph::PhastaSharing[0]' is partly outside array bounds of 'unsigned char [16]' [-Werror=array-bounds]
   41 |     if (isDG)
      |         ^~~~
In constructor 'ph::PhastaSharing::PhastaSharing(apf::Mesh*)',
    inlined from 'void ph::getLinks(apf::Mesh*, int, Links&, BCs&)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:100:22:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:23:39: note: object of size 16 allocated by 'operator new'
   23 |     helperN = new apf::NormalSharing(m);
      |                                       ^

I'll see if I can resolve these quickly in develop. The build is successful on develop but fails on this branch.

@flagdanger Would you please take a look at the changes to phasta/phLinks.cc? I can help debug if needed.

(ins)smithc11@asics: /space/smithc11/pumiDev/core (pcu-object-breaking)$ git diff develop phasta/phLinks.cc
diff --git a/phasta/phLinks.cc b/phasta/phLinks.cc
index 683bf2c1..73a2a0ff 100644
--- a/phasta/phLinks.cc
+++ b/phasta/phLinks.cc
@@ -1,4 +1,3 @@
-#include <PCU.h>
 #include "phLinks.h"
 #include "phAdjacent.h"
 #include <apf.h>
@@ -54,7 +53,7 @@ struct PhastaSharing : public apf::Sharing {
     if ( ! mesh->hasMatching())
       return;
     /* filter out matches which are on the same part as the global master */
-    int self = PCU_Comm_Self();
+    int self = mesh->getPCU()->Self();
     size_t i = 0;
     for (size_t j = 0; j < copies.getSize(); ++j)
       if (copies[j].peer != self)
@@ -98,40 +97,39 @@ struct PhastaSharing : public apf::Sharing {

 void getLinks(apf::Mesh* m, int dim, Links& links, BCs& bcs)
 {
-  PhastaSharing* shr = new PhastaSharing(m);
-  PCU_Comm_Begin();
+  PhastaSharing shr(m);
+  m->getPCU()->Begin();
   apf::MeshIterator* it = m->begin(dim);
   apf::MeshEntity* v;
   while ((v = m->iterate(it))) {
     apf::ModelEntity* me = m->toModel(v);
-    shr->isDG = ph::isInterface(m->getModel(),(gmi_ent*) me,bcs.fields["DG interface"]);
+    shr.isDG = ph::isInterface(m->getModel(),(gmi_ent*) me,bcs.fields["DG interface"]);
 /* the alignment is such that the owner part's
    array follows the order of its vertex iterator
    traversal. The owner dictates the order to the
    other part by sending remote copies */
-    if ( ! shr->isOwned(v))
+    if ( ! shr.isOwned(v))
       continue;
     apf::CopyArray remotes;
-    shr->getCopies(v, remotes);
+    shr.getCopies(v, remotes);
     for (size_t i = 0; i < remotes.getSize(); ++i) {
       /* in matching we may accumulate multiple occurrences
          of the same master in the outgoing links array
          to a part that contains multiple copies of it. */
       links[LinkKey(1, remotes[i].peer)].push_back(v);
-      PCU_COMM_PACK(remotes[i].peer, remotes[i].entity);
+      m->getPCU()->Pack(remotes[i].peer, remotes[i].entity);
     }
   }
   m->end(it);
-  PCU_Comm_Send();
-  while (PCU_Comm_Listen()) {
-    int peer = PCU_Comm_Sender();
-    while (!PCU_Comm_Unpacked()) {
+  m->getPCU()->Send();
+  while (m->getPCU()->Listen()) {
+    int peer = m->getPCU()->Sender();
+    while (!m->getPCU()->Unpacked()) {
       apf::MeshEntity* v;
-      PCU_COMM_UNPACK(v);
+      m->getPCU()->Unpack(v);
       links[LinkKey(0, peer)].push_back(v);
     }
   }
-  delete shr;
 }

 /* encode the local links into a big array of integers
@@ -218,7 +216,7 @@ static apf::MeshEntity* getOtherElem(apf::Mesh* m, apf::MeshEntity* elem,
     return 0;
   apf::Matches matches;
   m->getMatches(face, matches);
-  int self = PCU_Comm_Self();
+  int self = m->getPCU()->Self();
   for (size_t i = 0; i < matches.getSize(); ++i)
     if (matches[i].peer == self)
       return m->getUpward(matches[i].entity, 0);
flagdanger commented 2 months ago

@cwsmith Okay I believe it is fixed.

cwsmith commented 2 months ago

/runtests

github-actions[bot] commented 2 months ago

Test Result: failure (details)

flagdanger commented 2 months ago

@cwsmith Okay can you run again, I'm not getting these errors on my machine.

cwsmith commented 2 months ago

/runtests

github-actions[bot] commented 2 months ago

Test Result: failure (details)

flagdanger commented 1 month ago

@cwsmith Could you run tests again?

cwsmith commented 1 month ago

/runtests

github-actions[bot] commented 1 month ago

Test Result: success (details)

cwsmith commented 1 month ago

@cwsmith Todo:

flagdanger commented 1 month ago

@cwsmith, I also changed the file naming of PCUObj.h and PCU.h, and added explicitly declared data types to the templated add, min, max functions. I'm not seeing in the reviews any other changes we'd talked about making, but let me know if you'd like anything done differently.

jacobmerson commented 1 month ago

/runtests

cwsmith commented 1 month ago

@jacobmerson

cwsmith commented 1 month ago

Update:

/runtests

github-actions[bot] commented 1 month ago

Simmetrix + CGNS Test Result: failure

cwsmith commented 1 month ago

It looks like adding the C++14 standard into the mix for the cgns build causes some more errors to be emitted: https://github.com/SCOREC/core/actions/runs/10604091991/job/29389889050#step:4:187

jacobmerson commented 1 month ago

@flagdanger looks like that error is just an extra semicolon

flagdanger commented 1 month ago

@jacobmerson Could you try running tests again?

cwsmith commented 1 month ago

/runtests

github-actions[bot] commented 1 month ago

Simmetrix + CGNS Test Result: failure Build Log Simmetrix Test Result: success Simmetrix + CGNS Test Result: failure

cwsmith commented 1 week ago

/runtests

github-actions[bot] commented 1 week ago

Build Log Simmetrix Test Result: success Simmetrix + CGNS Test Result: failure