Open flagdanger opened 4 months ago
@cwsmith This draft pull request is our starting point for having breaking changes of core with the new PCU as discussed in Pull request #388 . The biggest thing that I don't like is the way parama group works because it still requires doing the comm switch operations.
However, my concern is that breaking that particular code will be a much more substantial thing that we want to force ourselves or the users deal with.
We are certainly open to suggestions on if there is a way out of that comm switch without totally blowing things up.
@jacobmerson @flagdanger Can we meet next week to briefly discuss? I see the call to switch in test/ptnParma.cc
but would like to get the bigger picture view of the issue before digging into this more. I'll send an email.
There is a PCU usage example in the pumi users guide we should update: https://scorec.rpi.edu/pumi/PUMI.pdf (see section 12.2 page 72) The source for that users guide is here: https://github.com/SCOREC/pumi-docs/tree/master/userGuide
A pumi intro doc is here: https://github.com/SCOREC/pumi-intro we use this for the finite element programming doc.
A maintainers guide for pumi is here: https://github.com/SCOREC/core/wiki/Maintainer's-Guide it describes the process to create a release.
@cwsmith Thanks for the review! This is a bit of a painful one as we had to touch so many files. Flynn is currently working on running the DeltaWing case. And once we have that sorted we will update this pull request.
@flagdanger please take a look at Cameron's review. Once you have a chance to look it over we can discuss the best approach for the areas that require more than simple changes.
@jacobmerson @cwsmith The DeltaWing case is running, and I'm fixing the small changes, Cameron, you've mentioned now.
@cwsmith Do you have a model to use to partition the deltaWing_500kMetric.smb mesh, so we can run your adaptPumi case on multiple processors?
@flagdanger Try passing '.null' as the argument; we don't have the CAD model for this mesh.
@cwsmith @jacobmerson Made changes based on Cameron's comments, I believe this is ready to be looked at again.
/runtests
/runtests
Test Result: failure (details)
@flagdanger Would you please take a look at the the CI build failure with simmetrix enabled:
https://github.com/SCOREC/core/actions/runs/10118195635/job/27984611512#step:3:405
cd /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds && /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpicc -DHAVE_SIMMETRIX -DOMPI_SKIP_MPICXX -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu/reel -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/gmi -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/lion -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/can -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mth -g -Werror -Wall -Wextra -Wno-strict-overflow -O3 -DNDEBUG -MD -MT mds/CMakeFiles/mds.dir/mds_apf.c.o -MF CMakeFiles/mds.dir/mds_apf.c.o.d -o CMakeFiles/mds.dir/mds_apf.c.o -c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mds/mds_apf.c
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc: In function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU)':
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1108:31: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
1108 | m->init(getSerendipity(), PCUObj);
| ^~~~~~
| |
| pcu::PCU
In file included from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh2.h:14,
from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.h:6,
from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1:
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note: initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
111 | void init(FieldShape* s, pcu::PCU *PCUObj);
| ~~~~~~~~~~^~~~~~
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1110:33: error: cannot convert 'pcu::PCU' to 'pcu::PCU*'
1110 | m->init(getLagrange(order), PCUObj);
| ^~~~~~
| |
| pcu::PCU
/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note: initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)'
111 | void init(FieldShape* s, pcu::PCU *PCUObj);
| ~~~~~~~~~~^~~~~~
The environment and cmake commands used for the CI build on a SCOREC RHEL9 system are here:
@flagdanger Would you please take a look at the the CI build failure with simmetrix enabled:
https://github.com/SCOREC/core/actions/runs/10118195635/job/27984611512#step:3:405
cd /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds && /opt/scorec/spack/rhel9/v0201_4/install/linux-rhel9-x86_64/gcc-12.3.0/mpich-4.1.1-xpoyz4tqgfxtrm6m7qq67q4ccp5pnlre/bin/mpicc -DHAVE_SIMMETRIX -DOMPI_SKIP_MPICXX -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/_temp/build/mds -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/pcu/reel -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/gmi -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/lion -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/can -I/lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mth -g -Werror -Wall -Wextra -Wno-strict-overflow -O3 -DNDEBUG -MD -MT mds/CMakeFiles/mds.dir/mds_apf.c.o -MF CMakeFiles/mds.dir/mds_apf.c.o.d -o CMakeFiles/mds.dir/mds_apf.c.o -c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/mds/mds_apf.c /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc: In function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU)': /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1108:31: error: cannot convert 'pcu::PCU' to 'pcu::PCU*' 1108 | m->init(getSerendipity(), PCUObj); | ^~~~~~ | | | pcu::PCU In file included from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh2.h:14, from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.h:6, from /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1: /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note: initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)' 111 | void init(FieldShape* s, pcu::PCU *PCUObj); | ~~~~~~~~~~^~~~~~ /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf_sim/apfSIM.cc:1110:33: error: cannot convert 'pcu::PCU' to 'pcu::PCU*' 1110 | m->init(getLagrange(order), PCUObj); | ^~~~~~ | | | pcu::PCU /lore/cwsmith/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/apf/apfMesh.h:111:40: note: initializing argument 2 of 'void apf::Mesh::init(apf::FieldShape*, pcu::PCU*)' 111 | void init(FieldShape* s, pcu::PCU *PCUObj); | ~~~~~~~~~~^~~~~~
The environment and cmake commands used for the CI build on a SCOREC RHEL9 system are here:
Yes I'll look now, thanks.
@cwsmith Fixed the things you mentioned, and can run tests again.
/runtests
@cwsmith Can you run again, it failed with this error: This request was automatically failed because there were no enabled runners online to process the request for more than 1 days
/runtests
Test Result: failure (details)
@flagdanger Thanks for checking in on this.
It looks like there build errors in the phasta code:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/ph.cc: In function 'apf::Mesh2* ph::loadMesh(gmi_model*&, const char*, pcu::PCU*)':
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/ph.cc:169:27: error: too few arguments to function 'apf::Mesh2* apf::createMesh(pParMesh, pcu::PCU*)'
169 | mesh = apf::createMesh(sim_mesh);
https://github.com/SCOREC/core/actions/runs/10251693152/job/28360310000#step:3:793
Please let me know if you need help setting up a build that reproduces this. I can meet on webex etc.
@cwsmith Thank you, could you run them again? And yes, if this fails again relating to simmetrix, then I will want to be able to reproduce these errors on my machine.
/runtests
Test Result: failure (details)
It looks like the current errors are promoted warnings that are unrelated to this PR:
https://github.com/SCOREC/core/actions/runs/10269910071/job/28416451838#step:3:805
In member function 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)',
inlined from 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:42:30,
inlined from 'virtual bool ph::PhastaSharing::isOwned(apf::MeshEntity*)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:39:8,
inlined from 'void ph::getLinks(apf::Mesh*, int, Links&, BCs&)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:111:23:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:41:9: error: array subscript 'ph::PhastaSharing[0]' is partly outside array bounds of 'unsigned char [16]' [-Werror=array-bounds]
41 | if (isDG)
| ^~~~
In constructor 'ph::PhastaSharing::PhastaSharing(apf::Mesh*)',
inlined from 'void ph::getLinks(apf::Mesh*, int, Links&, BCs&)' at /lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:100:22:
/lore/smithc11/githubSelfHostedRunner/actions-runner-2.317.0/_work/core/core/core_433/phasta/phLinks.cc:23:39: note: object of size 16 allocated by 'operator new'
23 | helperN = new apf::NormalSharing(m);
| ^
I'll see if I can resolve these quickly in develop. The build is successful on develop but fails on this branch.
@flagdanger Would you please take a look at the changes to phasta/phLinks.cc
? I can help debug if needed.
(ins)smithc11@asics: /space/smithc11/pumiDev/core (pcu-object-breaking)$ git diff develop phasta/phLinks.cc
diff --git a/phasta/phLinks.cc b/phasta/phLinks.cc
index 683bf2c1..73a2a0ff 100644
--- a/phasta/phLinks.cc
+++ b/phasta/phLinks.cc
@@ -1,4 +1,3 @@
-#include <PCU.h>
#include "phLinks.h"
#include "phAdjacent.h"
#include <apf.h>
@@ -54,7 +53,7 @@ struct PhastaSharing : public apf::Sharing {
if ( ! mesh->hasMatching())
return;
/* filter out matches which are on the same part as the global master */
- int self = PCU_Comm_Self();
+ int self = mesh->getPCU()->Self();
size_t i = 0;
for (size_t j = 0; j < copies.getSize(); ++j)
if (copies[j].peer != self)
@@ -98,40 +97,39 @@ struct PhastaSharing : public apf::Sharing {
void getLinks(apf::Mesh* m, int dim, Links& links, BCs& bcs)
{
- PhastaSharing* shr = new PhastaSharing(m);
- PCU_Comm_Begin();
+ PhastaSharing shr(m);
+ m->getPCU()->Begin();
apf::MeshIterator* it = m->begin(dim);
apf::MeshEntity* v;
while ((v = m->iterate(it))) {
apf::ModelEntity* me = m->toModel(v);
- shr->isDG = ph::isInterface(m->getModel(),(gmi_ent*) me,bcs.fields["DG interface"]);
+ shr.isDG = ph::isInterface(m->getModel(),(gmi_ent*) me,bcs.fields["DG interface"]);
/* the alignment is such that the owner part's
array follows the order of its vertex iterator
traversal. The owner dictates the order to the
other part by sending remote copies */
- if ( ! shr->isOwned(v))
+ if ( ! shr.isOwned(v))
continue;
apf::CopyArray remotes;
- shr->getCopies(v, remotes);
+ shr.getCopies(v, remotes);
for (size_t i = 0; i < remotes.getSize(); ++i) {
/* in matching we may accumulate multiple occurrences
of the same master in the outgoing links array
to a part that contains multiple copies of it. */
links[LinkKey(1, remotes[i].peer)].push_back(v);
- PCU_COMM_PACK(remotes[i].peer, remotes[i].entity);
+ m->getPCU()->Pack(remotes[i].peer, remotes[i].entity);
}
}
m->end(it);
- PCU_Comm_Send();
- while (PCU_Comm_Listen()) {
- int peer = PCU_Comm_Sender();
- while (!PCU_Comm_Unpacked()) {
+ m->getPCU()->Send();
+ while (m->getPCU()->Listen()) {
+ int peer = m->getPCU()->Sender();
+ while (!m->getPCU()->Unpacked()) {
apf::MeshEntity* v;
- PCU_COMM_UNPACK(v);
+ m->getPCU()->Unpack(v);
links[LinkKey(0, peer)].push_back(v);
}
}
- delete shr;
}
/* encode the local links into a big array of integers
@@ -218,7 +216,7 @@ static apf::MeshEntity* getOtherElem(apf::Mesh* m, apf::MeshEntity* elem,
return 0;
apf::Matches matches;
m->getMatches(face, matches);
- int self = PCU_Comm_Self();
+ int self = m->getPCU()->Self();
for (size_t i = 0; i < matches.getSize(); ++i)
if (matches[i].peer == self)
return m->getUpward(matches[i].entity, 0);
@cwsmith Okay I believe it is fixed.
/runtests
Test Result: failure (details)
@cwsmith Okay can you run again, I'm not getting these errors on my machine.
/runtests
Test Result: failure (details)
@cwsmith Could you run tests again?
/runtests
Test Result: success (details)
@cwsmith Todo:
@cwsmith, I also changed the file naming of PCUObj.h and PCU.h, and added explicitly declared data types to the templated add, min, max functions. I'm not seeing in the reviews any other changes we'd talked about making, but let me know if you'd like anything done differently.
/runtests
@jacobmerson
runtests
. That change should merge into master in the morning and you can try `runtests' again.Update:
/lore
didn't appear to support. A support ticket was created to record this issue; 146981./space
to avoid this./runtests
Simmetrix + CGNS Test Result: failure
It looks like adding the C++14 standard into the mix for the cgns build causes some more errors to be emitted: https://github.com/SCOREC/core/actions/runs/10604091991/job/29389889050#step:4:187
@flagdanger looks like that error is just an extra semicolon
@jacobmerson Could you try running tests again?
/runtests
Simmetrix + CGNS Test Result: failure Build Log Simmetrix Test Result: success Simmetrix + CGNS Test Result: failure
/runtests
Build Log Simmetrix Test Result: success Simmetrix + CGNS Test Result: failure
Breaking changes with PCU