SCOREC / core

parallel finite element unstructured meshes

Verify Failure with Multiphase Case #64

Open KennethEJansen opened 7 years ago

KennethEJansen commented 7 years ago

verifyField does not survive multiphase inputs. When the call from verify is commented out, the case goes through fine. An example illustrating the problem can be found at http://fluid.colorado.edu/~kjansen/Weblink/ChefTestMultiphase.tgz

ibaned commented 7 years ago

@seegyoung is responsible for verifyField (if we are thinking of the same function). She should probably debug this, possibly with help from @cwsmith to compile and run. Also, please describe what "multiphase inputs" means.

ibaned commented 7 years ago

The tarball that @KennethEJansen provides should be turned into a nightly test. That's the only way to prevent it from breaking. Unless it's big and slow, make it a nightly test ASAP.

KennethEJansen commented 7 years ago

Multiphase inputs simply means 5 flow variables plus 2 scalars. What is perhaps more of an issue is that the fields are on a matched mesh with periodicity (that is, the mesh matches and the fields have a matched condition), and it seems that Verify does not handle that properly.
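
For concreteness, here is a minimal sketch of the kind of field setup involved; the file names, field names, and exact calls below are illustrative, not the actual Chef code:

    #include <mpi.h>
    #include <PCU.h>
    #include <gmi_mesh.h>
    #include <apf.h>
    #include <apfMDS.h>
    #include <apfMesh2.h>

    int main(int argc, char** argv)
    {
      MPI_Init(&argc, &argv);
      PCU_Comm_Init();
      gmi_register_mesh();
      // load a matched (periodic) mesh; these paths are placeholders
      apf::Mesh2* m = apf::loadMdsMesh("model.dmg", "matched_mesh.smb");
      // 5 flow variables packed into one field, plus 2 scalar fields
      apf::Field* flow = apf::createPackedField(m, "solution", 5);
      apf::Field* s1 = apf::createFieldOn(m, "scalar_1", apf::SCALAR);
      apf::Field* s2 = apf::createFieldOn(m, "scalar_2", apf::SCALAR);
      apf::zeroField(flow);
      apf::zeroField(s1);
      apf::zeroField(s2);
      apf::verify(m); // this is the call that fails on the matched mesh
      m->destroyNative();
      apf::destroyMesh(m);
      PCU_Comm_Free();
      MPI_Finalize();
      return 0;
    }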



ibaned commented 7 years ago

Yeah, I'm willing to bet the real trouble lies with periodicity. Still, this sounds like stuff that should work fine, so it just needs debugging.

KennethEJansen commented 7 years ago

That was my hope. My apologies for not doing this when you first asked for such a case to protect against this sort of breakage.

The case is quite small. 28k tetrahedra. It runs in seconds.



seegyoung commented 7 years ago

Thanks for the report. I will take a look and resolve the issue.

seegyoung commented 7 years ago

Removed field data check from apf::Verify as it crashes if there's no field data to exchange between remote/ghost copies. Instead, added pumi_field_verify in pumi.h.
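
A rough usage sketch follows; the signatures of pumi_mesh_verify and pumi_field_verify shown here are my assumptions, so check pumi.h for the actual interface:

    #include <pumi.h>

    // sketch only: verify field data explicitly, now that apf::verify
    // no longer does it (both call signatures here are assumed; see pumi.h)
    void check_mesh_and_fields(pMesh m, pField f)
    {
      pumi_mesh_verify(m);     // topology/matching checks
      pumi_field_verify(m, f); // field data check moved out of apf::verify
    }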

KennethEJansen commented 7 years ago

Thank you, Seegyoung. I will pull a fresh version of core down and rebuild to see if I can stop using my hacked code that bypassed verify (and also getInterfaceBlocks, which was addressed earlier). I doubt this has interplay with the ticket I started last night, but of course it would be nice if it resolved that too.



seegyoung commented 7 years ago

You are welcome. The issue with mesh verification should have nothing to do with your new issue.

KennethEJansen commented 7 years ago

It looks like this “fix” creates a new problem for cases that do partitioning from serial with matching.

I tried this out on issue #68, which starts from a serial mesh with matching and partitions it to 4 parts (something that worked previously). It now crashes, and from the call stack this is what I find: rank 0 gets to

verify->verifyFields->sendFieldData->getSharing->MatchedSharing->apf::MatchedSharing

Cameron confirms that sendFieldData is called inside a mesh entity iterator in verifyFields, and this ends up calling PCU_Comm_Begin(), which should not happen.
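
For reference, my understanding of the pattern PCU expects is sketched below (this is illustrative, not the actual verifyFields code, and the helper name is made up): one PCU_Comm_Begin/PCU_Comm_Send phase should wrap the whole entity loop rather than a new phase being started per entity.

    #include <PCU.h>
    #include <apfMesh.h>

    // sketch of the expected PCU phase structure (not the actual apf code):
    // one Begin/Send/Receive phase wraps the whole entity loop
    static void exchangeVertexData(apf::Mesh* m)
    {
      PCU_Comm_Begin();                  // start ONE communication phase
      apf::MeshIterator* it = m->begin(0);
      apf::MeshEntity* e;
      while ((e = m->iterate(it))) {
        // pack field data for shared copies of e here;
        // no PCU_Comm_Begin calls inside this loop
      }
      m->end(it);
      PCU_Comm_Send();                   // finish packing, start the exchange
      while (PCU_Comm_Receive()) {
        // unpack and compare the remote field data here
      }
    }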

(Screenshots verifying this call path were attached but are not reproduced here.)

Below is more detail, which you can read or skip, but it contains my understanding of the situation:

The call chain above reaches line 917 of apfMesh.cc and calls PCU_Comm_Begin() on rank 0, but the other ranks are still inside SwitchToAll / PCU_Switch_Comm, so I get an error.

Were you expecting that this verify patch, which goes down the getSharing path, would never be called for partitioning cases? If getSharing is trying to find the peer-to-peer sharing of the mesh, I am wondering whether it should be bypassed when there is currently only one part (since there can be no sharing), as in the sketch below.
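
Something like the guard below is what I have in mind; it is only a sketch of the suggested bypass (the wrapper name is made up), not a tested patch:

    #include <PCU.h>
    #include <apfMesh.h>

    void verifyFields(apf::Mesh* m); // declared for this sketch; the real one lives in apfVerify.cc

    // sketch of the suggested bypass: skip the field data exchange while the
    // mesh still lives on a single part, since nothing can be shared yet
    static void verifyFieldsIfParallel(apf::Mesh* m)
    {
      if (PCU_Comm_Peers() == 1)
        return; // serial (pre-partitioning) mesh: no remote copies to compare
      verifyFields(m); // the existing check
    }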

To be clear, my hack, which bypassed verifyFields, i.e.,

    void verify(Mesh* m, bool abort_on_error)
    {
      double t0 = PCU_Time();
      verifyTags(m);
      //verifyFields(m);
      verifyNumberings(m);

runs through fine (I suppose that is obvious, since it bypasses the issue), and those are the results I reported in issue #68. Of course, if there is an issue with the matching fields that influences issue #68, it would be great NOT to use this bypass, but from this test it does not look like verifyFields works correctly for cases with partitioning (e.g., when some ranks are idled while the pre-partitioned communicator, serial in this case, does the work prior to partitioning to the final size).



seegyoung commented 7 years ago

I assume you are using the master branch. I updated the develop branch so it doesn't verify the field data and numbering (https://github.com/SCOREC/core/blob/develop/apf/apfVerify.cc#L799). We developers are not supposed to modify the master branch directly.

The develop branch will be merged into the master branch after it goes through the daily regression testing at 5 pm. Please turn off "verifyFields(m)" in the meantime.

seegyoung commented 7 years ago

I will open a new issue ticket, "failing field data verification", for meshes with matchedSharing. Calling PCU_Comm_Begin inside the matchedSharing operation causes a crash inside the verification.