PrincetonUniversity / athena

Athena++ radiation GRMHD code and adaptive mesh refinement (AMR) framework
https://www.athena-astro.app
BSD 3-Clause "New" or "Revised" License

inter-block divergence issue when refining under certain circumstances, generic to all coordinates #617

Open bitwalkabout opened 2 weeks ago

bitwalkabout commented 2 weeks ago

The issue involves large magnetic divergence measured in the first row/plane of ghost zones outside the active region at an inter-block boundary. The issue arises in a very specific circumstance: if left meshblock A is higher resolution than right meshblock B, and the refinement boundary moves to the right such that meshblock B is refined to match the resolution of A, the first ghost zones have a large divergence, capable of affecting the evolution. This seems to be true for all three dimensions in 3D, but only when the coarse meshblock side is refined, not when the fine meshblock is restricted.

I first came across this bug because of off-disk issues when running the native Kerr-Schild coordinates with AMR turned on (forced to be axisymmetric). The recent divergence fixes related to curvilinear coordinates fixed other instances of large divergence -- namely, inside the active domain -- but I later saw hydrodynamic discontinuities that looked like divergence-driven errors, so I checked inside the ghost zones and found the divergence there.

I hypothesized a cause that would be generic to flat-spacetime, Cartesian coordinates, so I turned my attention to the 3D Rayleigh-Taylor problem with AMR.

To replicate, consider commit 1591aab84ba7055e5b356a8f069695ea451af8a0, running the RT with the following configuration:

configure.py -b --prob rt --flux hlle --nghost 4 -mpi -hdf5 -h5double

I change the refinement condition in rt.cpp to:

```diff
-  if (drmax > 1.5) return 1;
-  else if (drmax < 1.2) return -1;
+  if (drmax > 1.1) return 1;
+  else if (drmax < 1.05) return -1;
```

and, rather than inspecting every meshblock output individually, to get a rough look at the ghost-zone divergence I calculate the divergence for all active cells with an index offset of -3 (shifting the stencil left into the ghost zones) and write it to a user-defined output variable. This is the only other change to rt.cpp; a minimal sketch of what I mean is below.
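
For reference, a sketch of that diagnostic, assuming the standard user-output hooks (AllocateUserOutputVariables / UserWorkBeforeOutput) and the usual coordinate area/volume helpers; the offset of -3, the variable name, and the omission of the sqrt(bsq)/cell-length normalization used for plotting are all just illustrative choices:

```cpp
// Sketch only: store div(B), evaluated on a stencil shifted by "off" cells in
// x1 so that the inner ghost zones are probed, as a user output variable.
void MeshBlock::InitUserMeshBlockData(ParameterInput *pin) {
  AllocateUserOutputVariables(1);
  SetUserOutputVariableName(0, "divb_offset");  // illustrative name
}

void MeshBlock::UserWorkBeforeOutput(ParameterInput *pin) {
  const int off = -3;  // shift the stencil toward the inner-x1 ghost zones
  const int nc1 = block_size.nx1 + 2*NGHOST;
  AthenaArray<Real> a1, a2m, a2p, a3m, a3p, vol;
  a1.NewAthenaArray(nc1+1);
  a2m.NewAthenaArray(nc1); a2p.NewAthenaArray(nc1);
  a3m.NewAthenaArray(nc1); a3p.NewAthenaArray(nc1);
  vol.NewAthenaArray(nc1);
  FaceField &b = pfield->b;
  for (int k=ks; k<=ke; ++k) {
    for (int j=js; j<=je; ++j) {
      pcoord->Face1Area(k, j, is+off, ie+off+1, a1);
      pcoord->Face2Area(k, j,   is+off, ie+off, a2m);
      pcoord->Face2Area(k, j+1, is+off, ie+off, a2p);
      pcoord->Face3Area(k,   j, is+off, ie+off, a3m);
      pcoord->Face3Area(k+1, j, is+off, ie+off, a3p);
      pcoord->CellVolume(k, j, is+off, ie+off, vol);
      for (int i=is; i<=ie; ++i) {
        const int ii = i + off;  // shifted cell index
        const Real divb =
              a1(ii+1)*b.x1f(k,j,ii+1) - a1(ii)*b.x1f(k,j,ii)
            + a2p(ii)*b.x2f(k,j+1,ii) - a2m(ii)*b.x2f(k,j,ii)
            + a3p(ii)*b.x3f(k+1,j,ii) - a3m(ii)*b.x3f(k,j,ii);
        user_out_var(0,k,j,i) = divb/vol(ii);
      }
    }
  }
}
```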

I use athinput.rt3d with only a few changes:

- I output to HDF5
- xorder -> 3
- for speed I reduced the resolution to nx1 = 36, nx2 = 6, nx3 = 36
- I turned on mesh refinement:
  refinement = adaptive
  numlevel = 3
  derefine_count = 5
- <meshblock>
  nx1 = 6
  nx2 = 6
  nx3 = 6
- also changed:
  iprob = 3
  amp = 0.5
  drat = 1.4
  b0 = 0.0001
  angle = 30.

The Kerr-Schild issue appears as lines of divergence (plotted as log10 of |div B|, normalized in the usual way by sqrt(bsq) and cell volumes/lengths): [image]

A cut through the 6th cell (so really the 3rd active cell) in the x-z plane of the output for the 3D RT problem shows similar behavior: [image]. I can't yet account for the alternating nature of the divergence in this case; given the AMR tree structure, I'd guess it's related to that, but I am not sure.

My hypothesis for what is happening is the following:

On the coarse side (the meshblock B I was describing before), prolongation happens at the faces of cells. Even in KS, with the updates from a few months back, the total magnetic flux summed/integrated across the coarse cell face, or across the four new fine faces, is the same. However, at the single face shared between A and B, the prolongation on B will produce four values that differ from the four already stored on the fine faces of A. These values should be identical: the logic of the constrained-transport code relies on the assumption that the two meshblocks (and their respective ranks) hold copies of the same shared-face value, which is then changed identically through EMFs that the flux-correction protocols guarantee to be the same.
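
To spell the constraint out in my own notation (nothing here is taken from the code): let A_c, B_c be the area and prolongated field of B's coarse face at the shared boundary, and A_f^(i), B_f^(i), i = 1..4, the areas and fields of the four fine faces produced there.

```latex
% Flux conservation across the shared face (prolongation already guarantees this):
\[ \sum_{i=1}^{4} A_f^{(i)}\, B_f^{(i)} \;=\; A_c\, B_c \]
% The additional requirement for consistent constrained transport, which is the
% one that breaks at the refinement step described above:
\[ B_f^{(i)}\big|_{\mathrm{block\ B}} \;=\; B_f^{(i)}\big|_{\mathrm{block\ A}},
   \qquad i = 1,\dots,4 \]
```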

However, because refinement broke this assumption, divergence exists in the first ghost zones and the evolution will no longer maintain consistent CT. Because the prolongation is fairly faithful to the magnetic structure, the values are not absurdly off, so the hydrodynamic effect can be subtle -- unless (as I see in the accretion disk case) the refinement embeds a divergence associated with large fields in a high-density region, and the flow then moves so that the same divergence sits in a region of much lower true hydrodynamic pressure. I have a video of this effect on the hydrodynamic evolution if it's of interest -- it shows a discontinuity in density and other quantities at the location where the refinement boundary used to be.

The fix seems conceptually simple -- use the copies of the four face B-field values that B can receive and, once, upon refinement, overwrite the four prolongated values. A few other face fluxes shared between those newly refined edge cells would probably also need to be adjusted to accommodate the overwritten values and maintain the constraint.

I tried to implement such a fix, but the communication of face magnetic field values, as it interweaves with the AMR code, was just too complex for me to quickly get a handle on. @tomidakn, perhaps you can point me to the changes necessary for a quick test of this idea? Or perhaps I'm misunderstanding some of the shared-face B-field update procedure.

It appears to me that the refinement happens at the end of the main loop, and I couldn't find an overwrite of the new face fields like this in the code (of course, maybe I missed it), which would support the idea that this is the cause of the issue.

I compile and run with Intel 20.0.4 and MVAPICH2; I can get specific versions if necessary. Some flags I compiled with for the cases above (though the issue is generic to at least some other flag choices, such as removing -O3):

CXXFLAGS := -O3 -std=c++11 -fp-model=strict -qno-openmp-simd
bitwalkabout commented 2 weeks ago

Just an additional comment. Because the divergence isn't too bad, in tests where the average flow properties don't trend downward much in time (which is perhaps all of them), I think one might still get passing convergence tests, etc. One might not even notice it without looking at the divergence in the ghost zones themselves. Though, as I said, I still can't explain why it appears only in the ghost zones for the KS test but alternates between the first ghost zones and active zones in the Cartesian RT test.

bitwalkabout commented 2 weeks ago

I think I've convinced myself that after overwriting the new face-centered values with the previously active block A values, one then has to solve for and overwrite the newly prolongated face-centered fluxes on the meshblock B faces shared between each group of four newly refined cells. Four variables, four constraint equations (see the sketch below). Does the function ProlongateSharedFieldX[1,2,3] have access to the shared inter-block face values from block A? I'm unsure which variable these would be stored in at this point in the code's evolution and how to access it from this function.
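
To make the counting explicit (my own sketch of the geometry, not anything taken from the code): for an x1 refinement boundary, the four first-layer fine cells on B that touch the overwritten coarse face share four interior transverse faces among themselves (two x2-faces and two x3-faces). Imposing the divergence constraint in each of those cells gives one equation per cell, with those interior faces as the unknowns:

```latex
% One divergence constraint per first-layer fine cell c = 1,...,4 adjacent to
% the overwritten shared face (s_f = +1/-1 for outward/inward orientation):
\[ \sum_{f \in \partial c} s_f\, A_f\, B_f \;=\; 0, \qquad c = 1,\dots,4 \]
% The overwritten shared x1-face values and the remaining prolongated outer
% faces enter as known data; the four interior transverse faces of the 2x2
% group are the unknowns to be solved for and overwritten.
```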

tomidakn commented 2 weeks ago

Dear @bitwalkabout

Thank you very much for catching this and for the detailed report; it was very helpful. After a quick inspection, I agree that this is a serious issue and must be fixed.

We need to transfer and set the shared face fields from neighbor MeshBlocks that are already refined before ProlongateInternalField. While it is conceptually straightforward, it is a little tricky to implement because the communication infrastructure is being reconstructed at this point in the code. I have an idea, but I am currently quite busy. Can someone help me if I give directions on how to do it?

bitwalkabout commented 2 weeks ago

@tomidakn thanks for taking a look and responding so quickly!

I am not sure which would be more computationally efficient, but in lieu of changing the code to transfer the field values on the shared face whenever refinement in that direction occurs, one could, I think, reconstruct the shared-face values using each cell's other faces. If I understand correctly, this could be done using the face information already exchanged (though perhaps the ordering of the block's refinement and of the exchange are out of step in a way that would be hard to circumvent). To ensure exact value duplication on both blocks, the shared face could be reconstructed on both blocks; something like the helper sketched below is what I have in mind.
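
An illustrative stand-alone helper, not code from the repo (`double` stands in for the code's Real, and the labeling assumes the shared face is a cell's inner-x1 face):

```cpp
// Sketch: given a cell's other five face areas and fields, solve the
// finite-volume div(B) = 0 constraint for the remaining inner-x1 face field.
// Evaluating this identically on both blocks reproduces the same shared value.
double ReconstructInnerX1Face(
    double a1m,                                      // area of the unknown inner-x1 face
    double a1p, double b1p,                          // outer-x1 face area and field
    double a2m, double b2m, double a2p, double b2p,  // inner/outer x2 faces
    double a3m, double b3m, double a3p, double b3p)  // inner/outer x3 faces
{
  // 0 = (a1p*b1p - a1m*b1m) + (a2p*b2p - a2m*b2m) + (a3p*b3p - a3m*b3m)
  return (a1p*b1p + (a2p*b2p - a2m*b2m) + (a3p*b3p - a3m*b3m)) / a1m;
}
```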

Certainly this isn't as clean as sending those extra face values between blocks, but it might be quicker to implement in the version currently in the master branch, and then the better solution could be implemented along with the other changes for the new communication updates you mentioned?

tomidakn commented 2 weeks ago

In the current implementation, MeshBlock B does not have the four shared-face fields from MeshBlock A before refinement, and after refinement (now that they are at the same resolution) the shared fields are not transferred between them. So we need to change the communication anyway.

Also, when we change the shared-face fields, we have to recalculate the other face fields of the first active cells in order to satisfy the divergence condition in the active cells.

bitwalkabout commented 2 weeks ago

Oh, I see: before the timestep involving AMR refinement, during each inter-block communication call, restriction takes place on MeshBlock A before filling the comm buffer to send to B? I was misremembering, thinking that the restriction was done on the receiving MeshBlock B.

If the latter were the case, to clarify what I meant before: though the four shared-face fields are not transferred, the other 16 unique face-field values of those four ghost cells (not shared with block B but shared amongst block B's ghost zones) are communicated -- or, at least, the infrastructure seems to be in place to perform an inter-block communication to fill those values with the new ones from the timestep at whose end this refinement occurs. Those 16 values could be used to reconstruct the currently uncommunicated face values shared between A and B. This reconstruction could be done on both block A and block B so that the same reconstructed value is used going forward. Certainly not as clean as adding the new communication, but potentially a quick fix for the older version of the code.

As you said, if restriction happens before the send, then I agree there is no other way than to add a means of communicating the extra data, and then to update the first active cells so that the new prolongated values preserve divB for those cells.

tomidakn commented 2 weeks ago

Correct, the shared field on A is first restricted and then transferred to B (to save the data size).

I think your suggestion may work, but I am still concerned that we may need to apply the correction carefully when edge and corner neighbors are refined.

My idea is to transfer and set the shared fine fields before ProlongateInternalField during the mesh refinement process, so that we do not have to do the calculation twice.

tomidakn commented 1 week ago

I spent my weekend on this issue and I think it now works. I pushed it to the amrfix branch. As I am currently very busy, I cannot cover many test cases, so @bitwalkabout can you try it? I expect the latest commit, which uses unordered_map (a hash map), to be faster, but I would appreciate it if you could try the older one as well.

First, it identifies pairs of contacting MeshBlocks, one on the finer level and one on the coarser level that is about to be refined. Then, it sends the shared face fields from the finer MeshBlock to the coarser one. On the coarser MeshBlock, it uses the received fields for the shared face instead of the ones prolongated from the coarser level.
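
In outline (a pseudocode summary of the above; the step names are illustrative, and the actual code lives in the refinement path of src/mesh/amr_loadbalance.cpp on the amrfix branch):

```cpp
// Rough outline of the fix as described above -- not the literal code.
// Within Mesh::RedistributeAndRefineMeshBlocks, before ProlongateInternalField:
//
// 1. Identify pairs: for every MeshBlock about to be refined, find face
//    neighbors already on the finer level and record
//    (fine sender, coarse block to be refined, shared face).
// 2. Send: the fine MeshBlock packs its face fields on the shared face and
//    sends them (MPI, or a local copy if both live on the same rank).
// 3. Receive and overwrite: on the newly refined MeshBlock, use the received
//    fine-level values for the shared face instead of the ones prolongated
//    from the coarse level, so both sides hold identical data, then fill the
//    remaining interior faces with ProlongateInternalField.
```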

@jmstone @pgrete @jdolence I would like to draw your attention, because this probably affects AthenaK, AthenaPK and KAthena.

bitwalkabout commented 1 week ago

That's great, thank you so much! I will perform some test runs as soon as I can -- possibly Monday, but I'm helping my girlfriend move this weekend, so I can't look before then.

tomidakn commented 1 week ago

No problem, thank you for your help.

pgrete commented 1 week ago

Thanks for the heads-up.

@lroberts36 does this still apply to Parthenon given the ownership model?

lroberts36 commented 1 week ago

@pgrete, I believe Parthenon should not have this problem. During refinement, we prolongate from the old, coarse block to the shared faces of the new fine block. Then we communicate boundaries and faces shared between blocks, assuming that the fine blocks that already exist "own" the shared data. After that, prolongation occurs on faces that were not shared with the coarse block.

IIRC, @bprather ran into a similar (maybe the same) issue when testing out the face field implementation in Parthenon. We had to modify the ownership model slightly to make things work (i.e. giving precedence to pre-existing blocks at a given level, but only during post-refinement communication).

tomidakn commented 1 week ago

@lroberts36 @pgrete It sounds like Parthenon has already taken care of this issue.

pgrete commented 1 week ago

Great, thanks for confirming @lroberts36 Now we just need to implement CT in AthenaPK ;)

bitwalkabout commented 1 week ago

@tomidakn I've found some time to start testing (working with the RT setup first) and there do seem to be remaining issues. Both the third (newest) commit and the first fail after running for a while (maybe a memory de/allocation or MPI comm-buffer allocation error? just a guess). Though the divergence is now fine under most of the circumstances that were originally the issue, there is still divergence showing up -- eventually -- when the fixup is applied. My guess is that the logic of the fix is sound, but a small bug is causing both the runtime error and, on occasion or after a certain number of cycles, breaking the new flux fixes. I'm trying to find out more details about where it's failing and will share if I find anything telling.

tomidakn commented 1 week ago

OK, I will take a look, but it may take a while as I am very busy right now. I tested only a few timesteps, so the long-term behavior was not tested.

Can you share your test set up?

bitwalkabout commented 1 week ago

The test setup is the same as the 3D RT setup I described in my first post, but I have changed drat to 2 so the evolution is a bit quicker, and I am currently looking at only one AMR refinement level above level 0.

tomidakn commented 1 week ago

I ran valgrind and no memory error was detected for a serial run.

Can you explain how it crashed? Was it an MPI error, segmentation fault or something else? Also I'd appreciate if you could explain under what situation and where the divB error appeared.

bitwalkabout commented 1 week ago

@tomidakn I've included some responses below:

Unfortunately, I'm not getting a consistent error with every run. For xorder = 3 with the 3D RT setup, your first commit, running with MPI, with the field rotation set to 90 degrees and reflecting boundary conditions (since, as I understand it, your first commit isn't compatible with periodic boundaries?), and otherwise the RT setup described just above, I get the following error message upon the crash:

[atl1-1-02-017-29-1.pace.gatech.edu:mpi_rank_32][MPIDI_CH3_Abort] Fatal error in MPI_Irecv: Invalid rank, error stack: MPI_Irecv(153): MPI_Irecv(buf=0x36f9f30, count=36, MPI_DOUBLE, src=113, tag=1057, MPI_COMM_WORLD, request=0x2c6cda0) failed MPI_Irecv(98).: Invalid rank has value 113 but must be nonnegative and less than 36 : No such file or directory (2)

In the same setup, but with xorder = 2 and two meshblocks in the 2nd dimension rather than one, it fails on the check I added that pmb is not null, which concerned me based on some gdb backtrace results I was getting:

```cpp
// Step FFC4. Pack and send shared face fields
for (FaceFieldCorrection &f : ffcsend) {
  MeshBlock *pmb = FindMeshBlock(f.from);
  int p = 0;
  if (pmb == nullptr) {
    std::cerr << "Error: MeshBlock not found for f.from" << std::endl;
    continue;  // Skip this iteration
  }
  for (FaceField &var_fc : pmb->varsfc) {
    switch (f.face) {
      case BoundaryFace::inner_x1:
        BufferUtility::PackData(var_fc.x1f, f.buf, pmb->is, pmb->is, ...
```

Regarding the divergence -- it now seems to occur only on x1 and x2 faces (at least for this problem setup), or possibly to travel in those directions along faces from block corners during prolongation, and it happens along block faces shared between blocks when both are refined. In other words, when the refinement boundary moves in the x3 direction, extending to include two newly refined blocks that share an x1 or x2 face, there is divergence along that face.

Could you remind me how shared faces between blocks that are both newly refined are guaranteed to be the same? Either that is also a problematic area that needs to be similarly addressed with its own fixup routine, or I've simply forgotten the step where those are handled from when I was going through that part of the code.

tomidakn commented 1 week ago

The algorithm does not depend on xorder, so let us focus on xorder = 2. Also, let's focus on the latest commit and put aside the divergence issue for now; the MPI issue is more fundamental and critical. However, I could not reproduce your issue with the RT problem even after 2000 steps on my computer, using 36 ranks and periodic boundaries.

By the way, it is crucially important to keep all the optimization options (particularly -ipo for Intel); otherwise it can mess up vectorization. If you want to try something conservative, please configure the code with the -d option.

bitwalkabout commented 1 week ago

@tomidakn I typically notice the problem arise closer to 10k steps, or even a few times that -- this is for the case of the first commit, non-periodic boundaries, etc., that I described running above. Up to that time, aside from the divB issue, there was no apparent MPI issue. Maybe just try running longer. If you got to 50k steps with my small adjustments to the RT setup -- including the block sizes, total resolution, and physical parameters that set the rate at which the refinement levels move -- then I'd start to suspect machine differences. Tomorrow I will compile and run on a TACC machine to make sure the MPI issue isn't something local to our cluster at GT.

Regarding the optimizations: using the configure command in my original post and the configure.py file in these commits, no -ipo flag is set when the -mpi flag is used (compiling with mpicxx). Do you have something different? The other flags I run with, -fp-model=strict and -qno-openmp-simd, come from reading through prior issues opened by others and from my own testing, because I got non-deterministic behavior at runtime if I left all the default vectorization and OpenMP flags as-is. I can try running with the -ipo flag added, though, and will report back.

I will also try the latest commit again, and report anything I can from running with gdb.

tomidakn commented 6 days ago

-fp-model=strict -qno-openmp-simd should be fine; I just wanted to emphasize that correct compiler options are important.

bitwalkabout commented 5 days ago

@tomidakn Were you able to replicate the failure by running that setup longer?

bitwalkabout commented 5 days ago

@tomidakn And any others who might be working on the issue:

I have replicated this on Stampede3 with newer Intel compilers. Whether I run with the most recent commit on the amrfix branch or the first, I get the error at ~7000 steps (coarse resolution 36x6x36, with meshblocks of 6x6x6 for the most recent tests) and the same backtrace, down to the same line (backtrace results on the master rank):

```
(gdb) bt
#0  0x00000000005d1a54 in __gnu_cxx::__normal_iterator<std::reference_wrapper*, std::vector<std::reference_wrapper, std::allocator<std::reference_wrapper > > >::__normal_iterator (this=0x7fffffff7578, i=)
    at /opt/apps/gcc/13.2.0/lib/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/stl_iterator.h:1077
#1  0x00000000005cf9b9 in std::vector<std::reference_wrapper, std::allocator<std::reference_wrapper > >::begin (this=0x1e8)
    at /opt/apps/gcc/13.2.0/lib/gcc/x86_64-pc-linux-gnu/13.2.0/../../../../include/c++/13.2.0/bits/stl_vector.h:871
#2  0x00000000005cc21d in Mesh::PrepareAndSendFaceFieldCorrection (this=0x97a830, newloc=0x3a7ccf0, ranklist=0x34b1510, newrank=0x3719ed0, nslist=0x97eee0, nbtold=190) at src/mesh/amr_loadbalance.cpp:1397
#3  0x00000000005c8ff1 in Mesh::RedistributeAndRefineMeshBlocks (this=0x97a830, pin=0x976a40, ntot=204) at src/mesh/amr_loadbalance.cpp:551
#4  0x00000000005c6ae9 in Mesh::LoadBalancingAndAdaptiveMeshRefinement (this=0x97a830, pin=0x976a40) at src/mesh/amr_loadbalance.cpp:56
#5  0x00000000004ceb8d in main (argc=5, argv=0x7fffffff83a8) at src/main.cpp:539
```

If pmb is null, which it seems to be on occasion, then the for loop fails. It seems unlikely that this would be consistent with the logic of the new code additions, but I will try adding an exception to skip these portions when FindMeshBlock() assigns no meshblock to pmb here. Possibly extraneous info: if I do a restart with a different number of tasks right before the failure, the failure does not happen at the same time but is often delayed.

Without sitting down over Zoom and at least having a quick discussion about the design of the new additions, that's about the most I can do at this point, but I will keep trying, since this is holding up a main project line we have at Georgia Tech. I am happy to find a time to meet with anyone available (whatever time is convenient, regardless of timezone). Thanks again for your help!

tomidakn commented 5 days ago

OK. Somehow I could not reproduce your issue as all the MeshBlocks are derefined to the root level after ~2000 steps. Can you share your input and pgen files?

It sounds like a resource management issue, probably related to MPI.

bitwalkabout commented 5 days ago

@tomidakn The two requested files can be found here:

https://drive.google.com/drive/folders/1GKC3KmQWqOjxPW0hMkbJmheC8ro0Kmwy?usp=sharing

Just to double check: had you run with the athinput file changes I mentioned above? I'm pretty sure that with the default physical setup, default perturbation, and B-field strength, I also got early evolution (at this relatively low resolution) in which the contrasting densities mixed enough at the interface that the higher refinement level dropped away. Hence I made a few changes to diminish the B-field stabilization of high-k modes, increase the growth rate, and increase the density contrast. Those are included, of course, in the linked files. Also included in the pgen file are the user-output additions for the offset divB output variable.

tomidakn commented 5 days ago

I thought I had changed them, but maybe I did not. I am now using your files.

tomidakn commented 5 days ago

OK, I found it. The hashmap was not cleared correctly, and the wrong MeshBlocks were identified in certain situations. I only checked that it runs without crashing, so can you test it quantitatively? I hope this also fixes the divergence error.

I've got an idea to improve the AMR performance a little bit, so I will update it when I have time.

bitwalkabout commented 5 days ago

@tomidakn Great! I don't want to jinx it, but a quick repeat of the test I've been running (with two different numbers of cores) is now well past the prior point of failure and continuing to run. Also, the divergence issue I saw before is not present, despite most of the domain having been refined by a level. I think we've got it!

I'll do a more thorough test suite later today, with more levels of refinement and test that it works in the original problematic context of GR, but I wanted to get a quick reply to you.

tomidakn commented 5 days ago

Great, please keep me updated.

bitwalkabout commented 3 days ago

Divergence and the MPI failure from before do seem to be fixed, both in the RT testing setup and in the Kerr-Schild disk problem. However, now that I apply mesh refinement during a restart (i.e., I have a changed refinement criterion upon restart), in the newly refined regions I get pixels, somewhat randomly distributed and correlated with low-density (or higher bsq/rho, or lower plasma beta) structures, that seem to be reset to the floor incorrectly. So it seems that either the interpolation of primitives (at least density; I'm checking others) has large errors in some cells, or some cells aren't being filled correctly and the floors are kicking in.

However, this is a restart of an rst file produced by a slightly older commit of the master branch, run with the newest master branch commits. Has backward compatibility been tested?

I'm going to try now doing restart tests (new commit -> new commit and old commit->new commit) in the RT context to see if the issue appears there as well.

I did notice that upon restart, when the refinement occurs, I was not getting the normal notification in the output associated with large AMR step changes -- 'might have to adjust to new meshblock count', or however it's exactly phrased. I'm not sure whether this is indicative in a helpful way or not.

bitwalkabout commented 3 days ago

I haven't double-checked the ordering yet, but depending on the step at which the reconstruction of face-centered velocities, and thus the associated fluxes, takes place, is that going to have issues similar to the divB issue now solved? The velocity primitive values in the problematic cells during restart are 0 in the dump after the first step following refinement (even now that I've set things up so refinement happens several steps after restart). If the fluxes are far off for the first step after refinement, that could explain why the density is off. At least, that seems consistent with what I'm seeing.

tomidakn commented 3 days ago

I'm sorry, but I do not get your point. Please provide figures and describe how to reproduce the issue. If it happens only with GR, @c-white knows this better than I do.

While we have not changed the restart format for a long time, we do not guarantee compatibility between versions.

In the initialization from a restart file, the code does not perform refinement (assuming that refinement was already processed before the end of the previous run). So if you change the refinement condition and restart, refinement should happen only after the first step.

Athena++ does not solve face-centered velocities as variables; they are simply reconstructed from cell-centered variables in the reconstruction step before calling a Riemann solver. So as long as the cell-centered variables are correctly refined, they should be fine. Note that a restart file does not contain primitive variables for non-GR runs; primitives are calculated from conservatives during initialization. For GR runs, primitive variables are loaded from the restart file because GR requires an initial guess for the variable inversion.