enzo-project / enzo-e

A version of Enzo designed for exascale and built on charm++.

New-Style Restart Broken For Unigrid HydroSolvers #316

Open mabruzzo opened 1 year ago

mabruzzo commented 1 year ago

Last week, I started working on creating new versions of the checkpoint restart tests for the automated testing infrastructure, and I was focusing my attention on the new checkpoint-restart infrastructure.

It appears that there are a number of bugs related to Checkpoint-Restart with Unigrid Hydro Simulations. I have attached 2 simplified examples (1 of them uses PPM and the other uses VL+CT).

  1. test_vl_restart.tar.gz
  2. test_ppm_restart.tar.gz

To run either example, untar the file, cd into the untarred directory, edit the CHARMRUN and ENZOE_BIN paths at the top of run_example.sh, and then call bash run_example.sh.

There are a few problems:

  1. If you try to run this on the main branch, the restart fails with an error.
  2. If you try to run this on the branch with the changes from PR #313, with Method:check:include_ghosts set to true, the restart fails with a similar error.
  3. The restart only runs to completion if you use the branch with the changes from PR #313, with Method:check:include_ghosts set to false. Even then, the restart reads the field data back incorrectly. You can see this by invoking python3 check_file.py (included in each example), which shows that the active zone of the density field on one of the blocks is completely wrong after it is read back in.
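
For readers without the attachments, the kind of check described in item 3 can be sketched as follows. This is not the attached check_file.py (which is not reproduced here); it is a minimal illustration that assumes a field has already been loaded into a 3D nested list, and the names `active_zone`, `first_mismatch`, and `ghost_depth` are hypothetical.

```python
def active_zone(field, ghost_depth):
    """Return a copy of a 3D nested list with ghost_depth cells
    stripped from both ends of every axis (hypothetical helper)."""
    g = ghost_depth
    if g == 0:
        return [[list(row) for row in plane] for plane in field]
    return [[list(row[g:-g]) for row in plane[g:-g]] for plane in field[g:-g]]


def first_mismatch(before, after, ghost_depth=0, tol=0.0):
    """Compare the active zones of two fields of identical shape.

    Returns the active-zone index (i, j, k) of the first cell that
    differs by more than tol, or None if the active zones agree.
    Shape checking is omitted to keep the sketch short.
    """
    a = active_zone(before, ghost_depth)
    b = active_zone(after, ghost_depth)
    for i, (plane_a, plane_b) in enumerate(zip(a, b)):
        for j, (row_a, row_b) in enumerate(zip(plane_a, plane_b)):
            for k, (va, vb) in enumerate(zip(row_a, row_b)):
                if abs(va - vb) > tol:
                    return (i, j, k)
    return None
```

Passing the density field from the checkpoint as `before` and the same field after restart as `after` would return None only if every active-zone cell survived the round trip; differences confined to ghost cells are ignored.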

As an aside: the example checkpoint-restart files, input/Checkpoint/test_cosmo-check.in and input/Checkpoint/test_cosmo-restart.in, don't actually use a hydro solver, so I'm not sure whether this problem also occurs in that case.

@jobordner am I doing anything obviously wrong?

jobordner commented 1 year ago

I'm guessing it's related to using non-cubical blocks (! (nx == ny == nz)). I didn't test that case, and HDF5 flips the axes around, which could be related to the problem. (If blocks are cubical it should work fine: I've been using it at scale with PPM+Grackle+Gravity and restarting with different processor counts with no apparent issues.) I'll look into it.

jobordner commented 1 year ago

Checkpoint relies on "order_morton", which in turn (currently) requires a cubical blocking since it coarsens to negative levels and assumes it bottoms out at a single block.
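
To illustrate why a non-cubical blocking can break this assumption, here is a toy model. It is not Enzo-E's order_morton code; it assumes each negative level merges 2x2x2 groups of blocks, so every axis of the root-block array must halve at the same time, and the function name `levels_to_single_block` is hypothetical.

```python
def levels_to_single_block(root_blocks):
    """Toy model of coarsening a root-block array to negative levels.

    root_blocks is the number of root blocks along each axis.
    Assumption (not taken from Enzo-E): each coarsening step halves
    every axis at once. Returns the number of steps needed to reach a
    single block, or None if some axis cannot be halved while another
    axis still holds more than one block.
    """
    dims = list(root_blocks)
    if any(n < 1 for n in dims):
        raise ValueError("each axis needs at least one block")
    levels = 0
    while any(n > 1 for n in dims):
        if any(n % 2 != 0 for n in dims):
            # Some axis is odd (possibly already 1) while another axis
            # still has more than one block, so the uniform halving
            # cannot continue down to a single block.
            return None
        dims = [n // 2 for n in dims]
        levels += 1
    return levels


print(levels_to_single_block((4, 4, 4)))  # 2
print(levels_to_single_block((4, 2, 1)))  # None
```

In this toy model a 4x4x4 blocking reaches a single block after two coarsening steps, while a 4x2x1 blocking never does, which is the kind of configuration the restart examples would hit under the stated assumption.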

mabruzzo commented 1 year ago

Great! Thanks for figuring that out!

I'm going to leave this issue open since this should probably be addressed before the version 2 release.