PrincetonUniversity / athena

Athena++ radiation GRMHD code and adaptive mesh refinement (AMR) framework
https://www.athena-astro.app
BSD 3-Clause "New" or "Revised" License
226 stars 122 forks source link

Fixing the restart error reported in #449 #453

Closed tomidakn closed 2 years ago

tomidakn commented 2 years ago

This is a patch to fix the error in restart reported in #449. The original implementation loads all MeshBlocks at once from a restart file. This version reads the data one by one in order to avoid hitting the 2GB limit of MPI_File_read.

Prerequisite checklist

Please review the CONTRIBUTING.md file for detailed guidelines.

Testing and validation

I tested the code by changing the number of processes and it seems working. However, as I tested only for a few steps and Athena++ does not support bitwise reproducibility anyway, I could not cover sufficient test cases. So I would like to request @Edwinchan and @kahoooo to test it thoroughly. Also, this version can be slower than the original as it calls MPI_File_read many times.

EdwinChan commented 2 years ago

Thanks, @tomidakn, for the patch.

I'm not sure I can come up with comprehensive tests either. I simply tested the patch against my one problem generator on my local Kubuntu 22.04 machine.

My problem is Newtonian MHD. The input file is tweaked so that the total number of bytes read during restart, according to count*size, is slightly larger than 2³¹−1 (see https://github.com/PrincetonUniversity/athena/issues/449#issue-1316071702). Specifically, I adopt a 320×320×256 grid divided into 40×40×32 blocks; this implies 2304704512 bytes read and ~17 GiB memory used during the run. First I run the simulation for 10 steps, then I rerun it from the zeroth restart file. For either half of the process, I can choose to utilize 1 core or 8 cores (i.e., mpiexec -n 1 or mpiexec -n 8). The initial runs are labeled 1/‒ or 8/‒ reflecting the core count, and the restart runs 1/1, 1/8, 8/1, or 8/8 depending on the core counts in the two halves respectively. The outputs are double-precision HDF5 and restart files.

Before the patch is applied (commit 91154cd92f8d8dd171f2cc4eaf0ece126b3152d7):

After the patch is applied (commit 8e876ec99dad05c4ee5b2e12d9b59a04721f4826):

It seems the core count affects the writing of restart files as well, but not of HDF5 files. Once the data is in memory, the simulation proceeds normally.

There should be a cleaner way to test this.

tomidakn commented 2 years ago

Thank you for your help, it indeed sounds like I need to fix the writing side. I'll take care of it tomorrow.

tomidakn commented 2 years ago

OK, I updated the code, so please try it again. If possible, please measure the performance of the restart output, as the new version may be slower when there are more than one MeshBlocks per process.

Also I noticed that HDF5 fails in the same way due to the size limit of MPI-IO, but this is more complicated fix.

EdwinChan commented 2 years ago

With the newest patch (commit fd8838c4a0e25babc3647762f00a46a73bf2073a), all six tests produce bitwise-identical HDF5 files according to h5diff.

Reading and writing are timed by wrapping

MPI_Barrier(MPI_COMM_WORLD);
auto const start_time = std::chrono::high_resolution_clock::now();

// code to be instrumented

MPI_Barrier(MPI_COMM_WORLD);
auto const end_time = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> const elapsed_time = end_time - start_time;
if (Globals::my_rank == 0)
{
  std::ios_base::fmtflags const f = std::cout.flags();
  std::streamsize const p = std::cout.precision();
  std::cout
    << "read " // or "write "
    << std::fixed
    << std::setprecision(std::numeric_limits<Real>::max_digits10)
    << elapsed_time.count()
    << std::endl;
  std::cout.flags(f);
  std::cout.precision(p);
}

around

https://github.com/PrincetonUniversity/athena/blob/fd8838c4a0e25babc3647762f00a46a73bf2073a/src/main.cpp#L273

https://github.com/PrincetonUniversity/athena/blob/fd8838c4a0e25babc3647762f00a46a73bf2073a/src/outputs/outputs.cpp#L791

For commit 91154cd92f8d8dd171f2cc4eaf0ece126b3152d7, the following appears in stdout:

# 1/‒

write 2.85688341200000018
write 2.68509195300000014
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 3.01376590699999980
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 3.00765479600000019
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 2.95880921500000005
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 2.47136143300000022
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 2.49380397199999981
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 2.43229536099999999
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 2.52017927999999980
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 2.44394460299999983
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 2.40547068500000005
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 2.43481400599999986
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 2.36243144800000016
write 2.25674553499999986
# 1/8

read 2.77587808600000008
cycle=0 time=0.0000000000000000e+00 dt=9.2139955949903797e-05
write 5.79613393500000029
cycle=1 time=9.2139955949903797e-05 dt=1.8427991189980759e-04
write 5.21660158300000010
cycle=2 time=2.7641986784971139e-04 dt=3.6855982379961519e-04
write 3.85742325899999994
cycle=3 time=6.4497969164932652e-04 dt=7.3711964759923037e-04
write 2.63718628800000010
cycle=4 time=1.3820993392485569e-03 dt=1.4608836670405305e-03
write 2.18363685399999996
cycle=5 time=2.8429830062890876e-03 dt=1.5463241927739127e-03
write 1.31849474699999991
cycle=6 time=4.3893071990630007e-03 dt=1.4112370634306805e-03
write 1.33856850600000010
cycle=7 time=5.8005442624936808e-03 dt=1.5234273488151030e-03
write 24.69607051199999859
cycle=8 time=7.3239716113087834e-03 dt=1.5754872447729892e-03
write 1.30464077499999997
cycle=9 time=8.8994588560817722e-03 dt=1.4940537825668737e-03
write 1.39821116000000001
cycle=10 time=1.0393512638648645e-02 dt=1.4258866537184393e-03
write 2.44033701599999997
write 13.53793887299999987
# 8/‒

write 1.70791116700000001
write 2.02368670200000000
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 4.10066783200000007
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 1.70061407800000008
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 22.44233465499999980
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 1.69510627500000011
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 1.65771553499999991
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 1.67478461400000000
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 1.71517355699999996
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 1.68274165500000006
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 1.70476190699999997
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 1.74579471199999992
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 1.60764617699999990
write 1.80862639100000000
# 8/8

read 35.58767814099999782
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 1.68869585199999994
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 1.70253844299999990
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 21.94872496099999992
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 1.69473269499999990
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 1.72963343399999991
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 1.73287071500000001
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 1.72461685899999995
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 1.78230561299999990
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 1.71037024399999993
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 1.69709692299999992
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 1.66515598799999998
write 1.78225885000000006

For commit fd8838c4a0e25babc3647762f00a46a73bf2073a, the following appears in stdout:

# 1/‒

write 8.24727053700000035
write 2.18598581600000008
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 2.92008659300000017
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 2.94029370500000020
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 2.86723000600000022
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 3.00653477999999996
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 2.93282153299999981
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 2.92439550299999995
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 2.83137117900000002
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 2.91801124500000020
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 2.82958649399999995
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 2.78455825200000007
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 2.96037788300000004
write 5.48945161600000020
# 1/1

read 10.46596617299999998
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 2.44640130800000000
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 2.32525672600000011
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 2.45055142399999992
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 2.42916570800000020
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 2.31343221699999990
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 2.41942592400000001
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 2.33430586699999987
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 2.31584342500000018
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 2.28887759400000013
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 2.39081437299999999
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 2.41528312199999995
write 4.99107331200000015
# 1/8

read 24.45879943999999995
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 1.38723885799999991
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 1.44676646599999992
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 1.27677226699999991
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 4.80661324400000023
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 1.26860121499999989
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 1.35016974599999995
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 1.31868180100000010
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 1.37847347800000009
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 1.39447341699999994
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 2.01367026500000001
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 8.21415501399999926
write 5.26242040400000022
# 8/‒

write 1.89373354400000005
write 2.04032539000000002
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 1.77035240400000005
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 1.88750216299999996
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 1.84067213600000001
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 1.53160766499999990
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 1.64788669900000007
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 1.61608015100000002
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 5.50133934900000021
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 1.72845654199999998
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 1.61341143300000001
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 1.59550550300000005
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 2.04725704200000003
write 1.80603998300000002
# 8/1

read 33.10635870899999844
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 3.12800003499999990
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 2.95792124600000017
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 2.73799392000000008
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 6.46900679699999959
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 2.89173132599999994
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 2.92215940300000021
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 2.52838300300000007
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 2.68814539800000007
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 2.78328105199999998
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 2.68832376699999998
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 4.15888177899999967
write 1.93092347199999992
# 8/8

read 10.12237845399999969
cycle=0 time=0.0000000000000000e+00 dt=4.6069977974951898e-05
write 3.77284238799999994
cycle=1 time=4.6069977974951898e-05 dt=9.2139955949903797e-05
write 1.29569966599999997
cycle=2 time=1.3820993392485569e-04 dt=1.3620636704643780e-04
write 1.31252910800000011
cycle=3 time=2.7441630097129350e-04 dt=1.3620641356661955e-04
write 1.33423492599999993
cycle=4 time=4.1062271453791302e-04 dt=1.3620649406077381e-04
write 12.97872216999999928
cycle=5 time=5.4682920859868680e-04 dt=1.3620660758606893e-04
write 1.23230557800000007
cycle=6 time=6.8303581618475576e-04 dt=1.3620675335288809e-04
write 1.39366523800000008
cycle=7 time=8.1924256953764380e-04 dt=1.3620693069544177e-04
write 1.32595267499999991
cycle=8 time=9.5544950023308556e-04 dt=1.3620713892923235e-04
write 1.82522785399999998
cycle=9 time=1.0916566391623180e-03 dt=1.3620737725993331e-04
write 1.36312481499999993
cycle=10 time=1.2278640164222513e-03 dt=1.3620764520974962e-04
write 12.71595353000000017
write 12.04388003400000073

Timing results are fairly inconsistent even though the machine under test is left alone as much as possible. Strangely, the load increases by more than 8 even though at most 8 cores are used.

tomidakn commented 2 years ago

Thank you for your help. The performance seems slightly degraded as I expected, but I'd think it is acceptable, at least for the moment. We should keep an eye of it for massively parallel runs, though.

So, once @kahoooo agrees too, we can close this issue.

kahoooo commented 2 years ago

Hi @tomidakn and @EdwinChan, sorry for the delay. I am just able to test it on our problem. I can verify that this PR also solves the error in our case. However, I didn't benchmark the I/O performance but I am happy to do so if needed.

felker commented 2 years ago

Ready to merge @tomidakn ?

tomidakn commented 2 years ago

Sorry, I forgot to merge this.