IAS-Astrophysics / athenak

Performance-portable version of the Athena++ astrophysical AMR-MHD code using Kokkos.

Trying to understand binary dumps when data sizes straddle 2^31 bytes #404

Open jmstone opened 10 months ago

jmstone commented 10 months ago

In GitLab by @c-white on Dec 2, 2023, 10:57

I just want to check that there isn't a subtle possibility of a hanging bug. Consider the code for outputting full dumps:

https://gitlab.com/theias/hpc/jmstone/athena-parthenon/athenak/-/blob/master/src/outputs/binary.cpp#L226-261

If a rank has more than 2^31 bytes of data to write, it breaks up the write and does 1 MeshBlock at a time, but this decision is based on a local value. What if rank 0 has 3 MeshBlocks, pushing it over the 2^31-byte threshold, but rank 1 has 2 MeshBlocks, keeping it under? It seems rank 1 will take the first branch and trigger a single MPI_File_write_at_all, while rank 0 will take the second branch. Rank 0 will then trigger 2 MPI_File_write_at_all calls followed by a single MPI_File_write_at for its 3 MeshBlocks. Since MPI_File_write_at_all is collective, it seems this will hang: rank 0 makes 2 collective calls while rank 1 makes only 1.
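To make the counting argument concrete, here is a minimal sketch that models the scenario above without MPI. It is not taken from binary.cpp: the function name `CollectiveCalls`, the 800 MiB block size, and the rule that the per-MeshBlock branch issues one collective write per MeshBlock up to the smallest MeshBlock count on any rank are all assumptions, chosen only to reproduce the 2-versus-1 example in this comment.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical model, not AthenaK code: the 2^31-byte threshold from the issue.
constexpr std::int64_t kLimit = std::int64_t{1} << 31;

// Returns how many collective writes each rank would issue.
// nmb_per_rank must be non-empty.
// Assumed rule: a rank in the single-write branch issues 1 collective call;
// a rank in the per-MeshBlock branch issues one collective call per MeshBlock
// up to the minimum MeshBlock count over all ranks (the rest are independent).
std::vector<int> CollectiveCalls(const std::vector<int>& nmb_per_rank,
                                 std::int64_t bytes_per_mb,
                                 bool global_decision) {
  const int min_mb =
      *std::min_element(nmb_per_rank.begin(), nmb_per_rank.end());

  // True if at least one rank exceeds the threshold.
  bool any_over = false;
  for (int nmb : nmb_per_rank) {
    if (nmb * bytes_per_mb > kLimit) any_over = true;
  }

  std::vector<int> calls;
  for (int nmb : nmb_per_rank) {
    const bool local_over = nmb * bytes_per_mb > kLimit;
    // Local decision: each rank looks only at its own size.
    // Global decision: every rank splits if any rank is over the limit.
    const bool split = global_decision ? any_over : local_over;
    calls.push_back(split ? min_mb : 1);
  }
  return calls;
}
```

With MeshBlock counts `{3, 2}` and 800 MiB per MeshBlock, the local decision gives rank 0 two collective calls and rank 1 one, so the counts disagree. Passing `global_decision = true` gives every rank the same count. In an MPI code, one way to get such a shared flag would be an `MPI_Allreduce` over each rank's local over-limit test, though whether that is the right fix here is for the maintainers to judge.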

jmstone commented 8 months ago

In GitLab by @jmstone216 on Feb 25, 2024, 12:11

As an update to this issue, the binary files should be writing Reals, not bytes, so as to increase the maximum size that can be written and to get rid of what looks like an unnecessary memcpy. This should be done as part of an overall update of the IOWrapper class to read/write any_type, which was started when particle outputs were added.
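As a rough illustration of why counting in Reals raises the ceiling: the MPI C bindings take the element count as an `int`, so the largest single write scales with the element size. The sketch below assumes a 32-bit `int` and that Real is either `float` or `double`. The helper name `MaxBytesPerWrite` is hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Largest payload, in bytes, that a single write can describe when the
// element count is an int and each element is element_size bytes wide.
constexpr std::int64_t MaxBytesPerWrite(std::int64_t element_size) {
  return static_cast<std::int64_t>(std::numeric_limits<int>::max()) *
         element_size;
}
```

Under those assumptions, `MaxBytesPerWrite(1)` is just under 2 GiB for a byte count, while `MaxBytesPerWrite(sizeof(double))` is just under 16 GiB, so the same call can cover 8 times more data per rank before any splitting is needed.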