PrincetonUniversity / athena

Athena++ radiation GRMHD code and adaptive mesh refinement (AMR) framework
https://www.athena-astro.app
BSD 3-Clause "New" or "Revised" License
226 stars 122 forks source link

Restart with too few cores produces confusing error #449

Closed EdwinChan closed 2 years ago

EdwinChan commented 2 years ago

Prerequisite checklist

Place an X in between the brackets on each line as you complete these checks:

Summary of issue

Restarting a simulation with fewer cores than used to create the restart file can result in a confusing error message.

Steps to reproduce

  1. Write a restart file with many cores
  2. Read the restart file with a smaller number of cores (on an interactive node for debugging purposes, for example) such that each core reads >2 GiB of data

Expected outcome

An error message indicating the file is too large or there isn't enough memory; or ideally, no errors.

Actual outcome

### FATAL ERROR in Mesh constructor
The restart file is broken or input parameters are inconsistent.

Additional comments and/or proposed solution

The restart file isn't corrupted and the input parameters haven't changed. This error message causes unnecessary panic (data corruption is serious) and points the user in the wrong direction.

The error originates from these lines in IOWrapper::Read_at_all():

https://github.com/PrincetonUniversity/athena/blob/ae02dd19c3510cf4c170d2d22ef80d3979c4f2c5/src/outputs/io_wrapper.cpp#L111-L112

MPI functions can't take a count argument greater than 2³¹−1 because int is 32-bit and INT_MAX == 2³¹−1 on Linux (e.g., here, here). If count*size exceeds this limit, that is, if each core reads >2 GiB of data (which is not that unreasonable), MPI returns MPI_ERR_COUNT (== 2 typically). The error goes away when there are just enough cores that count*size falls below the limit.

We can't expect custom handling for each MPI error, but swallowing the error and making a blanket claim is misleading. It'd be better to say something like "error when reading restart, in function blah, line blah, MPI error code blah," which gives the user enough information to start looking.

Alternatively, we can have an effective count greater than INT_MAX if we use larger data types (here).

Version info

felker commented 2 years ago

Thanks for the report! Any interest in forking the code, making your suggested changes, then opening a PR?

tomidakn commented 2 years ago

I think this can be easily fixed by loading each Meshblock separately. I am happy to take care of it if you could give me a few weeks.

kahoooo commented 2 years ago

This issue is also blocking us. Just want to double-check before we try to fix it. Did anyone start working on the fix?

EdwinChan commented 2 years ago

I wasn't because there are multiple ways of fixing it (changing how restart files are read or issuing a more informative warning), and because @tomidakn seems interested in taking care of it.

tomidakn commented 2 years ago

Please take a look at #453 and please test if this version fixes your problem.

felker commented 2 years ago

Closing as fixed for now.