Closed EdwinChan closed 2 years ago
Thanks for the report! Any interest in forking the code, making your suggested changes, then opening a PR?
I think this can be easily fixed by loading each Meshblock separately. I am happy to take care of it if you could give me a few weeks.
This issue is also blocking us. Just want to double-check before we try to fix it. Did anyone start working on the fix?
I wasn't because there are multiple ways of fixing it (changing how restart files are read or issuing a more informative warning), and because @tomidakn seems interested in taking care of it.
Please take a look at #453 and please test if this version fixes your problem.
Closing as fixed for now.
Prerequisite checklist
Place an X in between the brackets on each line as you complete these checks:
Summary of issue
Restarting a simulation with fewer cores than used to create the restart file can result in a confusing error message.
Steps to reproduce
Expected outcome
An error message indicating the file is too large or there isn't enough memory; or ideally, no errors.
Actual outcome
Additional comments and/or proposed solution
The restart file isn't corrupted and the input parameters haven't changed. This error message causes unnecessary panic (data corruption is serious) and points the user in the wrong direction.
The error originates from these lines in
IOWrapper::Read_at_all()
:https://github.com/PrincetonUniversity/athena/blob/ae02dd19c3510cf4c170d2d22ef80d3979c4f2c5/src/outputs/io_wrapper.cpp#L111-L112
MPI functions can't take a count argument greater than 2³¹−1 because
int
is 32-bit andINT_MAX
== 2³¹−1 on Linux (e.g., here, here). Ifcount*size
exceeds this limit, that is, if each core reads >2 GiB of data (which is not that unreasonable), MPI returnsMPI_ERR_COUNT
(== 2 typically). The error goes away when there are just enough cores thatcount*size
falls below the limit.We can't expect custom handling for each MPI error, but swallowing the error and making a blanket claim is misleading. It'd be better to say something like "error when reading restart, in function blah, line blah, MPI error code blah," which gives the user enough information to start looking.
Alternatively, we can have an effective count greater than
INT_MAX
if we use larger data types (here).Version info