amanzi / ats

Advanced Terrestrial Simulator (ATS) development

"restart from checkpoint file" not working #242

Closed: saubhagya-gatech closed this issue 8 months ago

saubhagya-gatech commented 9 months ago

This is to report that the "restart from checkpoint file" feature appears to be broken on the master branch. I tested it on two different HPC systems (Perlmutter and CADES), and each produced a different error message. However, when I use a checkpoint file only to set initial conditions, I get no error (although, of course, in that case we do not get fluxes for the first observation).

Error on Perlmutter:

terminate called after throwing an instance of 'Errors::Message'
  what():  HDF5_MPI: error opening file "checkpoint_final.h5" with READ_WRITE access.

Error on CADES:

*** An error occurred in MPI_Allreduce
*** reported by process [216662016,1]
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_TRUNCATE: message truncated
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

saubhagya-gatech commented 8 months ago

This is resolved. The build on that particular HPC system was outdated.