beatrixparis / connectivity-modeling-system

The CMS is a multiscale stochastic Lagrangian framework developed by the Paris Lab at the Rosenstiel School of Marine, Atmospheric & Earth Science to study complex behaviors, giving probabilistic estimates of dispersion, connectivity, fate of pollutants, and other Lagrangian phenomena. This repository facilitates community contributions to CMS modules.
https://beatrixparis.github.io/connectivity-modeling-system/
GNU General Public License v3.0

Error when running ./cms with MPI | Error in the netCDF file, NetCDF invalid dimension or name, floating-point exceptions are signalling #52

Closed: silvaglx closed this issue 3 weeks ago

silvaglx commented 3 weeks ago

Hi everyone,

I've been using CMS for the past few months and had no problems compiling and running it without MPI. Recently, I started the next step of my research, in which I'll perform some simulations in an HPC environment and therefore need to use MPI. However, whenever I try running ./cms (with or without mpirun) I receive the following error message:

 Error in the netCDF file: expt_tsubame_test/nests/nest_1_20050101000000.nc
 NetCDF: Invalid dimension ID or name
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
--------------------------------------------------------------------------
prterun has exited due to process rank 13 with PID 3469561 on node r19n9 exiting
improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in the
job did. This can cause a job to hang indefinitely while it waits for
all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "prte_abort" and the mca
parameter prte_create_session_dirs is set to false. In this case, the
run-time cannot detect that the abort call was an abnormal
termination. Hence, the only error message you will receive is this
one.

This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).

You can avoid this message by specifying -quiet on the prterun command
line.
--------------------------------------------------------------------------

I want to highlight that this error only happens with the ./cms executable. ./getdata works just fine, both for the HYCOM example and for my own data. Also, the same nest files that trigger the error under MPI can be used without any problem in a version compiled without MPI, so I'm confident the error is not caused by the nest files themselves.

Right now, my HPC environment settings are as follows: GCC 11.4.1, OpenMPI 5.0.2, HDF5 1.14.3, NetCDF-C 4.9.2, NetCDF-Fortran 4.6.0.

I have also tried many combinations of NetCDF-C, NetCDF-Fortran, and HDF5 versions, including the ones I was using on my personal computer without MPI. The only thing I haven't tried yet is an older OpenMPI version. I was going to attempt that, but I ended up in a frustrating loop, since an older OpenMPI would also require an older GCC to build, and so on.

I would really appreciate it if someone currently using CMS with MPI could share which package versions you are using, or give some hints about the causes of and solutions to this problem.

Thank you very much!

milancurcic commented 3 weeks ago

Hi, this error is not MPI related. I suspect there's a disconnect between the dimensions in your nest file and your nest namelist file. Can you post the output of ncdump -h nest_1_20050101000000.nc as well as your nest_1.nml namelist file?
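For anyone hitting the same "Invalid dimension ID or name" error, the dimension and variable names the nest file actually contains can also be listed with Python's netCDF4 package. This is only a minimal sketch; the path below is taken from the error message above and may need adjusting to your experiment layout.

```python
# Minimal sketch: list the dimensions and variables of a nest file so they
# can be compared against the names declared in nest_1.nml.
# Assumes the netCDF4 package is installed and the path below exists.
from netCDF4 import Dataset

with Dataset("expt_tsubame_test/nests/nest_1_20050101000000.nc") as nc:
    print("Dimensions:")
    for name, dim in nc.dimensions.items():
        print(f"  {name}: {len(dim)}")
    print("Variables:")
    for name, var in nc.variables.items():
        print(f"  {name}: dims={var.dimensions}, shape={var.shape}")
```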

silvaglx commented 3 weeks ago

Thank you very much for your recommendation! You were right, the problem doesn't seem to be related to MPI at all. I checked my NetCDF files and realized ./getdata is failing to store Time as a valid variable in my nest files. I still couldn't figure out why, even after several tests. I also tried manually inserting the time dimension after running ./getdata, but the same error persists. Here are my nest_1.nml and ncdump outputs for checking (a small comparison sketch follows the attachments). Note: the HYCOM example is actually running properly.

nest_1.txt

ncdump_hycom_example.txt

ncdump_own_data.txt
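Since the working HYCOM example and the failing local nest are both attached as ncdump output, the quickest way to spot what ./getdata did differently is to diff their dimension and variable names. Below is a hedged Python sketch of that comparison; both file paths are assumptions and should point at your actual nest files.

```python
# Sketch: compare dimension/variable names between the working HYCOM example
# nest and the nest built from local data, to spot what differs structurally.
# Both paths are assumptions; adjust them to your experiment directories.
from netCDF4 import Dataset

def names(path):
    with Dataset(path) as nc:
        return set(nc.dimensions), set(nc.variables)

hycom_dims, hycom_vars = names("expt_hycom_example/nests/nest_1_20050101000000.nc")
own_dims, own_vars = names("expt_tsubame_test/nests/nest_1_20050101000000.nc")

print("Dimensions only in HYCOM example:", hycom_dims - own_dims)
print("Dimensions only in own data:     ", own_dims - hycom_dims)
print("Variables only in HYCOM example: ", hycom_vars - own_vars)
print("Variables only in own data:      ", own_vars - hycom_vars)
```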

silvaglx commented 3 weeks ago

Hi,

I realized that the error was indeed related to a mismatch between my nest_1.nml file and my NetCDF files. Please disregard my previous comments 😅

When you run ./getdata, the dimension and variable names in your input NetCDF files are modified. I was aware of this behavior, but when running ./cms without MPI it was not necessary for the variable names in nest_1.nml to match the files already processed by ./getdata. However, when using MPI they must match. So, if you're running CMS with your local data, be sure to rename the variables in nest_1.nml after running ./getdata and before running ./cms.
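As a quick sanity check after running ./getdata, the string entries in nest_1.nml can be cross-checked against the names in the processed nest file. The sketch below assumes the f90nml and netCDF4 Python packages and the paths shown; it flags every string-valued namelist entry that is not a dimension or variable name in the file, so unrelated string settings will also show up and the output should be read as hints rather than errors.

```python
# Sketch: flag namelist string values that do not match any dimension or
# variable name in the processed nest file. Assumes the f90nml and netCDF4
# packages are installed; adjust the two paths to your experiment layout.
import f90nml
from netCDF4 import Dataset

nml = f90nml.read("nest_1.nml")
with Dataset("expt_tsubame_test/nests/nest_1_20050101000000.nc") as nc:
    known = set(nc.dimensions) | set(nc.variables)

for group, entries in nml.items():
    for key, value in entries.items():
        # Only string-valued entries can be dimension/variable names.
        if isinstance(value, str) and value and value not in known:
            print(f"{group}%{key} = '{value}' not found in the nest file")
```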

Thank you so much for your help, @milancurcic. Since my main issue has been resolved, I'll go ahead and close this issue.