E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

PIO error on cetus for current master #595

Closed wlin7 closed 8 years ago

wlin7 commented 8 years ago

Hi @jayeshkrishna ,

I am doing a test on cetus using the master merged up to Dec. 28. It is ne30+L72 with compset FC5CLBMG2MAM4RESUS (including most of the modules for v1). The following error appeared at the time of writing cam.h0.

 17: pionfwrite_mod::write_nfdarray_double 107 IAM: 2 start: 1 5 1 count: 48602 2 1 size : 1 error: -60 -872382432 32
 1: pionfwrite_mod::write_nfdarray_double 107 IAM: 0 start: 1 1 1 count: 48602 2 1 size : 1 error: -60 -872349660 129
 9: pionfwrite_mod::write_nfdarray_double 107 IAM: 1 start: 1 3 1 count: 48602 2 1 size : 1 error: -60 -872382432 32
 17: pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : Numeric conversion not representable

Any clue what could be the cause and what can be tested?

The run used 2048 tasks and 16 threads on 512 nodes, with PIO_NUMTASKS=128 and PIO_STRIDE=8. Thanks.
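For readers trying to match the rank prefixes in the log above to the PIO settings, here is a minimal sketch of a strided I/O task layout. It assumes I/O tasks are placed at `root + i * stride` and that the root is 1; the root is not stated in this thread and is chosen here only because it reproduces the ranks 1, 9 and 17 seen in the log. The function name `io_task_ranks` is hypothetical, not part of PIO.

```python
def io_task_ranks(num_iotasks, stride, root=1):
    """Return the MPI ranks acting as I/O tasks in a strided layout.

    root=1 is an assumption inferred from the log, not a value given in the thread.
    """
    return [root + i * stride for i in range(num_iotasks)]


# Settings quoted in the report: PIO_NUMTASKS=128, PIO_STRIDE=8
ranks = io_task_ranks(num_iotasks=128, stride=8)
print(ranks[:3])  # first three I/O ranks
print(ranks[-1])  # last I/O rank; should stay below the 2048 MPI tasks
```

Under these assumptions, the first three I/O ranks line up with the `IAM: 0`, `IAM: 1` and `IAM: 2` entries in the log.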

jayeshkrishna commented 8 years ago

The error message seems to come from pnetcdf when trying to write data. Are there any outlier values in the data (e.g. inf) that could cause this?
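One way to check for such outliers is to scan the field before it is handed to the I/O layer. The sketch below is only illustrative: it assumes the history variable is stored in single precision (not confirmed in this thread), and the helper name `find_unrepresentable` is hypothetical. It flags inf, NaN, and finite doubles whose magnitude exceeds the largest finite float32.

```python
import math

# Largest finite IEEE-754 single-precision value
FLOAT32_MAX = 3.4028234663852886e38


def find_unrepresentable(values):
    """Return (index, value) pairs that cannot be stored as a finite float32."""
    bad = []
    for i, v in enumerate(values):
        # inf and NaN fail isfinite; huge finite doubles fail the range check
        if not math.isfinite(v) or abs(v) > FLOAT32_MAX:
            bad.append((i, v))
    return bad


sample = [1.0, float("inf"), 2.5e-3, 1.0e39, float("nan")]
for index, value in find_unrepresentable(sample):
    print(index, value)
```

In a real run the same test would be applied to the model field that is being written when the error occurs, which helps narrow down which variable and which columns carry the bad values.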

wlin7 commented 8 years ago

@jayeshkrishna , indeed it appears the root problem is outside of PIO. With only CLUBB turned on, it runs OK. Several other major atmos mods are enabled in the failed run, including MAM4 and PolarMods. Some of these mods are not working properly on cetus. @kaizhangpnl , can you also test the v1alpha integration on cetus?

There is one other cetus/mira problem: with the standard script in current master, the link step that creates the executable fails. The error is an undefined reference to the xlsmp library, such as cam_history.F90:4433: undefined reference to `_xlsmpParSelf'.

Don't know if the problem has been reported and fixed in a branch. For now, I just applied a workaround in Macros by adding the following.

 ifeq ($(SUPPORTS_CXX), TRUE)
   LDFLAGS += $(CXX_LIBS)
 endif

Is it supposed to work like this? Can you please take a look and take care of it? BTW, setting SUPPORTS_CXX = FALSE does not disable the use of CXX_LIBS, which includes libxlsmp.

jayeshkrishna commented 8 years ago

Looks like an issue with mixing threaded (OMP) flags with a non-threaded build

jayeshkrishna commented 8 years ago

When specifying the flags in the makefile, the threaded (OpenMP) flags should not be specified for single-threaded builds. Is this issue on master? What case can I use to reproduce the problem?

wlin7 commented 8 years ago

I feel like it is not limited to non-threaded builds. It happens in builds with single or multiple threads. For a test, you may try any typical configuration, for example with "-compset FC5 -res ne30_ne30", and check the Macros.

Regarding the threading flag, -qsmp=omp is always used for compilation in multi-threaded configurations. But -qsmp=omp is no longer included in LDFLAGS in the current master for cetus/mira, while it was used in a slightly earlier master (merged up to late November). Don't know if -qsmp=omp is really needed in LDFLAGS.

With CXX_LIBS, which includes -lxlsmp, the build does succeed. Not sure if it is the proper way, though. The earlier version of master does not involve CXX_LIBS (and does not seem to need libxlsmp, at least not explicitly through LDFLAGS).

jayeshkrishna commented 8 years ago

Thanks @wlin7 , I will test the case above on master and get back to you with my findings.

jayeshkrishna commented 8 years ago

Yes, this (using CXX_LIBS instead of -qsmp=omp) was a change made in c2165ece13b469e4fc1f91eb694a0719099e67f2, and this case was definitely not covered (tested) in the commit. I am working on a possible fix and will let you know how it goes.

wlin7 commented 8 years ago

Thanks. A formal fix would be nice. For now, I am OK with my approach of adding the few lines in Macros. I have also posted this issue on https://acme-climate.atlassian.net/wiki/display/ATM/Problems+running+post-cime+v1-alpha+model in case some users happen to be using the current version on mira/cetus.

jayeshkrishna commented 8 years ago

The discussion related to the build failures on Cetus/Mira continues in #599

jayeshkrishna commented 8 years ago

@wlin7 : Can we close this issue ("Numeric conversion not representable" error) and continue the build error discussions in #599 ? If you want to keep this issue (error : "Numeric conversion not representable") open please continue the build error discussions in #599 .

wlin7 commented 8 years ago

Hi @jayeshkrishna , yes, please close this one. #599 definitely is the better place for further discussion.

jayeshkrishna commented 8 years ago

Closing this issue. The "Numeric conversion not representable" error is most likely due to data passed to PIO that cannot be written out using PnetCDF (e.g. data values like inf that need to be replaced by missing values). The discussion of the build error on Cetus/Mira continues in #599.
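As a rough illustration of the replacement suggested in the closing comment, the sketch below swaps inf and NaN for a missing-value marker before output. The constant `FILL_VALUE` and the helper `replace_nonfinite` are hypothetical names introduced here; the actual fill value should be whatever the model already uses for the variable in question.

```python
import math

# Hypothetical missing-value marker; substitute the model's own fill value
FILL_VALUE = 1.0e36


def replace_nonfinite(values, fill=FILL_VALUE):
    """Return a copy of values with inf and NaN replaced by fill."""
    return [v if math.isfinite(v) else fill for v in values]


cleaned = replace_nonfinite([0.5, float("-inf"), float("nan"), 3.0])
print(cleaned)
```

This only masks the symptom at write time. Since the thread points to the physics mods as the likely source, the non-finite values themselves still need to be traced back to where they are produced.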