Closed wlin7 closed 8 years ago
The error message seems to be from pnetcdf when trying to write data. Are there any outlier values in the data (inf) that could cause this?
@jayeshkrishna , indeed it appears the root problem is outside of PIO. With only CLUBB turned on, the run completes fine. Several other major atmosphere mods are enabled in the failed run, including MAM4 and PolarMods. Some of these mods are not working properly on cetus. @kaizhangpnl , can you also test the v1alpha integration on cetus?
There is one other cetus/mira problem: the standard script in current master fails at the link step when creating the executable. The error is an undefined reference to the xlsmp library, such as cam_history.F90:4433: undefined reference to `_xlsmpParSelf'.
I don't know whether the problem has been reported and fixed in a branch. For now, I applied a workaround in Macros by adding the following:
ifeq ($(SUPPORTS_CXX), TRUE)
LDFLAGS += $(CXX_LIBS)
endif
Is it supposed to work like this? Can you please take a look and take care of it? BTW, setting SUPPORTS_CXX = FALSE does not disable the use of CXX_LIBS, which includes libxlsmp.
Looks like an issue with mixing threaded (OMP) flags with a non-threaded build.
When specifying the flags in the makefile, the threaded (OpenMP) flags should not be specified for single-threaded builds. Is this issue on master? What case can I use to reproduce the problem?
I don't think it is limited to non-threaded builds. It happens in builds with single or multiple threads. For a test, you may try any typical configuration, for example with "-compset FC5 -res ne30_ne30", and check the Macros.
Regarding the threading flag, -qsmp=omp is always used at compile time for multi-threaded configurations. But -qsmp=omp is no longer included in LDFLAGS in the current master for cetus/mira, while it was in a slightly earlier master (merged up to late November). I don't know if -qsmp=omp is really needed in LDFLAGS.
With CXX_LIBS, which includes -lxlsmp, the build does succeed. I am not sure it is the proper way, though. The earlier version of master does not involve CXX_LIBS (and does not seem to need libxlsmp, at least not explicitly through LDFLAGS).
Thanks @wlin7 , I will test the case above on master and get back to you with my findings.
Yes, this (using CXX_LIBS instead of -qsmp=omp) was a change made in c2165ece13b469e4fc1f91eb694a0719099e67f2, and this case was definitely not covered (tested) in the commit. I am working on a possible fix and will let you know how it goes.
Thanks. A formal fix would be nice. For now, I am ok with my approach of adding the few lines in Macros. I have also posted this issue on https://acme-climate.atlassian.net/wiki/display/ATM/Problems+running+post-cime+v1-alpha+model in case some users happen to be using the current version on mira/cetus.
The discussion related to the build failures on Cetus/Mira continues in #599.
@wlin7 : Can we close this issue ("Numeric conversion not representable" error) and continue the build error discussions in #599? If you want to keep this issue (error: "Numeric conversion not representable") open, please still continue the build error discussions in #599.
Hi @jayeshkrishna , yes, please close this one. #599 definitely is the better place for further discussion.
Closing this issue. The "Numeric conversion not representable" error is most likely due to data passed to PIO that cannot be written out using PnetCDF, e.g. data values like inf that need to be replaced by missing values. Discussion of the build error on Cetus/Mira continues in #599.
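For anyone hitting the same error, here is a minimal sketch of the kind of pre-write check suggested above. It is not PIO or CAM code. It assumes the history variable is stored in single precision (so doubles beyond the float32 range also fail to convert) and uses 1.0e36 as an illustrative missing value; the function name `sanitize` is hypothetical.

```python
import math

# Largest finite IEEE-754 single-precision value. Assumption: the target
# variable is a 4-byte float, so larger doubles cannot be converted.
FLOAT32_MAX = 3.4028234663852886e38


def sanitize(values, fill_value=1.0e36):
    """Replace inf, NaN, and out-of-range values with fill_value.

    Returns the cleaned list and the indices that were replaced.
    fill_value is an illustrative missing value, not taken from this thread.
    """
    cleaned = []
    bad = []
    for i, v in enumerate(values):
        if not math.isfinite(v) or abs(v) > FLOAT32_MAX:
            bad.append(i)
            cleaned.append(fill_value)
        else:
            cleaned.append(v)
    return cleaned, bad


# Hypothetical field with one inf, one out-of-range double, and one NaN.
field = [273.15, float("inf"), 1.0e40, float("nan"), -5.0]
cleaned, bad = sanitize(field)
print("replaced indices:", bad)
```

Reporting the offending indices, rather than only replacing them silently, helps trace which module produced the bad values in the first place.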
Hi @jayeshkrishna ,
I am doing a test on cetus using the master merged up to Dec. 28. It is ne30+L72 with compset FC5CLBMG2MAM4RESUS (including most of the modules for v1). The following error appeared at the time of writing cam.h0.
17: pionfwrite_mod::write_nfdarray_double 107 IAM: 2 start: 1 5 1 count: 48602 2 1 size : 1 error: -60 -872382432 32
1: pionfwrite_mod::write_nfdarray_double 107 IAM: 0 start: 1 1 1 count: 48602 2 1 size : 1 error: -60 -872349660 129
9: pionfwrite_mod::write_nfdarray_double 107 IAM: 1 start: 1 3 1 count: 48602 2 1 size : 1 error: -60 -872382432 32
17: pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 250 : Numeric conversion not representable
Any clue what could be the cause and what can be tested?
The run used 2048 tasks and 16 threads on 512 nodes, with PIO_NUMTASKS=128 and PIO_STRIDE=8. Thanks.
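As a sanity check on the I/O layout, the rank prefixes in the log (1, 9, 17 reporting IAM 0, 1, 2) are consistent with I/O tasks spaced PIO_STRIDE apart starting at rank 1. The sketch below reproduces that mapping. It is an illustration only, not PIO's actual selection logic: the starting rank of 1 is inferred from the log above, and the helper name `io_ranks` is hypothetical.

```python
def io_ranks(num_tasks, num_iotasks, stride, root=1):
    """List the compute ranks that would act as I/O tasks.

    Assumption: I/O tasks sit at root, root + stride, root + 2*stride, ...
    The default root=1 is inferred from the log prefixes, not from PIO docs.
    """
    last = root + (num_iotasks - 1) * stride
    if last >= num_tasks:
        raise ValueError("I/O layout does not fit in the number of tasks")
    return [root + i * stride for i in range(num_iotasks)]


# Settings reported for the failing run.
ranks = io_ranks(num_tasks=2048, num_iotasks=128, stride=8)
print(ranks[:3], ranks[-1])
```

With these settings the first three I/O ranks come out as 1, 9, and 17, matching the ranks that reported the write error.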