Error message on failed runs

serbinsh commented 7 years ago

Summary of Issue:

I am testing FATES in PEcAn on modex (thanks @rgknox !!) and most runs finish fine. However, for some (which I suspect is because it doesn't like some combination of the params) I get this error during execution of the code (not setup):

-----------------------------------

NODE#  NAME
(    0)  node01
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Variable not found
 NetCDF: Variable not found
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Invalid dimension ID or name
 NetCDF: Variable not found
 NetCDF: Variable not found
 NetCDF: Numeric conversion not representable
 pio_support::pio_die:: myrank=          -1 : ERROR: pionfwrite_mod::write_nfdarray_double:         249 : NetCDF: Numeric conversion not representable
MPI_Abort: error code = 1

As for the NetCDF "Variable not found" message, I seem to get that every time, but it doesn't seem to cause any errors.

It may be that I need to look at some logs with @rgknox in case it is a machine config issue, but I have had successful runs, generally over shorter run times. Here I am trying a 1901 to 2004 run.

Would anyone happen to know what this error might be related to? Again, it could be that there are some checks that happen in FATES and that the param inputs I am providing within the sensitivity analysis are outside some bounds or cause some error.

Expected behavior and actual behavior:

NA

Steps to reproduce the problem (should include create_newcase or create_test command along with any user_nl or xml changes):

NA

What is the changeset ID of the code, and the machine you are using:

Machine: modex.bnl.gov

Have you modified the code? If so, it must be committed and available for testing:

No

Screen output or output files showing the error message and context:

Here is the log file logfile.txt

serbinsh commented 7 years ago

And if I shouldn't be posting questions like this here please let me know....not sure if we had another place to put runtime questions.

rosiealice commented 7 years ago

Sean Swenson and I were just looking at one of these. He thinks that means there is a 'nan' somewhere. It's a big problem that PIO doesn't produce more sensible error messages in these types of cases (imho).

On 1 December 2016 at 14:07, Shawn P. Serbin notifications@github.com wrote:

And if I shouldn't be posting questions like this here please let me know....not sure if we had another place to put runtime questions.



bandre-ucar commented 7 years ago

The actual error is:

 NetCDF: Numeric conversion not representable
 pio_support::pio_die:: myrank=          -1 : ERROR: pionfwrite_mod::write_nfdarray_double:         249 : NetCDF: Numeric conversion not representable
MPI_Abort: error code = 1

It happens because you are trying to write an inf or nan to the file.

You can contact the pio developers with suggestions on how to improve the error handling.

The most efficient way I've found to debug this is to run in debug mode where floating point errors should be trapped and you'll get a core file, or you can at least run in the debugger.

rgknox commented 7 years ago

I have been bothered by these errors for a while, but they only seem to pop up in my testing when I am generating high-frequency output, for instance when I generate hourly output to evaluate diurnal patterns. Yes, I should have made an issue for this long ago, but I thought this was an edge use case.

hist_mfilt = 480
hist_nhtfrq = -1

If you are basing off of my scripts Shawn, then you may have this flag turned on as well:

hist_empty_htapes = .true.

In this case you are outputting only a small set of variables listed in "hist_fincl1". You could try removing these sequentially until you find the culprit.
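The elimination suggested above can be sketched in Python. This is only an illustration under stated assumptions: `run_fails` is a hypothetical callback standing in for "build the case with this `hist_fincl1` list, run it, and report whether it crashed"; nothing here calls the real model or any CIME tooling.

```python
from typing import Callable, List, Sequence


def find_culprits(variables: Sequence[str],
                  run_fails: Callable[[List[str]], bool]) -> List[str]:
    """Leave one variable out at a time and report every variable whose
    removal makes the (hypothetical) run succeed."""
    culprits = []
    for name in variables:
        remaining = [v for v in variables if v != name]
        if not run_fails(remaining):
            culprits.append(name)
    return culprits


if __name__ == "__main__":
    hist_fincl1 = ["EFLX_LH_TOT", "TSOI_10CM", "QVEGT", "NEP", "GPP", "AR"]

    # Stand-in for a real model run: pretend any list containing NEP fails.
    def fake_run_fails(names: List[str]) -> bool:
        return "NEP" in names

    print(find_culprits(hist_fincl1, fake_run_fails))
```

Note that leave-one-out only isolates a single bad variable. If two variables are bad at once, every trial still fails and the list comes back empty, in which case removing variables cumulatively would be the fallback.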

serbinsh commented 7 years ago

@/all Thanks this is all very helpful!

@rgknox I have modified the ref case a bit for the PEcAn testing. Here are my options:

hist_empty_htapes = .true.
hist_fincl1='EFLX_LH_TOT','TSOI_10CM','QVEGT','NEP','GPP','AR','ED_bleaf','ED_biomass','NPP_column','NPP','MAINT_RESP','GROWTH_RESP'
hist_mfilt             = 8760
hist_nhtfrq            = -1
EOF
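For reference, a small Python sketch of what these two settings imply for file size. It assumes the usual CLM convention that a negative `hist_nhtfrq` is an interval in hours, a positive value is a number of model time steps, and `hist_mfilt` is the number of time samples per history file. The 1800-second time step is an assumption and only matters for positive `hist_nhtfrq`.

```python
def hours_between_samples(hist_nhtfrq: int, dtime_seconds: int = 1800) -> float:
    """Interval between history samples, in hours."""
    if hist_nhtfrq == 0:
        # 0 requests monthly averages, which have no fixed length in hours.
        raise ValueError("hist_nhtfrq = 0 means monthly output")
    if hist_nhtfrq < 0:
        return float(-hist_nhtfrq)
    return hist_nhtfrq * dtime_seconds / 3600.0


def days_per_file(hist_mfilt: int, hist_nhtfrq: int,
                  dtime_seconds: int = 1800) -> float:
    """Length of model time covered by one history file, in days."""
    return hist_mfilt * hours_between_samples(hist_nhtfrq, dtime_seconds) / 24.0


if __name__ == "__main__":
    print(days_per_file(480, -1))    # settings quoted earlier in the thread
    print(days_per_file(8760, -1))   # settings used for the PEcAn runs
```

Under these assumptions, hourly output with `hist_mfilt = 8760` puts one 365-day year in each file, and `hist_mfilt = 480` puts 20 days in each file.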

I could try removing each output in turn until I find the culprit for the failed runs, if that would be helpful.

I can also try the debugging angle to see if I can get any useful log output.

bandre-ucar commented 7 years ago

Once you narrow it down to a specific variable, the next step is to check to see how the variable is initialized.

If it's initialized to spval or zero, then your parameter combinations are generating the floating point problem.
If it's initialized to nan, then there was an expectation that all the values would be valid by the time of the history write, so it is probably a logic bug....
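The rule of thumb above can be written down as a tiny Python helper. This is a sketch only: the value `1.0e36` for spval is an assumption about the model's special-value constant, and the returned strings simply restate the two cases described in the comment.

```python
import math

SPVAL = 1.0e36  # assumed special "missing" value; check the model source


def likely_cause(initial_value: float, spval: float = SPVAL) -> str:
    """Map how a history variable is initialized to the likely kind of bug."""
    if math.isnan(initial_value):
        # The code expected every value to be overwritten before the write.
        return "logic bug: value never set before the history write"
    if initial_value == 0.0 or initial_value == spval:
        # The bad value was computed, so the inputs are the suspects.
        return "floating point problem from the parameter combination"
    return "undetermined"


if __name__ == "__main__":
    for init in (float("nan"), 0.0, SPVAL):
        print(init, "->", likely_cause(init))
```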

rgknox commented 7 years ago

My money is on NEP


serbinsh commented 7 years ago

@rgknox I think you are right about NEP. When I re-ran a run that failed without NEP in the history the simulation ran past the troublesome year (1903, using CRUNCEP drivers). I am trying again using PIO logging set to 1 to see if I can find out what happens with NEP. One other thing I noticed with other runs is that NEP seems to provide data for the first or second year of the simulation but then stops producing good output?

serbinsh commented 7 years ago

@bandre-ucar What level of PIO logging do I need to try and sort out what the NEP error is? I set it to 1 (from 0) and it did indeed provide a LOT of PIO logging, but the error is still unclear.

 0: def_var fh=      589824 name=fractions_lx_lfrin id=          92
 /data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in         158 _FillValue   1.0000000000000000E+030
 /data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in         158 unitsunknown
 /data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in         158 long_nameunknown
 /data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in         158 standard_nameunknown
 /data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in         158 internal_dnamefractions_lx
           0 : invoking PIO_initdecomp_dof
 piolib_mod.f90        1110 before calcstartandcount:            1           1           0          35           0
 IAM:            0  after getiostartandcount: count is:                     1                    1  lenblocks =           1  ndisp=                    1
 IAM:            0  after getiostartandcount, num_aiotasks is:            1
 PIO_initdecomp: calcdisplace                    1           1           1                    1                    1                    1                    1
 piolib_mod.f90        1180 iam:            0 initdecomp: userearranger:  T                    1
 box_rearrange.F90.in         965                    1                    1                    1                    1
 box_rearrange.F90.in         967           2           1  :                    1                    1                    1                    1
 box_rearrange::box_rearrange_create:: comp_rank=           0 : io            1  start=                    1                    1  count=                    1                    1
 piolib_mod.f90        1207                    1           1         167           0           1
 piodarray         502  NAME : IAM:            0  UseRearranger:  T                    1           0           1
 piodarray::write_darray_nf_double: IAM:            0 Before call to allocate(IOBUF):            1                    1
 piodarray::write_darray_nf_double: {comp,io}_rank:            0           0 offset:            0 len:            1
 piodarray         628 start:                    1                    1  count:                    1                    1  ndims:           2
 /data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfwrite_mod.F90.in         150           2           1           1           1           1
 pionfwrite_mod::write_nfdarray_double: 0: done writing for self           2
 NetCDF: Numeric conversion not representable
 pio_support::pio_die:: myrank=          -1 : ERROR: pionfwrite_mod::write_nfdarray_double:         249 : NetCDF: Numeric conversion not representable
MPI_Abort: error code = 1
ERROR IN MODEL RUN
Logfile is located at '/data/Model_Output/pecan.output/PEcAn_2000000418/out/2000034432/logfile.txt'

serbinsh commented 7 years ago

It is crashing at this step in the land log:

 hist_htapes_wrapup : Writing current time sample to local history file ./case.clm2.h0.1903-01-01-00000.nc at nstep =        40562  for history time interval beginning at    845.00000000000000       and ending at    845.04166666666663
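The `nstep` in that log line can be turned into a model date with a little arithmetic. The Python sketch below assumes an 1800-second time step (an inference: it reproduces the logged interval end of 845.0416... days), a 365-day no-leap calendar, and a run starting at 1901-01-01 00:00. None of these are stated in the log itself.

```python
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # no-leap


def elapsed_days(nstep: int, dtime_seconds: int = 1800) -> float:
    """Model time since the start of the run, in days."""
    return nstep * dtime_seconds / 86400.0


def nstep_to_date(nstep: int, dtime_seconds: int = 1800,
                  start_year: int = 1901) -> tuple:
    """Return (year, month, day, hour) for a step count on a no-leap calendar."""
    days, seconds = divmod(nstep * dtime_seconds, 86400)
    year = start_year + days // 365
    day_of_year = days % 365  # zero-based day within the year
    month = 0
    while day_of_year >= DAYS_IN_MONTH[month]:
        day_of_year -= DAYS_IN_MONTH[month]
        month += 1
    return year, month + 1, day_of_year + 1, seconds // 3600


if __name__ == "__main__":
    print(elapsed_days(40562))
    print(nstep_to_date(40562))
```

Under these assumptions, step 40562 falls on 26 April 1903 at 01:00, so the failing write is roughly 115 days into the 1903 history file rather than at its start.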

Here is the extensive log file: https://dl.dropboxusercontent.com/u/12774655/job.log (larger than 10 MB)

Here is the history file it was writing: https://dl.dropboxusercontent.com/u/12774655/case.clm2.h0.1903-01-01-00000.nc

ckoven commented 7 years ago

@serbinsh if it is crashing due to nans in NEE, then it's most likely an issue happening on the heterotrophic respiration side of things. So it would be helpful to output the FATES -> BGC carbon fluxes, as well as the BGC pools (litter, soil carbon) and HR flux. This might help in tracking down where the nan is originating.

bandre-ucar commented 7 years ago

Your error is NetCDF: Numeric conversion not representable and you determined it is NEP. This means your numerics are blowing up and generating a value that can't be represented by netcdf. Additional PIO logging and the various standard logs generally don't help with debugging at this point. The best advice I can give you is turn on debug mode to trap floating point errors. If that doesn't help, then you have to resort to the debugger and/or adding print statements to the logs to manually trace the evolution of NEP.
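When tracing a variable by hand, the first thing to locate is the earliest sample that is no longer finite. A minimal Python sketch follows; the sample values are made up for illustration and do not come from the run discussed here.

```python
import math
from typing import Optional, Sequence


def first_non_finite(series: Sequence[float]) -> Optional[int]:
    """Index of the first nan or inf in a time series, or None if all finite."""
    for index, value in enumerate(series):
        if not math.isfinite(value):
            return index
    return None


if __name__ == "__main__":
    # Hypothetical NEP samples: growing values, then a blow-up.
    nep_samples = [0.1, 0.4, 12.0, 3.0e8, float("inf"), float("nan")]
    bad = first_non_finite(nep_samples)
    print("first bad sample index:", bad)
    if bad:
        print("last good value:", nep_samples[bad - 1])
```

The values just before the first bad index are usually the interesting ones, since they show whether the variable grew steadily or jumped in a single step.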

serbinsh commented 7 years ago

@bandre-ucar @ckoven Thanks for the feedback!

@bandre-ucar OK, understood. Let me see if I can do some more debugging.

@ckoven Will do. If I don't already have those outputs set I will enable all of the BGC fluxes and pools

ckoven commented 7 years ago

@serbinsh OK, great. It's probably a good idea to turn off NEP output too, since you already know it is getting triggered. Since it is only a diagnostic variable, the question is which of the prognostic variables that go into it is bad.
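To illustrate why a bad input surfaces in the diagnostic, here is a short Python sketch. It assumes the textbook definition NEP = GPP - AR - HR; the model's own diagnostic may include additional terms, and the numbers are invented.

```python
import math
from typing import Dict, List


def nep(gpp: float, ar: float, hr: float) -> float:
    """Textbook net ecosystem production (an assumption, not the model code)."""
    return gpp - ar - hr


def bad_inputs(fluxes: Dict[str, float]) -> List[str]:
    """Names of the fluxes that are nan or inf."""
    return [name for name, value in fluxes.items() if not math.isfinite(value)]


if __name__ == "__main__":
    fluxes = {"GPP": 2.0, "AR": 0.5, "HR": float("nan")}  # hypothetical values
    print("NEP:", nep(fluxes["GPP"], fluxes["AR"], fluxes["HR"]))
    print("non-finite inputs:", bad_inputs(fluxes))
```

A single nan in any term makes the whole sum nan, so the diagnostic fails even though only one of the fluxes feeding it is at fault.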

serbinsh commented 7 years ago

@ckoven Ok so I re-ran an older run that failed with NEP turned on. I turned on the other outputs you suggested in December:

FATES_c_to_litr_lab_c FATES_c_to_litr_cel_c FATES_c_to_litr_lig_c TOTLITC TOTSOMC TOTLITC_1m TOTSOMC_1m HR

I get the same error....suggesting (I think) that one of these variables is the culprit. This of course was a run that failed before with NEP but with NEP off it runs to completion.

@rgknox

serbinsh commented 7 years ago

@rgknox I will try making a new branch locally and bringing in the changes in PR #174 to see if that fixes the NEP issues. So far, I was able to run successfully with HR on, so I don't think it was that var.
