Closed serbinsh closed 7 years ago
And if I shouldn't be posting questions like this here, please let me know... not sure if we have another place to put runtime questions.
Sean Swenson and I were just looking at one of these. He thinks that means there is a 'nan' somewhere. It's a big problem that PIO doesn't produce more sensible error messages in these types of cases (imho).
Dr Rosie A. Fisher, Staff Scientist, Terrestrial Sciences Section, Climate and Global Dynamics, National Center for Atmospheric Research
The actual error is:
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 249 : NetCDF: Numeric conversion not representable
MPI_Abort: error code = 1
It happens because you are trying to write an inf or nan to the file.
You can contact the pio developers with suggestions on how to improve the error handling.
The most efficient way I've found to debug this is to run in debug mode where floating point errors should be trapped and you'll get a core file, or you can at least run in the debugger.
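To make the failure mode concrete, here is a minimal sketch (pure numpy, not the actual PIO/netCDF code path) of the check that is effectively failing: a double-precision value that is nan, inf, or beyond the range of the file variable's type cannot be converted, and netCDF reports it as "not representable".

```python
import numpy as np

def representable(values, dtype=np.float32):
    """Return a boolean mask: True where each double-precision value
    could be safely converted to `dtype` (finite and within range)."""
    arr = np.asarray(values, dtype=np.float64)
    limit = np.finfo(dtype).max
    return np.isfinite(arr) & (np.abs(arr) <= limit)

# A nan or inf in the write buffer is exactly what triggers the PIO abort;
# an over-range double (1e300 into a float32 variable) fails the same way.
mask = representable([1.0, float("nan"), float("inf"), 1e300])
# -> [True, False, False, False]
```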
I have been bothered by these errors for a while, but they only seem to pop up in my testing when I am generating high-frequency output, for instance hourly output to evaluate diurnal patterns. Yes, I should have made an issue for this long ago, but I thought this was an edge use case.
hist_mfilt = 480
hist_nhtfrq = -1
If you are basing off of my scripts Shawn, then you may have this flag turned on as well:
hist_empty_htapes = .true.
In this case you are outputting only the small set of variables listed in "hist_fincl1". You could try removing these sequentially until you find the culprit.
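If the variable list is long, bisecting it beats removing variables one at a time. A hypothetical sketch (the `run_succeeds` callback is a stand-in for rebuilding the namelist and submitting the case, and it assumes exactly one offending variable):

```python
def find_bad_variable(variables, run_succeeds):
    """Bisect a hist_fincl1-style list to isolate one offending variable.
    `run_succeeds(subset)` should rebuild the namelist with `subset` and
    return True if the run completes. Assumes a single culprit."""
    candidates = list(variables)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if run_succeeds(half):
            # this half is clean, so the culprit is in the other half
            candidates = candidates[len(candidates) // 2 :]
        else:
            candidates = half
    return candidates[0]

# Toy check: pretend 'NEP' is the bad variable.
fincl1 = ["EFLX_LH_TOT", "TSOI_10CM", "QVEGT", "NEP", "GPP", "AR"]
bad = find_bad_variable(fincl1, lambda subset: "NEP" not in subset)
# -> "NEP" after log2(N) runs instead of N
```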
@/all Thanks this is all very helpful!
@rgknox I have modified the ref case a bit for the PEcAn testing. Here are my options:
hist_empty_htapes = .true.
hist_fincl1='EFLX_LH_TOT','TSOI_10CM','QVEGT','NEP','GPP','AR','ED_bleaf','ED_biomass','NPP_column','NPP','MAINT_RESP','GROWTH_RESP'
hist_mfilt = 8760
hist_nhtfrq = -1
I could try removing each output one at a time for the failed runs until I find the culprit, if that would be helpful.
I can also try the debugging angle to see if I can get any useful log output.
Once you narrow it down to a specific variable, the next step is to check to see how the variable is initialized.
If it's initialized to spval or zero, then your parameter combinations are generating the floating point problem.
If it's initialized to nan, then there was an expectation that all the values would be valid by the time of the history write, so it is probably a logic bug....
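As a concrete illustration of that triage (a numpy sketch, not model code; CLM's special value is spval = 1.e36 in clm_varcon, and the labels below just paraphrase the two cases above):

```python
import numpy as np

SPVAL = 1.0e36  # CLM's "special value" for cells left deliberately unset

def classify(value):
    """Triage a suspicious history-buffer value per the reasoning above."""
    if np.isnan(value):
        return "nan: value was expected to be filled before the history write -> likely a logic bug"
    if value == SPVAL or value == 0.0:
        return "spval/zero: the parameter combination likely drove a floating point problem"
    return "ordinary value"

label = classify(float("nan"))
```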
My money is on NEP
@rgknox I think you are right about NEP. When I re-ran a run that failed without NEP in the history the simulation ran past the troublesome year (1903, using CRUNCEP drivers). I am trying again using PIO logging set to 1 to see if I can find out what happens with NEP. One other thing I noticed with other runs is that NEP seems to provide data for the first or second year of the simulation but then stops producing good output?
@bandre-ucar What level of PIO logging do I need to try and sort out what the NEP error is? I set to 1 (from 0) and it did indeed provide a LOT of PIO logging but the error is still unclear.
0: def_var fh= 589824 name=fractions_lx_lfrin id= 92
/data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in 158 _FillValue 1.0000000000000000E+030
/data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in 158 unitsunknown
/data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in 158 long_nameunknown
/data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in 158 standard_nameunknown
/data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfatt_mod.F90.in 158 internal_dnamefractions_lx
0 : invoking PIO_initdecomp_dof
piolib_mod.f90 1110 before calcstartandcount: 1 1 0 35 0
IAM: 0 after getiostartandcount: count is: 1 1 lenblocks = 1 ndisp= 1
IAM: 0 after getiostartandcount, num_aiotasks is: 1
PIO_initdecomp: calcdisplace 1 1 1 1 1 1 1
piolib_mod.f90 1180 iam: 0 initdecomp: userearranger: T 1
box_rearrange.F90.in 965 1 1 1 1
box_rearrange.F90.in 967 2 1 : 1 1 1 1
box_rearrange::box_rearrange_create:: comp_rank= 0 : io 1 start= 1 1 count= 1 1
piolib_mod.f90 1207 1 1 167 0 1
piodarray 502 NAME : IAM: 0 UseRearranger: T 1 0 1
piodarray::write_darray_nf_double: IAM: 0 Before call to allocate(IOBUF): 1 1
piodarray::write_darray_nf_double: {comp,io}_rank: 0 0 offset: 0 len: 1
piodarray 628 start: 1 1 count: 1 1 ndims: 2
/data/software/src/FATES/ed-clm/cime/externals/pio1/pio/pionfwrite_mod.F90.in 150 2 1 1 1 1
pionfwrite_mod::write_nfdarray_double: 0: done writing for self 2
NetCDF: Numeric conversion not representable
pio_support::pio_die:: myrank= -1 : ERROR: pionfwrite_mod::write_nfdarray_double: 249 : NetCDF: Numeric conversion not representable
MPI_Abort: error code = 1
ERROR IN MODEL RUN
Logfile is located at '/data/Model_Output/pecan.output/PEcAn_2000000418/out/2000034432/logfile.txt'
Crashing at this step in the land log
hist_htapes_wrapup : Writing current time sample to local history file ./case.clm2.h0.1903-01-01-00000.nc at nstep = 40562 for history time interval beginning at 845.00000000000000 and ending at 845.04166666666663
Here is the extensive log file: https://dl.dropboxusercontent.com/u/12774655/job.log (larger than 10mbs)
Here is the history file it was writing: https://dl.dropboxusercontent.com/u/12774655/case.clm2.h0.1903-01-01-00000.nc
@serbinsh if it is crashing due to nans in NEE, then it's most likely an issue happening on the heterotrophic respiration side of things. So it would be helpful to output the FATES -> BGC carbon fluxes, as well as the BGC pools (litter, soil carbon) and the HR flux. This might help in tracking down where the nan is originating.
Your error is "NetCDF: Numeric conversion not representable", and you determined the variable is NEP. This means your numerics are blowing up and generating a value that can't be represented by NetCDF. Additional PIO logging and the various standard logs generally don't help with debugging at this point. The best advice I can give you is to turn on debug mode to trap floating point errors. If that doesn't help, then you have to resort to the debugger and/or adding print statements to manually trace the evolution of NEP.
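For what "trap floating point errors" buys you, here is the same idea shown with numpy's error state rather than a compiler debug build (in the model this would be the debug-mode equivalent, e.g. FPE-trapping compiler flags): the invalid operation raises at the line that creates the nan, instead of the nan surfacing much later at the history write.

```python
import numpy as np

a = np.array([1.0])
b = np.array([0.0])

# Default behavior: 0/0 silently produces nan, which only blows up
# later when something tries to write or convert it.
with np.errstate(invalid="ignore", divide="ignore"):
    silent = (a * 0.0) / b  # -> array([nan]), no complaint

# Trapped: the same operation raises immediately at the offending line.
trapped = False
try:
    with np.errstate(invalid="raise", divide="raise"):
        (a * 0.0) / b
except FloatingPointError:
    trapped = True
```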
@bandre-ucar @ckoven Thanks for the feedback!
@bandre-ucar OK, understood. Let me see if I can do some more debugging.
@ckoven Will do. If I don't already have those outputs set I will enable all of the BGC fluxes and pools
@serbinsh ok great. probably a good idea to turn off NEP output too since you already know it is getting triggered, but since it is only a diagnostic variable, the question is which of the prognostic variables that go into it is bad.
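To illustrate why a diagnostic like NEP is a good tripwire but not the root cause: it is computed from the prognostic fluxes (roughly NEP = GPP - AR - HR; sign conventions and the exact FATES formula may differ), so a nan in any one input poisons the diagnostic.

```python
import math

def nep(gpp, ar, hr):
    """Rough net ecosystem production diagnostic from prognostic fluxes.
    (Illustrative only; the exact FATES formula may differ.)"""
    return gpp - ar - hr

healthy = nep(10.0, 4.0, 3.0)            # -> 3.0
poisoned = nep(10.0, 4.0, float("nan"))  # -> nan: here HR is the real culprit
```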
@ckoven Ok so I re-ran an older run that failed with NEP turned on. I turned on the other outputs you suggested in December:
FATES_c_to_litr_lab_c FATES_c_to_litr_cel_c FATES_c_to_litr_lig_c TOTLITC TOTSOMC TOTLITC_1m TOTSOMC_1m HR
I get the same error....suggesting (I think) that one of these variables is the culprit. This of course was a run that failed before with NEP but with NEP off it runs to completion.
@rgknox I will try making a new branch locally and bringing in the changes in PR #174 and see if that fixes the NEP issues. So far, I was able to run successfully with HR on, so I don't think it was that var.
Summary of Issue:
I am testing FATES in PEcAn on modex (thanks @rgknox !!) and most runs finish fine. However, for some (which I suspect is because it doesn't like some combo of the params) I get this error during execution of the code (not setup):
As for the NetCDF "Variable not found" message, I seem to get that every time, but it doesn't seem to cause any errors.
It may be that I need to look at some logs with @rgknox in case it is a machine config issue, but I have had successful runs, though generally over shorter run times. Here I am trying a 1901 to 2004 run.
Would anyone happen to know what this error might be related to? Again, it could be that there are some checks that happen in FATES and that the param inputs I am providing within the sensitivity analysis are outside some bounds or cause some error.
Expected behavior and actual behavior:
NA
Steps to reproduce the problem (should include create_newcase or create_test command along with any user_nl or xml changes):
NA
What is the changeset ID of the code, and the machine you are using:
Machine: modex.bnl.gov
Have you modified the code? If so, it must be committed and available for testing:
No
Screen output or output files showing the error message and context:
Here is the log file logfile.txt