NOAA-PMEL / Ferret

The Ferret program from NOAA/PMEL
https://ferret.pmel.noaa.gov/Ferret/
The Unlicense
55 stars 21 forks source link

SAVE writes wrong variable, after CANCEL VARIABLE/NOERROR of nonexistent variable #1957

Closed AndrewWittenberg closed 4 years ago

AndrewWittenberg commented 4 years ago

For several months I've occasionally seen strange Ferret behavior, that may be related to the fragile memory management issues reported in #1734.

The problems appear in complex scripts that define lots of variables. But the behavior is often hard to reproduce. I finally found a repeatable example on 8 October, but it was so complex (and I was so busy) that it took me until today (14 November) to whittle it down into a manageable bug report. The attached script is here:

index.csh.26487.jnl.gz

It started as many thousands of lines (with numerous nested "go" calls to my hierarchy of tools) which I have now simplified to a self-contained 261 lines. The gyrations in this script will make no sense out of context, with some lines doing nothing at all -- so just view it as a way to demo the bug. I can't simplify it further, since removing almost any line (including ones that shouldn't matter) will eliminate the misbehavior.

The sst.nc input file is too big to attach to this report, so I've placed a gzipped version on our FTP:

ftp://ftp.gfdl.noaa.gov/pub/Andrew.Wittenberg/EXPORT/ferret/sst.nc.gz

Download and gunzip the above file, and then run index.csh.26487.jnl in the same directory as sst.nc:

    NOAA/PMEL TMAP
    FERRET v7.503 (optimized)
    Linux 2.6.32-696.30.1.el6.x86_64 64-bit - 06/27/19
    14-Nov-19 21:29     

yes? go index.csh.26487.jnl 
   in default dataset sst
   APPEND_VAR = APP_V_IN
     APP_V_IN = 1
   DUMMY ??         <== UNKNOWN VARIABLE ******

So far so good: APPEND_VAR exists, and DUMMY does not. But when the script tries to save APPEND_VAR, the nonexistent DUMMY is written instead (!):

 LISTing to file out_file.nc
netcdf out_file {
variables:
    double DUMMY ;
        DUMMY:missing_value = -1.e+34 ;
        DUMMY:_FillValue = -1.e+34 ;
        DUMMY:long_name = "1" ;

// global attributes:
        :history = "FERRET V7.503 (optimized) 14-Nov-19" ;
        :Conventions = "CF-1.6" ;
}

Commenting out line 112, and uncommenting line 113, simply swaps the variables cancelled by CANCEL VARIABLE/NOERROR. This makes the bug disappear -- which is what makes me guess that it might have something to do with cancelling the nonexistent variable NONEXIST_VAR before the variable DUMMY:

    NOAA/PMEL TMAP
    FERRET v7.503 (optimized)
    Linux 2.6.32-696.30.1.el6.x86_64 64-bit - 06/27/19
    14-Nov-19 21:30     

yes? go index.csh.26487.jnl 
   in default dataset sst
   APPEND_VAR = APP_V_IN
     APP_V_IN = 1
   DUMMY ??         <== UNKNOWN VARIABLE ******
 LISTing to file out_file.nc
netcdf out_file {
variables:
    double APPEND_VAR ;
        APPEND_VAR:missing_value = -1.e+34 ;
        APPEND_VAR:_FillValue = -1.e+34 ;
        APPEND_VAR:long_name = "scale" ;

// global attributes:
        :history = "FERRET V7.503 (optimized) 14-Nov-19" ;
        :Conventions = "CF-1.6" ;
}

Could there be a bug in updating the stack of variables, or the variable IDs or pointers, when CANCEL VARIABLE/NOERROR is used on a nonexistent variable?

karlmsmith commented 4 years ago

I am seeing this same behavior with the latest Ferret code built in debug mode. So hopefully a session in the debugger will figure out what is happening here.

AnsleyManke commented 4 years ago

The behavior also occurs with a file created with "use coads_climatology; save/file=sst.nc sst"

And it occurs with all the older versions of Ferret I have sitting around back to the time when we introduced the commands such as SHOW VAR/TREE and CANCEL VAR/NOERR for getting and using variable-definition information info to use in scripts, (b6.9 through v6.96). This means it's not in the memory-management enhancements (v7.2), or dynamic coordinate handling (v7.3), which were my first thoughts for a bug that comes and goes with small changes to the scripts.

AnsleyManke commented 4 years ago

I think I have the answer here, or part of it. The script example makes a bunch of definitions and at one point has this, canceling a list of variables including one which is undefined followed by another which does exist.

...
let dummy = 1

! switching the order eliminates the bug!
can var/noerr nonexist_var dummy
...

If we're doing CANCEL VAR with a list of variables, and we're going through that list, if some variable is undefined, then that's an error and everything stops.

If it's CANCEL VAR/NOERROR with a list, then if some variable is undefined, we skip over the calls to handle the cancel-a-variable. When looking for the variable to be deleted, a setting is made to a flag related to what dataset the variable being deleted "belongs to". (It might be a LET/D= variable associated with a dataset, or just a general user-variable.) In this example, in the course of finding out that the first variable is is undefined that flag is set such a way that the next variable DUMMY is not removed from one of the internal structures of user-defined variables and so it's thought to exist later. So this is a matter of correctly re-setting conditions after a nonexistent variable so that the CANCEL VARIABLE works correctly for any remaining variables that are defined .

This is a simple code change and does make the example work correctly, but doesn't really explain why this is hard to reproduce. Just the relevant commands would be,

use coads_climatology
let dummy = 1
let other_var = 0

cancel var/noerror not_var,dummy
show variable/tree other_var, dummy
save/clobber/file=out_file.nc other_var

but - with any of the recent executables - the file is correct and contains other_var, not dummy.

I'm looking further into the details, but at the end of the week we can at least make a test executable with this particular change.

AnsleyManke commented 4 years ago

I've found what I believe is the cause of the variability of the appearance of this bug. The need to reinitialize the dataset-flag in xeq_cancel.F is indeed the fix.

And the second of the calls to PARSE_NAM_DSET in the loop that deletes the list of variables in that routine was sometimes resetting the dataset to the default dataset. This means that it depends what else has been going on in the session could influence whether the incorrect setting ever got made. When the call is CANCEL VAR (not CANCEL VAR/D) the second call is never needed, so don't call it in that instance.

I have checked in these changes and will go ahead and close this ticket, pending further testing at GFDL.