NOAA-PMEL / Ferret

The Ferret program from NOAA/PMEL
https://ferret.pmel.noaa.gov/Ferret/
The Unlicense

corrupt data when defining a very large number of variables #1757

Open karlmsmith opened 6 years ago

karlmsmith commented 6 years ago

Reported by @AnsleyManke on 11 Nov 2016 01:31 UTC

Ned Cokelet ran into this. He's reading a file, reshaping the data to a new axis, defining several masks and a regridding operation, all for a dataset with 79 variables.

His script winds up defining over 1400 variables, and somewhere along the line the internal context or memory-resident information is being corrupted; the wrong data is loaded into memory.

Ned and I have reworked the script so it doesn't have all the variables defined and loaded at once. However, if a limit is being reached, Ferret should detect that and report an error.

I will attach the set of scripts and the dataset that they read. Run the script "run.jnl" to run the example.

Migrated-From: http://dunkel.pmel.noaa.gov/trac/ferret/ticket/2485

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 17 Nov 2016 19:33 UTC

Unable to see the error in detail, but speculating ... Ferret memory consists of a large contiguous array ("memory") managed as a doubly linked list of blocks. The likely cause of the bug reported here is that somehow the management of blocks in the large memory array gets corrupted, resulting in a block that was "owned" by one variable getting overwritten by another variable. Likely the linked lists got corrupted. Debugging will not be fun, as the error shows up only after the lists have become very tangled and complex.
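
For illustration, here is a minimal C sketch of the layout being described (the actual implementation is Fortran, and every name and size below is invented for this sketch):

#include <stdio.h>

#define MEM_WORDS 4096   /* size of the one contiguous array (illustrative) */
#define NBLOCKS     16   /* number of block descriptors (illustrative)      */

static double memory[MEM_WORDS];    /* the single large "memory" array       */
static int blk_start[NBLOCKS + 1];  /* first word of each block              */
static int blk_len  [NBLOCKS + 1];  /* number of words in each block         */
static int blk_owner[NBLOCKS + 1];  /* variable owning the block, 0 if free  */
static int blk_flink[NBLOCKS + 1];  /* forward link to the next block        */
static int blk_blink[NBLOCKS + 1];  /* backward link to the previous block   */

int main(void)
{
    /* start with one free block covering all of memory, on a circular
     * doubly linked list headed by descriptor 0 */
    blk_start[1] = 0;  blk_len[1] = MEM_WORDS;  blk_owner[1] = 0;
    blk_flink[0] = 1;  blk_blink[1] = 0;
    blk_flink[1] = 0;  blk_blink[0] = 1;

    memory[blk_start[1]] = 0.0;     /* data lives in the big array itself */

    /* if a flink/blink pair is corrupted, two descriptors can claim
     * overlapping [start, start+len) ranges, and data written for one
     * variable silently overwrites another's -- the symptom seen here */
    printf("block 1: words %d..%d, owner %d\n",
           blk_start[1], blk_start[1] + blk_len[1] - 1, blk_owner[1]);
    return 0;
}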

If trac ticket #2369 gets implemented, the entire machinery of memory-block management will disappear. So if we are going to move forward on that ticket, it may make sense to postpone investigating this misbehaving script until afterwards.

karlmsmith commented 6 years ago

Comment by @AnsleyManke on 17 Nov 2016 20:43 UTC

If I turn on "MODE DIAGNOSTIC", it looks more as if the mr list is corrupted. At some point, on loading one computed variable, an unrelated variable is being read:

yes? set mode diag
yes? go run
... 
...

save/file=($2)/outtype=float/append ir_thermo_temperature_filtered
 !-> LIST/FORMAT=CDF/file=Saildrone_sd-128_data.nc/outtype=float/append ir_thermo_temperature_filtered

...  many lines of diagnostic output ...

 reading WIN-S_IN M:307 dset:   1 I:    1 1309  J: -999 -999  K: -999 -999  L: -999 -999  M:    1    1  N: -999 -999

which is one of several WIND_SPEED variables and has nothing to do with computing ir_thermo_temperature_filtered. So it seems to me that the mr pointer info has gotten corrupted, and not the contents of memory that's been read in.

However I agree that it's well worth pursuing the new memory management first before delving into this any more.

karlmsmith commented 6 years ago

Comment by @karlmsmith on 17 Nov 2016 21:46 UTC

This is starting to remind me of #2462. The diagnostic output now looks like (from PyFerret):

yes? set mode diag
yes? use coads_climatology
yes? let a = 0; let b = 0; let c = 0
yes? can var c
yes? can var /noerr c b
yes? let b = `a`
 getgrid EX#1     C:  5 dset:   1 I:    1    1  J:    1    1  K:    1    1  L:    1    1  M:    1    1  N:    1    1
 getgrid A        C:  7 dset:   1 I:    1    1  J:    1    1  K:    1    1  L:    1    1  M:    1    1  N:    1    1
 eval    EX#1     C:  4 dset:   1 I: -999 -999  J: -999 -999  K: -999 -999  L: -999 -999  M: -999 -999  N: -999 -999
 eval    A        C:  6 dset:   0 I: -999 -999  J: -999 -999  K: -999 -999  L: -999 -999  M: -999 -999  N: -999 -999
 constan cnst     M:  2 dset:   0 I: -999 -999  J: -999 -999  K: -999 -999  L: -999 -999  M: -999 -999  N: -999 -999
 -DELETE cnst     M:  2 dset:   0 I: -999 -999  J: -999 -999  K: -999 -999  L: -999 -999  M: -999 -999  N: -999 -999
 -DELETE A        M:  2 dset:   0 I: -999 -999  J: -999 -999  K: -999 -999  L: -999 -999  M: -999 -999  N: -999 -999
 !-> DEFINE VARIABLE b = 0
yes? let a = 0
 dealloc  dynamic grid ABSTRACT        ABSTRACT  ABSTRACT  ABSTRACT  ABSTRACT  ABSTRACT  ABSTRACT
 -DELETE A        M:  1 dset:   0 I: -999 -999  J: -999 -999  K: -999 -999  L: -999 -999  M: -999 -999  N: -999 -999
 dealloc  dynamic grid ABSTRACT        ABSTRACT  ABSTRACT  ABSTRACT  ABSTRACT  ABSTRACT  ABSTRACT

(The "let a = 0" used to cause corruption errors. But the M: 2 lines use to be M: 1, and the M: 1 line used to be M: 2.) I am not sure how to interpret all this diagnostic info. I just wanted to be sure this is now looking correct. Or possibly this is a simple example to follow through with the debugger to make sure everything is going down the correct paths.

When I looked at this example when it was giving the error, I got the impression that variable locations were getting messed up. The opening of coads_climatology was an essential part of causing the error because of different memory-management (compacting?) behaviour for dataset variables and user-defined variables - and it appeared a user-defined variable was getting treated as a dataset variable.

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 7 Dec 2016 21:44 UTC

This diagnostic output doesn't necessarily indicate a problem. The M value is the memory slot occupied by a particular instantiated subset (region) of a variable (a so-called "memory variable", "mvar", or "mr"). When the script deletes the definition of variable A, it frees all mvars connected to A. When A gets defined anew, it starts from a blank slate, getting mvars assigned to it as it needs them.

Mvars are managed by the doubly linked list formed by the variables mv_flink and mv_blink ("memory variable forward/back link"). Similarly, mblk_flink and mblk_blink manage the "memory blocks". These are the likely candidates for where the corruption is occurring.

I propose to take the following approach to debugging this -- inviting your comments before I get started. The challenge is that the error we are tracking down materializes only after a complex script has been run; by this point the linked lists (if they are indeed the cause) have become awfully tangled for a human sleuth to follow. I propose to write a new routine SANITY_CHECK_LINKS that will check the links and output a message if a problem is found. When running in MODE DIAGNOSTIC, this routine will get called before exiting any of the significant memory-management routines -- get_memory, free_memory, etc. The goal is to see a report of the corruption at the point where it first occurs. Then, using TotalView, my hope is to be able to track down the cause of the corruption more easily.
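
A minimal C sketch of what such a check could do (the real routine would be Fortran; the circular, index-based list with slot 0 as the header is an assumption modeled on the mv_flink/mblk_flink arrays above):

#include <stdio.h>

#define NSLOTS 8                 /* illustrative list capacity */

static int flink[NSLOTS + 1];    /* forward links; index 0 is the header */
static int blink[NSLOTS + 1];    /* backward links */

/* return 0 if the list is consistent, nonzero at the first corrupt node */
static int sanity_check_links(void)
{
    int i = 0, steps = 0;
    do {
        int next = flink[i];
        if (next < 0 || next > NSLOTS) {
            printf("link out of range at node %d -> %d\n", i, next);
            return 1;
        }
        if (blink[next] != i) {   /* back link must mirror forward link */
            printf("blink(%d)=%d, expected %d\n", next, blink[next], i);
            return 1;
        }
        i = next;
        if (++steps > NSLOTS + 1) {   /* more nodes than slots: a cycle */
            printf("list does not return to header\n");
            return 1;
        }
    } while (i != 0);
    return 0;
}

int main(void)
{
    /* build a consistent 3-node circular list: 0 -> 2 -> 5 -> 7 -> 0 */
    int order[] = {0, 2, 5, 7, 0};
    for (int k = 0; k < 4; k++) {
        flink[order[k]] = order[k + 1];
        blink[order[k + 1]] = order[k];
    }
    printf("before corruption: %d\n", sanity_check_links());  /* -> 0 */
    flink[5] = 3;                 /* simulate a corrupted forward link */
    printf("after corruption:  %d\n", sanity_check_links());  /* -> 1 */
    return 0;
}

Calling such a check on exit from every significant memory-management routine, as proposed, would pinpoint the first moment the links go bad rather than the (much later) moment wrong data appears.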

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 9 Dec 2016 19:30 UTC

Progress report: I am now using the new routine SANITY_CHECK_MEMORY as a tool (in diagnostic mode only) to detect when memory corruption occurs.

In the course of early testing I discovered that MR_NOT_IN_USE is being called duplicatively in some cases -- e.g. in IS_RECONCILE and then again in UNPROTECT_CMND_DATA. In a complex calculation, where the same component data is reserved for use in two separate upcoming parts of the calculation and memory capacity is hard-pressed, duplicated calls like this could place a component onto the deletion chain (where it becomes vulnerable to automated deletion if more memory is needed) while it is still needed for a future part of the calculation. If this occurred, the data that the "mr" pointed to could be overwritten by other data. This matches the symptoms uncovered in Ned's script.
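
A minimal C sketch of the double-release hazard being described (illustrative only; the use count, flag, and function body are invented, not the Fortran source):

#include <stdio.h>

typedef struct {
    int use_count;        /* pending uses that still need this slot    */
    int on_delete_chain;  /* 1 once the slot is eligible for recycling */
} mr_slot;

/* one step of the calculation is done with the slot; at zero uses,
 * the slot moves onto the deletion chain */
static void mr_not_in_use(mr_slot *mr)
{
    if (mr->use_count > 0)
        mr->use_count--;
    if (mr->use_count == 0)
        mr->on_delete_chain = 1;   /* fair game if memory gets tight */
}

int main(void)
{
    mr_slot a = { 2, 0 };          /* still needed by two upcoming steps */

    mr_not_in_use(&a);             /* step 1 finished with it: fine      */
    mr_not_in_use(&a);             /* duplicate call for the SAME step   */
    printf("on delete chain: %d (but step 2 hasn't run yet)\n",
           a.on_delete_chain);     /* -> 1: premature, data may vanish   */
    return 0;
}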

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 3 Jan 2017 17:54 UTC

I am embarrassed to admit that I spent over 6 hours working to track down this bug and still failed. So it goes ...

I was looking for a corruption of memory -- either the mr table (memory-resident variables, a.k.a. "mr's") or the mblk array (the chunk of memory allocated to each memory-resident variable). I wrote a new routine SANITY_CHECK_MEMORY to perform internal checking. It is now a built-in part of Ferret (invoked whenever CREATE_MEM_VAR is called, if the code is compiled for debugging and MODE DIAGNOSTIC is on). But, bottom line: no memory corruption of this type seems to be occurring. I was looking under the wrong rock. So where else to look?

What I observed is that the error shows up only when the mr table is completely full. At this point variables get deleted from the cache (chosen from the delete-priority chain, mr_del_flink) whenever a slot is needed for a new variable to be created. In Ned's script this situation is not reached until after the first (uncorrupted) plot has been generated. To see this in TotalView, put a breakpoint in FIND_MR_SLOT right after "100 IF ( mv_flink( 0 ) .EQ. 0 ) THEN". This IF tests whether a free mr slot is available.
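
A minimal C sketch of that slot-allocation logic (only the quoted IF test comes from the Fortran; the list layout and everything else here is invented):

#include <stdio.h>

#define NSLOTS 4

static int mv_flink[NSLOTS + 1];      /* free-slot list; header at index 0    */
static int mr_del_flink[NSLOTS + 1];  /* deletion-priority chain; header at 0 */

/* pop the head of an index-based list; 0 means the list is empty */
static int pop_head(int *flink)
{
    int slot = flink[0];
    if (slot != 0)
        flink[0] = flink[slot];
    return slot;
}

/* prefer a free slot; once the table is full, evict the head of the
 * deletion chain -- the cached variable living there gets deleted */
static int find_mr_slot(void)
{
    int slot = pop_head(mv_flink);
    if (slot == 0)                      /* mv_flink(0) == 0: no free slot */
        slot = pop_head(mr_del_flink);
    return slot;
}

int main(void)
{
    mv_flink[0] = 1;  mv_flink[1] = 0;          /* one free slot: 1     */
    mr_del_flink[0] = 2;  mr_del_flink[2] = 3;  /* deletable: 2, then 3 */
    mr_del_flink[3] = 0;

    printf("slot %d (from free list)\n", find_mr_slot());          /* -> 1 */
    printf("slot %d (evicted from delete chain)\n", find_mr_slot()); /* -> 2 */
    return 0;
}

An mr that reaches the deletion chain too early (as in the duplicated MR_NOT_IN_USE calls above) is exactly what this eviction would recycle while it is still needed.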

My belief is that the problem is occurring because the automatic deletion is (incorrectly) deleting an mr that has yet to be used. For example, given LET v=a+b, Ferret will compute and cache mr's a and b, in order to use them later to compute v. If for some reason a were (incorrectly) deleted before v had used it, v would be computed using whatever (incorrect) numerical values were found in the new variable occupying a's mr slot.

To successfully debug this problem I believe we need to reproduce it in a simpler example. The corruption is showing up in the variable PAR_AIR_FILTERED. The dependency tree of this variable is frightening!

yes? vtree par_air_filtered
   in default dataset Saildrone_sd-128_payloads_20160523_0000UTC_to_20160524_0000UTC_60_s_intervals
   PAR_AIR_FILTERED = PAR_AIR_FILTERED_RESHAPE_FILL[GT=T_MIN_EQ_AX]
      PAR_AIR_FILTERED_RESHAPE_FILL = IFV ( PAR_AIR_FILTERED_RESHAPE_CD_NR_MASK AND PAR_AIR_FILTERED_RESHAPE_CDS2 LE GAP_FILL ) THEN PAR_AIR_FILTERED_RESHAPE_NEAR ELSE PAR_AIR_FILTERED_RESHAPE_INTRP
       PAR_AIR_FILTERED_RESHAPE_CD_NR_MASK = IF ( PAR_AIR_FILTERED_RESHAPE_CDS_NRTST EQ -999 ) THEN 1
         PAR_AIR_FILTERED_RESHAPE_CDS_NRTST = MISSING( PAR_AIR_FILTERED_RESHAPE_CDS, -999)
           PAR_AIR_FILTERED_RESHAPE_CDS = PAR_AIR_FILTERED_RESHAPE_CDA + PAR_AIR_FILTERED_RESHAPE_CDB
             PAR_AIR_FILTERED_RESHAPE_CDA = PAR_AIR_FILTERED_RESHAPE[T=@CDA]
               PAR_AIR_FILTERED_RESHAPE = RESHAPE( PAR_AIR_FILTERED_IN, T[GT=INST_T_AX] )
                 (C001,V***) = RESHAPE( WIND_SPEED_RMS_IN, T[GT=INST_T_AX] )
                   (C029,V***) = T[GT=INST_T_AX]
             PAR_AIR_FILTERED_RESHAPE_CDB = PAR_AIR_FILTERED_RESHAPE[T=@CDB]
               PAR_AIR_FILTERED_RESHAPE =  (defined above)
        PAR_AIR_FILTERED_RESHAPE_CDS2 = ( PAR_AIR_FILTERED_RESHAPE_CDA_NRTST + PAR_AIR_FILTERED_RESHAPE_CDB_NRTST ) * PAR_AIR_FILTERED_RESHAPE_CD_NR_MASK
         PAR_AIR_FILTERED_RESHAPE_CDA_NRTST = MISSING( PAR_AIR_FILTERED_RESHAPE_CDA, 0.0)
           PAR_AIR_FILTERED_RESHAPE_CDA =  (defined above)
         PAR_AIR_FILTERED_RESHAPE_CDB_NRTST = MISSING( PAR_AIR_FILTERED_RESHAPE_CDB, 0.0)
           PAR_AIR_FILTERED_RESHAPE_CDB =  (defined above)
         PAR_AIR_FILTERED_RESHAPE_CD_NR_MASK =  (defined above)
       GAP_FILL = 1200
       PAR_AIR_FILTERED_RESHAPE_NEAR = PAR_AIR_FILTERED_RESHAPE_CD_NR_MASK*PAR_AIR_FILTERED_RESHAPE_NR
         PAR_AIR_FILTERED_RESHAPE_CD_NR_MASK =  (defined above)
         PAR_AIR_FILTERED_RESHAPE_NR = PAR_AIR_FILTERED_RESHAPE[T=@FNR]
           PAR_AIR_FILTERED_RESHAPE =  (defined above)
       PAR_AIR_FILTERED_RESHAPE_INTRP = PAR_AIR_FILTERED_RESHAPE_CDS_MASK*PAR_AIR_FILTERED_RESHAPE_LIN
         PAR_AIR_FILTERED_RESHAPE_CDS_MASK = IF ( PAR_AIR_FILTERED_RESHAPE_CDS LE GAP_FILL ) THEN 1
           PAR_AIR_FILTERED_RESHAPE_CDS =  (defined above)
           GAP_FILL =  (defined above)
         PAR_AIR_FILTERED_RESHAPE_LIN = PAR_AIR_FILTERED_RESHAPE[T=@FLN]
           PAR_AIR_FILTERED_RESHAPE =  (defined above)

Correctly managing the cache dependencies is the responsibility of the Ferret variable mr_protected, which counts the number of descendant variables that depend upon a given mr (or may hold a special flag value such as mr_permanent, which indicates other caching behavior).
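
A minimal C sketch of this protection-count bookkeeping (illustrative; mr_permanent is named above, but its value and the function bodies here are invented):

#include <stdio.h>

#define MR_PERMANENT -99     /* special flag: never auto-delete (value invented) */

static int mr_protected[10]; /* per-slot count of dependents not yet satisfied */

static void protect(int mr)   { if (mr_protected[mr] != MR_PERMANENT) mr_protected[mr]++; }
static void unprotect(int mr) { if (mr_protected[mr] != MR_PERMANENT) mr_protected[mr]--; }
static int  deletable(int mr) { return mr_protected[mr] == 0; }

int main(void)
{
    int a = 3, b = 4;        /* slot numbers for the components of v = a + b */

    protect(a); protect(b);  /* reserved while v is being evaluated          */
    printf("a deletable mid-computation? %d\n", deletable(a));  /* 0: safe   */
    unprotect(a); unprotect(b);  /* v has now consumed both components       */
    printf("a deletable afterwards?      %d\n", deletable(a));  /* 1         */
    return 0;
}

If an unprotect ever runs early or twice, the component becomes deletable while v still needs it, which is the failure mode hypothesized above.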

The problem in the cache logic may be either:

1. a fundamental flaw that has never been observed, because the automatic cache deletions seldom happen; or
2. a subtle flaw having to do with the RESHAPE function or the pseudo-variable "T" that is an argument to it.

I created a simple script that fills the mr table with mr's and then hammers the heck out of the cached variables in new calculations. It is designed so that a wrong answer is immediately visible. No matter what complexity I added, though, I could not replicate the problem that Ned's script is triggering. Here is the test script, should anyone want to add more to it:

! test_auto_cache_delete.jnl  [1=make new files] [# of files]

! 1/2017 - try to force Ferret into a complex juggling of dependencies
! and cache maintenance in order to reproduce the problem seen in trac
! ticket 2485 (from Ned Cokelet)

define symbol nsets ($2"100")
define axis/x=1:100:1 x100

if ($1"0") then
!**************************************
 define axis/x=1:10:1 x10
 define axis/y=1:10:1 y10
 let shape = 0*(X[gx=x10]+Y[gy=y10])
 let one      = 1 + shape
 let ten      = 10 + shape
 let hundred  = 100 + shape
 let thousand = 1000 + shape
 let tenthou  = 10000 + shape
 let hundthou = 100000 + shape
 let million  = 1000000 + shape

 REPEAT/k=1:($nsets)  \
 save/file=test_auto_`k`.nc/clobber one, ten, hundred, thousand, tenthou, hundthou, million

 cancel var/all
!**************************************
endif

REPEAT/k=1:($nsets)  USE test_auto_`k`.nc

let  r_one      = RESHAPE(one,X[gx=x100])
let  r_ten      = RESHAPE(ten,X[gx=x100])
let  r_hundred  = RESHAPE(hundred,X[gx=x100])
let  r_thousand = RESHAPE(thousand,X[gx=x100])
let  r_tenthou  = RESHAPE(tenthou,X[gx=x100])
let  r_hundthou = RESHAPE(hundthou,X[gx=x100])
let  r_million  = RESHAPE(million,X[gx=x100])

let one_2      = r_one
let ten_2      = r_ten
let hundred_2  = r_hundred 
let thousand_2 = r_thousand
let tenthou_2  = r_tenthou 
let hundthou_2 = r_hundthou 
let million_2  = r_million

let one_3      = r_one
let ten_3      = r_ten
let hundred_3  = r_hundred 
let thousand_3 = r_thousand
let tenthou_3  = r_tenthou 
let hundthou_3 = r_hundthou 
let million_3  = r_million

let result_31 = \
one_3 + ten_3 + hundred_3 + thousand_3 + tenthou_3 + hundthou_3 + million_3

let result_21 = \
one_2 + ten_2 + hundred_2 + thousand_2 + tenthou_2 + hundthou_2 + million_2

let result_32 = \
one_3 + ten_3 + hundred_3 + thousand_3 + tenthou_3 + hundthou_3 + million_3

let result_22 = \
one_2 + ten_2 + hundred_2 + thousand_2 + tenthou_2 + hundthou_2 + million_2

let result = result_31 + result_21 + result_32 + result_22 + \
r_one + r_ten + r_hundred + r_thousand + r_tenthou + r_hundthou + r_million

sp rm -f test_auto.out
SET MODE DIAG
REPEAT/k=1:($nsets)  LOAD result[d=test_auto_`k`.nc]
REPEAT/k=1:($nsets)  list/nohead/i=50/file=test_auto.out/append result[d=test_auto_`k`.nc]

vtree result

karlmsmith commented 6 years ago

Attachment from @AnsleyManke on 16 Nov 2016 01:41 UTC

Ensuring we have the example scripts that fail: ned2.tar.gz

karlmsmith commented 6 years ago

In investigating #1886, I see that the function HASH_NAME(name, npiles) is called. (There is also a TM_HASH_NAME that is basically the same.) The value returned is used as an index into an array, and if there is a valid value at that location, the variables are assumed to be the same. This is fine if the array is large relative to the number of variables, since the chance of collision is then very small. But with support for more variables, if the array size is not increased, the chance of collision grows. I am suspicious that a collision here is the source of the incorrect variable being used.
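
A minimal C sketch of the pigeonhole argument (the hash formula below is invented, not the one in HASH_NAME or TM_HASH_NAME; the pile count and names are arbitrary):

#include <stdio.h>

/* toy string hash reduced to a pile index; NOT Ferret's actual formula */
static int hash_name(const char *name, int npiles)
{
    unsigned h = 0;
    for (; *name; name++)
        h = h * 31u + (unsigned char)*name;
    return (int)(h % (unsigned)npiles);
}

int main(void)
{
    char a[16], b[16];
    int npiles = 100;   /* small pile table ...                       */
    int nvars  = 200;   /* ... vs. more names: a collision is certain */
    int found  = 0;

    for (int i = 1; i <= nvars && !found; i++)
        for (int j = i + 1; j <= nvars && !found; j++) {
            sprintf(a, "VAR_%d", i);
            sprintf(b, "VAR_%d", j);
            if (hash_name(a, npiles) == hash_name(b, npiles)) {
                printf("collision: %s and %s both hash to pile %d\n",
                       a, b, hash_name(a, npiles));
                found = 1;
            }
        }
    return 0;
}

With 200 names hashed into 100 piles, at least one pair must share a pile; if the hashed index alone decides "this is the same variable", that pair gets confused -- consistent with the wrong variable being read in Ned's script.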