NOAA-PMEL / Ferret

The Ferret program from NOAA/PMEL
https://ferret.pmel.noaa.gov/Ferret/
The Unlicense

Support a responsive Ctrl-C interrupt when reading dataset #1291

Open karlmsmith opened 6 years ago

karlmsmith commented 6 years ago

Reported by @AndrewWittenberg on 26 Dec 2012 22:07 UTC

Ctrl-C works as advertised and stops the data I/O -- until I try to issue the next command:

    NOAA/PMEL TMAP
    FERRET v6.84  
    Linux 2.6.32-279.14.1.el6.x86_64 64-bit - 12/18/12
    26-Dec-12 17:02     

    yes? use "/net2/atw/archive/CM2.1U_Control-1860_D4/pp/atmos/ts/monthly/100yr/atmos.t_surf.nc"
     *** NOTE: Evenly spaced axis has edges definition: lon - ignored
     *** NOTE: Units on axis "nv" are not recognized: none
     *** NOTE: They will not be convertible:
    yes? shade/y=0 t_surf
    ^C **TMAP ERR: 
                 Reading variable t_surf, interrupted from command line
                 Data set: /net2/atw/archive/CM2.1U_Control-1860_D4/pp/atmos/ts/monthly/100yr/atmos.t_surf.nc
    ** INTERRUPTED! **
    yes? shade/y=0 t_surf
    ferret_v6.84: posixio.c:265: px_pgin: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
    Abort

Migrated-From: http://dunkel.pmel.noaa.gov/trac/ferret/ticket/2019

karlmsmith commented 6 years ago

Comment by @AnsleyManke on 6 Feb 2013 18:32 UTC

A local example: incorrectly ask to SHADE a 3-D variable, then hit Ctrl-C. Try the same command again and we get the abort. Karl, I'm giving this one over to you.

Note that if we let the variable load, it is then in the disk cache, so the next time the script runs it won't crash on the second "LOAD", but Ctrl-C also doesn't work.


    yes? use "/home/data/socat/SOCAT_tracks_gridded_stats_AD.nc"
    yes? sh dat
         currently SET data sets:
        1> /home/data/socat/SOCAT_tracks_gridded_stats_AD.nc  (default)
     name     title                             I         J         K         L         M         N
     COUNT_CRUISE
              Number of cruises                1:360     1:180     ...       1:480     ...       ...
     FCO2_CRUISE_AVE
              fco2 mean - per cruise weighted  1:360     1:180     ...       1:480     ...       ...
    (etc)

    yes? set mem/siz=300
     Cached data cleared from memory
    yes? shade LON_OFFSET_UNWTD
     **ERROR: variable unknown or not in data set: LON_OFFSET_UNWTD
    yes? 
    yes? shade FCO2_CRUISE_AVE
    ^C  **TMAP ERR: 
                 Reading variable FCO2_CRUISE_AVE, interrupted from command line
                 Data set: /home/data/socat/SOCAT_tracks_gridded_stats_AD.nc
    ** INTERRUPTED! **

    yes? shade FCO2_CRUISE_AVE
    ferret: posixio.c:286: px_pgin: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
    Abort

karlmsmith commented 6 years ago

Comment by @karlmsmith on 7 Nov 2013 16:41 UTC

Added code to PyFerret that ignores any further Ctrl-C entries while a Ctrl-C is being processed (in conjunction with ticket #1904). I will also make this change in the Ferret code and see where this issue stands.
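
For illustration only, the idea of ignoring further Ctrl-C entries while one is being processed can be sketched in Python with the standard `signal` module. This is not the PyFerret or Ferret code; `handle_interrupt` and `interrupt_count` are hypothetical names, and the second Ctrl-C is simulated inside the handler so the effect is visible.

```python
import signal

interrupt_count = 0  # hypothetical counter, for illustration only


def handle_interrupt(signum, frame):
    """Process one Ctrl-C; ignore any others that arrive meanwhile."""
    global interrupt_count
    # Switch SIGINT to "ignore" for the duration of the handling.
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        interrupt_count += 1
        # Simulated second Ctrl-C: it is discarded instead of nesting.
        signal.raise_signal(signal.SIGINT)
    finally:
        # Re-arm the handler so the next command can be interrupted.
        signal.signal(signal.SIGINT, previous)


signal.signal(signal.SIGINT, handle_interrupt)
signal.raise_signal(signal.SIGINT)  # simulate the user pressing Ctrl-C
print(interrupt_count)

# Put back Python's default Ctrl-C behaviour (KeyboardInterrupt).
signal.signal(signal.SIGINT, signal.default_int_handler)
```

The handler is restored in a `finally` clause so that an error during the handling cannot leave Ctrl-C permanently ignored.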

karlmsmith commented 6 years ago

Comment by @karlmsmith on 7 Nov 2013 23:51 UTC

The problem is that the netCDF file (opened by the "USE ..." command) is left in a corrupted state after the Ctrl-C interruption of the read. Ansley and I checked but did not find a netCDF equivalent of a Fortran REWIND that would clear this error. The nc_sync command appears to be strictly for writing.

I think the only way to recover from this is to close the netCDF file and then reopen it. If we were sure the netCDF ID for the file remained the same, this could be done at the point in the read-variable code where the interrupt was caught. But if the netCDF ID changes, then either the close and open need to be done further up the chain, or we need to make sure the new ID (which would be handed back, since it is Fortran calling C code) is stored in the appropriate places.

To demonstrate that this works, immediately after the Ctrl-C that interrupted reading a variable, CANCEL the dataset, then re-USE it. There are no issues with closing the dataset on the CANCEL command and, as one would expect, the new USE command and subsequent commands work fine.
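
The close-and-reopen bookkeeping described above can be sketched as follows. This is only an analogy under stated assumptions: a plain OS file descriptor stands in for the netCDF ID, the calls are Python's `os` functions rather than the netCDF API, and `open_datasets`, `use`, and `reopen` are hypothetical names, not Ferret routines. The point is that the path must be stored next to the handle, and that the stored handle must be replaced after reopening because it may differ from the old one.

```python
import os
import tempfile

# Hypothetical registry: dataset name -> (path, open handle).
# A plain file descriptor stands in for the netCDF ID.
open_datasets = {}


def use(name, path):
    """Open a file read-only and record both its path and its handle."""
    open_datasets[name] = (path, os.open(path, os.O_RDONLY))


def reopen(name):
    """Close a possibly inconsistent handle, reopen, store the new one."""
    path, old_handle = open_datasets[name]
    os.close(old_handle)
    new_handle = os.open(path, os.O_RDONLY)
    # The new handle is not guaranteed to equal the old one, so every
    # place that cached it must be updated; here that is one dict entry.
    open_datasets[name] = (path, new_handle)
    return new_handle


# Create a small scratch file to stand in for a dataset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example data")
    demo_path = tmp.name

use("demo", demo_path)
os.read(open_datasets["demo"][1], 4)  # a partial read that gets abandoned
handle = reopen("demo")
first_bytes = os.read(handle, 7)      # the fresh handle reads from offset 0
os.close(handle)
os.remove(demo_path)
print(first_bytes)
```

Keeping the path in the same record as the handle is what makes an in-place reopen possible without searching for the name elsewhere.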

karlmsmith commented 6 years ago

Comment by @karlmsmith on 8 Nov 2013 04:40 UTC

I am not seeing anything at Unidata about interrupting reads. A Google search for "interrupt netcdf read" shows one message from Unidata (the top hit). It is about reading with Java, but even there they say you have to kill the thread doing the reading (in other words, kill everything about the file and the read and restart from scratch). The second hit is Perl code for converting a NetCDF file; again it catches interrupts externally and just exits the program. The third hit is our own Ferret documentation. Nothing more with "interrupt" in it.

I realized that the fact that you can't interrupt a NetCDF read indicates they must be intentionally ignoring interrupts. Standard file reads can be interrupted. This makes me more pessimistic about them providing any support for interrupting a read.

Just on the off-chance, I tried using nc_sync to see if it would clear the erroneous state. (It turns out nc_sync can be used on read-only files to update in case the file contents changed.) But it doesn't clear the error; it crashes.

A call to NF_CLOSE does work (as expected from being able to cancel the dataset), but I am not sure what name to use with NF_OPEN (to try reopening in place with minimal impact on Ferret). There appear to be a number of NF_OPEN calls in the code, getting the name from various places. I assume there must be a place where the name to use for NF_OPEN is stored, but I am not seeing it. I do see where the NetCDF file ID is stored and retrieved, so that is not a problem.

karlmsmith commented 6 years ago

Comment by @karlmsmith on 8 Nov 2013 23:25 UTC

After much discussion, we decided to roll back the Ctrl-C catching. To the best of our knowledge, the NetCDF libraries do not support interrupts. Just closing (and possibly reopening) the NetCDF file assumes that nothing else in the library has been corrupted by this "jumping out" of a NetCDF read, and we do not know that to be the case. Instead, we need to work with the NetCDF developers to get a NetCDF solution to this problem.

BTW, I realized that I was wrong about NetCDF being set to ignore Ctrl-C. If they had code to ignore Ctrl-C, my signal catching would not work because it would have been replaced with an "ignore".
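
The reasoning here rests on a signal having only one disposition per process: installing an "ignore" replaces whatever handler was there. A minimal Python sketch of that behaviour follows, with the library's hypothetical "ignore" simulated by an explicit `signal.signal` call; `on_interrupt` and `caught` are illustrative names only, and nothing here calls NetCDF.

```python
import signal

caught = []  # records each signal that reaches our handler


def on_interrupt(signum, frame):
    """Hypothetical application handler for Ctrl-C."""
    caught.append(signum)


signal.signal(signal.SIGINT, on_interrupt)
signal.raise_signal(signal.SIGINT)   # delivered to on_interrupt
count_with_handler = len(caught)

# What a library that ignored Ctrl-C would have to do: this replaces
# on_interrupt, so later interrupts never reach it.
signal.signal(signal.SIGINT, signal.SIG_IGN)
signal.raise_signal(signal.SIGINT)   # discarded
count_after_ignore = len(caught)

# Put back Python's default Ctrl-C behaviour (KeyboardInterrupt).
signal.signal(signal.SIGINT, signal.default_int_handler)
print(count_with_handler, count_after_ignore)
```

Since the signal catching did fire during the read, the handler had evidently not been replaced in this way.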

Will reopen ticket #1976, which was the origin of this change but dealt with catching an invalid fill or shade before reading the variable. Leaving this ticket open to deal with interrupting the reading of a variable in NetCDF, but modifying the title and, with the code roll-back, setting the priority and severity back to normal.

karlmsmith commented 6 years ago

Comment by @karlmsmith on 8 Nov 2013 23:32 UTC

Created ticket #2108 instead of reopening #1976.