cedadev / jasmin_scivm


CDO crashing with grib_api using reduced Gaussian grid #73

Closed: alaniwi closed this issue 7 years ago

alaniwi commented 8 years ago

User-reported (see FP 38766).

CDO tries to free an invalid pointer when accessing GRIB files on a reduced Gaussian grid. (The user reports that files with spectral and/or full Gaussian gridded data are fine.) This problem was introduced in JAP 1.1-27.

Example:

cdo sinfo /badc/ecmwf-era-interim/data/gg/am/2012/10/29/ggam201210290600.grb

gives

*** glibc detected *** cdo: free(): invalid pointer: 0x000000002c6199e8 ***

alaniwi commented 8 years ago

Note that cdo was built before grib_api, which is the wrong order. However, rebuilding cdo (tested on jasmin-sci1-dev) has not fixed the problem. With the order corrected, the build sequence of the possibly relevant packages is:

[iwi@jasmin-sci1-dev ~]$ rpm -q  --queryformat="%{BUILDTIME} %{BUILDTIME:date} %{NAME}\n" grib_api netcdf cdo hdf5 udunits | sort -n
1449490276 Mon 07 Dec 2015 12:11:16 GMT hdf5
1460044475 Thu 07 Apr 2016 16:54:35 BST netcdf
1460066287 Thu 07 Apr 2016 22:58:07 BST udunits
1460066937 Thu 07 Apr 2016 23:08:57 BST grib_api
1462295346 Tue 03 May 2016 18:09:06 BST cdo

which should be okay(?)

alaniwi commented 8 years ago

Here is part of the traceback after disabling optimisations (and, for good measure, linking with -lefence, although this does not seem to affect the interesting part of the traceback: the same bit of application code still triggers the crash):

#0  0x0000003540c32907 in kill () from /lib64/libc.so.6
#1  0x00007fb2543b61e5 in ?? () from /usr/lib64/libefence.so.0
#2  0x00007fb2543b675d in EF_Abort () from /usr/lib64/libefence.so.0
#3  0x00007fb2543b5a13 in free () from /usr/lib64/libefence.so.0
#4  0x00000000005ea116 in memFree (ptr=0x7fb254384848, file=0x8182f7 "grid.c", 
    functionname=0x819ae0 "grid_free_components", line=179) at dmemory.c:513
#5  0x00000000005eeaeb in grid_free_components (gridptr=0x7fb254393860) at grid.c:179
#6  0x00000000005ef9fc in gridDestroyKernel (gridptr=0x7fb254393860) at grid.c:579
#7  0x00000000005efa98 in gridDestroyP (gridptr=0x7fb254393860) at grid.c:603
#8  0x00000000005fe214 in reshListDestruct (namespaceID=0) at resource_handle.c:180
#9  0x00000000005fd5fd in namespaceDelete (namespaceID=0) at namespace.c:214
#10 0x00000000005fe343 in listDestroy () at resource_handle.c:197
#11 0x0000003540c35b22 in exit () from /lib64/libc.so.6
#12 0x0000003540c1ed64 in __libc_start_main () from /lib64/libc.so.6
#13 0x00000000004073c9 in _start ()

(all these calls in the application are within the libcdi/src directory of the code)

alaniwi commented 8 years ago

The bug is not reproducible with cdo 1.7.0 (linked against the same library versions), so it was apparently introduced in cdo 1.7.1.

alaniwi commented 8 years ago

Specifically, it seems to happen while freeing the pointer gridptr->rowlon in grid_free_components. The pointer value looks plausible compared with other pointers (e.g. gridptr = 0x7fb254393860, gridptr->rowlon = 0x7fb254384848), but Electric Fence reports "address not from malloc()".

alaniwi commented 8 years ago

Another user-supplied example:

cdo -M -f grb mergetime ggam201210290600.grb ggam201210291200.grb test.grb

using input files from /badc/ecmwf-era-interim/data/gg/am/2012/10/29 (http://dap.ceda.ac.uk/data/badc/ecmwf-era-interim/data/gg/am/2012/10/29/)

This aborts at exactly the same place as the sinfo example (trying to free the gridptr->rowlon pointer), and leaves an unusable output file. (Contrast with the same command under 1.7.0, which works.)

It is possible to work around the problem using -R (convert reduced to regular Gaussian grid), in which case CDO 1.7.1 and 1.7.0 give identical output, but these output files are ~50% larger than with the reduced grid.

oembury commented 8 years ago

Note that full Gaussian grid files are usually only ~10% larger than reduced Gaussian grid files.

Also, if the command is run without -f grb, it seems to generate valid reduced Gaussian grid output, but still crashes with an invalid pointer:

cdo -M mergetime ggam201210290600.grb ggam201210291200.grb test2.grb

alaniwi commented 8 years ago

Reply from the maintainer:

"This bug will be fixed in the next CDO release. See also: https://code.zmaw.de/issues/6780 " (although I can't access the link)

alaniwi commented 7 years ago

1.7.2 is now out, so in principle this should fix it.

alaniwi commented 7 years ago

cdo 1.7.2 builds, and the originally reported bug is fixed in the test command.

[builderdev@builder SPECS]$ cdo sinfo /tmp/ggam201210290600.grb
   File format : GRIB
    -1 : Institut Source   Steptype Levels Num    Points Num Dtype : Parameter ID
     1 : ECMWF    unknown  instant      60   1     88838   1  P16  : 133.128       
     2 : ECMWF    unknown  instant      60   1     88838   1  P16  : 203.128       
     3 : ECMWF    unknown  instant      60   1     88838   1  P16  : 246.128       
     4 : ECMWF    unknown  instant      60   1     88838   1  P16  : 247.128       
     5 : ECMWF    unknown  instant      60   1     88838   1  P8   : 248.128       
   Grid coordinates :
     1 : gaussian reduced         : points=88838  nlat=256  np=128
                              lat : 89.46282 to -89.46282 degrees_north
   Vertical coordinates :
     1 : hybrid                   : levels=60
                              lev : 1 to 60 by 1 level
                        available : vct
   Time coordinate :  1 step
     RefTime =  2012-10-29 06:00:00  Units = hours  Calendar = proleptic_gregorian
  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss  YYYY-MM-DD hh:mm:ss
  2012-10-29 06:00:00
cdo sinfo: Processed 5 variables over 1 timestep ( 0.02s )