Alexander-Barth / NCDatasets.jl

Load and create NetCDF files in Julia
MIT License

Segmentation fault reading file with NCDatasets v0.12.6 #187

Closed sjdaines closed 2 years ago

sjdaines commented 2 years ago

Describe the bug

Julia exits with segmentation fault when attempting to read from a netcdf file, apparently at random.

To Reproduce

No issues seen with NCDatasets v0.12.5 (with NetCDF_jll v400.702.400+0, using Julia v1.7), or with earlier versions (going back approximately one year).

Occurs with NCDatasets v0.12.6 (with NetCDF_jll v400.902.5+0), using either Julia v1.7 or v1.8

This happens while running an application that repeatedly opens and closes two NetCDF files. It fails seemingly at random while opening either file, after roughly 10 successful open/read/close cycles, and doesn't seem to be associated with opening or reading any particular field or file.

Apologies, this isn't an example or dataset I can share. The code is of the form:

NCDatasets.Dataset(netcdf_filename) do ds
        prepare_data(ds)
end

and it looks like the failure is when opening the netcdf file (see stacktrace below)

julia> Pkg.test("NCDatasets") passes all tests.

Environment

Full output

signal (11): Segmentation fault
in expression starting at /data/sd336/runtests.jl:11
posixio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
ncio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC3_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/netcdf_c.jl:267
unknown function (ip: 0x7f91e3a5fcf9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:203
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:239
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/gV1QR/src/dataset.jl:239 [inlined]
prepare_do_force_grid at /data/sd336/software/julia/depot/packages/PALEOboxes/iA0AD/src/reactioncatalog/GridForcings.jl:125

...etc...

Alexander-Barth commented 2 years ago

As a test, does the error persist if you comment out this line (assuming that you do not need OPeNDAP over HTTPS support)?

https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/NCDatasets.jl#L31

If the error is still present without the call to init_certificate_authority(), can you provide a minimal reproducible example? I don't need the whole data set or the complete function prepare_data, just a minimal one, possibly with random data, that still exhibits the segfault.

sjdaines commented 2 years ago

The error is still there if I comment out init_certificate_authority(). This is using Julia 1.7.3.

Here's a cut down (although probably not minimal) code example that still fails, although less frequently than the full app. The two netcdf files here report as classic using ncdump -k (I'll see if I can reproduce this with files I can share):

file testnc2.jl contains:

import NCDatasets

dataarrays = []

niter = 1

netcdf_filename1 = "unshareable_classic_netcdf_1.nc"
fields1 = ["time"]
netcdf_filename2 = "unshareable_classic_netcdf_2.nc"
fields2 = ["time", "phys_ocn_v"]

@noinline function prepare_data(darrays, fields, ds)
    for f in fields
        push!(darrays, ds[f][:])
    end
end

while niter < 100
    println("niter: ", niter)

    NCDatasets.Dataset(netcdf_filename1) do ds
        prepare_data(dataarrays, fields1, ds)
    end

    NCDatasets.Dataset(netcdf_filename2) do ds
        prepare_data(dataarrays, fields2, ds)
    end

    global niter += 1
end

and then the test was:

julia> nouter = 1
julia> while true; println("nouter: ", nouter);include("testnc2.jl");global nouter += 1;end

Example stacktrace (this is the most common failure, although it can fail in different ways, see below):

...
nouter: 34
...
niter: 80

signal (11): Segmentation fault
in expression starting at /data/sd336/PALEOdev.jl/PALEOexamples/testnc2.jl:18
posixio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
ncio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC3_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/NCDatasets.jl/src/netcdf_c.jl:267
unknown function (ip: 0x7fd81440e259)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/NCDatasets.jl/src/dataset.jl:203
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:215
...

sjdaines commented 2 years ago

A couple of examples of less common failures (with init_certificate_authority() commented out, using Julia 1.7.3):

With a similar, but not identical, test script:

nouter: 12
niter: 1
niter: 2
niter: 3
niter: 4
niter: 5
niter: 6
niter: 7
niter: 8
niter: 9
niter: 10
niter: 11
*** Error in `julia': double free or corruption (out): 0x000000005bc3eb60 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7ff2b16797f5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8038a)[0x7ff2b168238a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7ff2b168658c]
/data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so(free_NC+0x30)[0x7ff243acbb69]
/data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so(NC_open+0x4a3)[0x7ff243abacdd]
/data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so(nc_open+0x3f)[0x7ff243ab9cc0]
[0x7ff24560fd23]
[0x7ff24560fefa]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
[0x7ff24560d257]
[0x7ff245625b1d]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0xd71c9)[0x7ff2b08521c9]
[0x7ff245607a4a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
[0x7ff2456253f3]
[0x7ff24562583d]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x1039ea)[0x7ff2b087e9ea]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x102b35)[0x7ff2b087db35]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_toplevel_eval_in+0xaa)[0x7ff2b087f77a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x11b79eb)[0x7ff29cb149eb]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x11fa16b)[0x7ff29cb5716b]
[0x7ff2456065ac]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x1039ea)[0x7ff2b087e9ea]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x102b35)[0x7ff2b087db35]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_toplevel_eval_in+0xaa)[0x7ff2b087f77a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xf4f1b3)[0x7ff29c8ac1b3]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xf4f9d5)[0x7ff29c8ac9d5]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x824d3d)[0x7ff29c181d3d]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x839692)[0x7ff29c196692]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x83994c)[0x7ff29c19694c]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x8e61ab)[0x7ff29c2431ab]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x8e624c)[0x7ff29c24324c]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_f__call_latest+0x47)[0x7ff2b0850647]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x12ef06f)[0x7ff29cc4c06f]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0x12fa59d)[0x7ff29cc5759d]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xd97068)[0x7ff29c6f4068]
/data/biogeochemdata/software/julia-1.7.3/lib/julia/sys.so(+0xd971d9)[0x7ff29c6f41d9]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_apply_generic+0x1fa)[0x7ff2b083fe7a]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(+0x128426)[0x7ff2b08a3426]
/data/biogeochemdata/software/julia-1.7.3/bin/../lib/julia/libjulia-internal.so.1(jl_repl_entrypoint+0x8d)[0x7ff2b08a3dcd]
julia(main+0x9)[0x4007d9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7ff2b1622840]
julia[0x400809]
======= Memory map: ========
...

With the full app:

signal (11): Segmentation fault
in expression starting at /data/sd336/runtests.jl:11
strlen at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
processuri at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_infermodel at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/NCDatasets.jl/src/netcdf_c.jl:267
unknown function (ip: 0x7f5c93d44cc9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/NCDatasets.jl/src/dataset.jl:203
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/NCDatasets.jl/src/dataset.jl:239
unknown function (ip: 0x7f5c93d4c939)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:239
#CartesianGrid#29 at /data/sd336/software/julia/depot/packages/PALEOboxes/iA0AD/src/Grids.jl:597
CartesianGrid at /data/sd336/software/julia/depot/packages/PALEOboxes/iA0AD/src/Grids.jl:591 [inlined]

sjdaines commented 2 years ago

Also fails with a minimal netcdf file coords.nc (attached), although much less frequently, and in a different place. This file reports as netCDF-4 with ncdump -k

Test script is modified with:

netcdf_filename1 = "coords.nc"
fields1 = ["latitude"] 
netcdf_filename2 = "coords.nc"
fields2 = ["latitude", "longitude"]

Example stack trace of failure:

...
nouter: 4241
...
niter: 99
...

signal (11): Segmentation fault
in expression starting at /data/sd336/testnc3.jl:18
strlen at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
ncindexadd at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc4_att_list_add at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
att_read_callbk at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
H5A__attr_iterate_table at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5O_attr_iterate_real at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5O__attr_iterate at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5A__iterate_common at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5A__iterate at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5VL__native_attr_specific at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5VL__attr_specific.isra.0 at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5VL_attr_specific at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
H5Aiterate2 at /data/sd336/software/julia/depot/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)
nc4_read_atts at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc4_hdf5_find_grp_var_att at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC4_HDF5_inq_var_all at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_inq_var at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_inq_varname at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_inq_varname at /data/sd336/NCDatasets.jl/src/netcdf_c.jl:1515
listVar at /data/sd336/NCDatasets.jl/src/variable.jl:12
keys at /data/sd336/NCDatasets.jl/src/dataset.jl:258 [inlined]
initboundsmap! at /data/sd336/NCDatasets.jl/src/dataset.jl:80
NCDataset#1 at /data/sd336/NCDatasets.jl/src/types.jl:109
NCDataset at /data/sd336/NCDatasets.jl/src/types.jl:90 [inlined]
#NCDataset#12 at /data/sd336/NCDatasets.jl/src/dataset.jl:227
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/NCDatasets.jl/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]

coords.zip

Alexander-Barth commented 2 years ago

Can you also test whether the issue is present in NCDatasets v0.12.6 with NetCDF_jll v400.702.400+0, by forcing that version with ]add NetCDF_jll@400.702.400? Also, is it necessary to have both the outer and inner loop, or can you just have one long inner loop?
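
For reference, the same downgrade can be scripted with the Pkg API instead of the REPL's package mode. This is only a sketch, assuming it is run in the environment where NCDatasets is installed; the pin step is an optional addition, not something requested above.

```julia
import Pkg

# Install the older build of the NetCDF C library binaries
# (equivalent to `]add NetCDF_jll@400.702.400`).
Pkg.add(name="NetCDF_jll", version="400.702.400")

# Optional (an assumption, not part of the request above): pin the package
# so that a later Pkg.update() does not silently upgrade it during testing.
Pkg.pin("NetCDF_jll")

# Show which NetCDF_jll version the environment now resolves to.
Pkg.status("NetCDF_jll")
```

Restart Julia after changing the version so that the newly selected libnetcdf is the one actually loaded.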

A smaller reproducer:

import NCDatasets

netcdf_filename1 = "coords.nc"                                                                                                                                        

total = 0.                                                                                                                                                            
niter = 0                                                                                                                                                             
tmp = zeros(Float32,90)                                                                                                                                               

while true                                                                                                                                                            
    global total, niter                                                                                                                                               
    (niter % 1000 == 0) && println("niter: ", niter)                                                                                                                  

    NCDatasets.Dataset(netcdf_filename1) do ds                                                                                                                        
        varid = 0                                                                                                                                                     
        NCDatasets.nc_get_var!(ds.ncid,varid,tmp)                                                                                                                     
        total += sum(tmp)                                                                                                                                             
    end                                                                                                                                                               

    niter += 1                                                                                                                                                        
end                                                                                                                                                                   

crashes with:

niter: 922000                                                                                                                                                         

signal (11): Segmentation fault
in expression starting at /mnt/data1/abarth/.julia/dev/NCDatasets/test/test_segfault3.jl:9                                                                            
unknown function (ip: 0x7f1fee67f507)                                                                                                                                 
ncindexadd at /home/abarth/.julia/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)                                                  
nc4_att_list_add at /home/abarth/.julia/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)                                            
att_read_callbk at /home/abarth/.julia/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)                                             
H5A__attr_iterate_table at /home/abarth/.julia/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)                                       
H5O_attr_iterate_real at /home/abarth/.julia/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)                                         
H5O__attr_iterate at /home/abarth/.julia/artifacts/eb2e2f806144b25e5db6cc8941eb33b9a13f58a3/lib/libhdf5.so (unknown line)        

(using NetCDF_jll v400.902.5+0)

Alexander-Barth commented 2 years ago

This is likely to be an upstream issue. https://github.com/Unidata/netcdf-c/issues/2486

sjdaines commented 2 years ago

I can reproduce the first failure above ('classic' format netcdf files, fails in nc_open) using a single publicly available test file downloaded from the Unidata website.

The 'inner' and 'outer' loop do seem to be necessary (at least to provoke a failure quickly, a test with a single loop is still running after >1500 outer iterations).

To me it looks like this is a different error and plausibly a different issue to the netCDF-4 case?

This is using: Julia 1.7.3, NetCDF_jll v400.902.5+0, NCDatasets v0.12.7

(also there is no failure after changing to NetCDF_jll v400.702.400+0 using ]add NetCDF_jll@400.702.400, at least after 1000 iterations)

File testnc6.jl contains:

import NCDatasets

dataarrays = []

niter = 1

# Test file downloaded from 
# https://www.unidata.ucar.edu/software/netcdf/examples/files.html
# ('classic' format)
# https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc
netcdf_filename2 = "/data/sd336/ECMWF_ERA-40_subset.nc"
fields2 = ["time", "tcw"]

@noinline function prepare_data(darrays, fields, ds)
    for f in fields
        push!(darrays, ds[f][:])
    end
end

while niter < 100
    println("niter: ", niter)

    NCDatasets.Dataset(netcdf_filename2) do ds
        prepare_data(dataarrays, fields2, ds)
    end

    global niter += 1
end

with test:

julia> nouter = 1
julia> while true; println("nouter: ", nouter);include("testnc6.jl");global nouter += 1;end

and stacktrace:


...
nouter: 132
...
niter: 52

signal (11): Segmentation fault
in expression starting at /data/sd336/testnc6.jl:24
posixio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
ncio_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC3_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
NC_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/artifacts/b51d396f750184183d5594c558b417883f749e07/lib/libnetcdf.so (unknown line)
nc_open at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/netcdf_c.jl:267
unknown function (ip: 0x7f91acdf93b9)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
#NCDataset#12 at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:203
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:172 [inlined]
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:172 [inlined]
#NCDataset#13 at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
NCDataset at /data/sd336/software/julia/depot/packages/NCDatasets/EkOvO/src/dataset.jl:239
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2247 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2429
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1788 [inlined]
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:126
...

Alexander-Barth commented 2 years ago

Thanks for your additional testing! There also seems to be an issue during initialization of NetCDF: https://github.com/Unidata/netcdf-c/issues/2486#issuecomment-1223701511

Hopefully this is the same problem, because this error happens right away.

Alexander-Barth commented 2 years ago

Can you test with https://github.com/Alexander-Barth/NetCDF_jll.jl/releases/tag/NetCDF-v400.902.29%2B0 ?

  1. Start with an empty environment: julia --project=some_empty_folder
  2. Install NCDatasets in development mode: ]dev NCDatasets
  3. In the development copy, comment out this line: https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/Project.toml#L19
  4. Install the new NetCDF_jll via ]add https://github.com/Alexander-Barth/NetCDF_jll.jl
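
The same steps can be sketched with the Pkg API. This assumes the folder name is a placeholder, and step 3 is still a manual edit of the Project.toml in the development checkout (by default under ~/.julia/dev/NCDatasets).

```julia
import Pkg

# 1. Activate a fresh environment ("some_empty_folder" is a placeholder name).
Pkg.activate("some_empty_folder")

# 2. Check out NCDatasets for development.
Pkg.develop("NCDatasets")

# 3. Manual step: edit the Project.toml of the development checkout and
#    comment out the line linked in step 3 above before continuing.

# 4. Install the patched NetCDF_jll directly from the fork's repository.
Pkg.add(url="https://github.com/Alexander-Barth/NetCDF_jll.jl")

# Confirm which NetCDF_jll the environment now uses.
Pkg.status("NetCDF_jll")
```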

On my end, it does not crash any more after 3000 outer iterations using this reproducer:

import NCDatasets

dataarrays = []

niter = 1

netcdf_filename1 = "coords.nc"
fields1 = ["latitude"]
netcdf_filename2 = "coords.nc"
fields2 = ["latitude", "longitude"]

@noinline function prepare_data(darrays, fields, ds)
    for f in fields
        push!(darrays, ds[f][:])
    end
end

while niter < 100
    #println("niter: ", niter)

    NCDatasets.Dataset(netcdf_filename1) do ds
        prepare_data(dataarrays, fields1, ds)
    end

    NCDatasets.Dataset(netcdf_filename2) do ds
        prepare_data(dataarrays, fields2, ds)
    end

    global niter += 1
end

Run with:

nouter = 1; while true; println("nouter: ", nouter);include("testnc2.jl");global nouter += 1;end

sjdaines commented 2 years ago

Looks good using https://github.com/Alexander-Barth/NetCDF_jll.jl/releases/tag/NetCDF-v400.902.29%2B0 !!

I've run three tests:

  1. Reproducer with 'coords.nc' netCDF-4 file as above: 8000 outer iterations (cf. failure at ~4000 outer iterations before)
  2. Reproducer with 'classic' format file downloaded from https://www.unidata.ucar.edu/software/netcdf/examples/ECMWF_ERA-40_subset.nc: 3000 outer iterations (cf. failure at ~150 outer iterations before)
  3. Our application test suite (that led to the initial report): 2 runs (cf. used to fail every time before halfway)

Alexander-Barth commented 2 years ago

Thanks a lot for this comprehensive testing! In this build I removed the NetCDF C flag -std=c99.

sjdaines commented 2 years ago

Many thanks for addressing this issue, as well as your work on NCDatasets !

(and confirm all still looks good here after updating to the latest released packages)

Alexander-Barth commented 2 years ago

Great! Thank you for testing and creating the reproducer! The new NetCDF_jll has been released. I think that a Pkg.update() should be sufficient to get it.
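
A minimal sketch of that update, assuming a default environment in which NetCDF_jll is an indirect dependency of NCDatasets:

```julia
import Pkg

# If NetCDF_jll was pinned to an older version while testing, release it
# first (uncomment the next line; it errors if the package is not pinned).
# Pkg.free("NetCDF_jll")

# Upgrade all packages in the active environment to their latest compatible versions.
Pkg.update()

# List the installed versions, including indirect dependencies,
# to check which NetCDF_jll build was picked up.
Pkg.status(mode=Pkg.PKGMODE_MANIFEST)
```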