Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Segfaults when using VLEN arrays and not closing datasets on Python 3 #261

Open shoyer opened 10 years ago

shoyer commented 10 years ago

As described here: https://github.com/Unidata/netcdf4-python/issues/218#issuecomment-43287973

The segmentation faults appear when attempting to read array values from a netCDF4.Variable with dtype=str when previous datasets were not closed.

Here is a Travis log that should, in principle, be sufficient to reproduce this. When I have time, I will try to make a simpler test case: https://travis-ci.org/shoyer/xray/jobs/25466389#L120

jdemaria commented 9 years ago

Hi,

I am hitting the same bug, reproducible with this very simple script:

issue261.py:

import netCDF4 as nc
for i in xrange(1, 33):
    print(i)
    d = nc.Dataset('issue261.nc')

with issue261.nc generated this way:

ncgen -b -k netCDF-4 issue261.cdl

issue261.cdl:

netcdf issue261 {
dimensions:
    one = 1 ;
variables:
    string v(one) ;
}

segfault trace on gdb:

gdb python issue261.py
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
(gdb) run
Starting program: python issue261.py
[Thread debugging using libthread_db enabled]
[New Thread 0x7f26661c26e0 (LWP 13946)]
[New Thread 0x41d1f950 (LWP 13949)]
[New Thread 0x42520950 (LWP 13950)]
[New Thread 0x42d21950 (LWP 13951)]
1
2
3
4
5
6
7
8
9
10

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f26661c26e0 (LWP 13946)]
0x00007f26606259e4 in H5F_addr_decode () from libhdf5.so.9
Current language: auto; currently asm
(gdb) where
#0  0x00007f26606259e4 in H5F_addr_decode () from libhdf5.so.9
#1  0x00007f26607cd00c in H5T_vlen_disk_isnull () from libhdf5.so.9
#2  0x00007f26607b57ee in H5T__conv_vlen () from libhdf5.so.9
#3  0x00007f2660736849 in H5T_convert () from libhdf5.so.9
#4  0x00007f2660609a36 in H5D_get_create_plist () from libhdf5.so.9
#5  0x00007f26605f48ea in H5Dget_create_plist () from libhdf5.so.9
#6  0x00007f26617c79ff in read_var (grp=0x2598680, datasetid=83886081, obj_name=0x7fff6e1d4cd4 "v", ndims=<value optimized out>, dim=0x0) at nc4file.c:1546
#7  0x00007f26617c8e04 in nc4_rec_read_metadata_cb (grpid=<value optimized out>, name=<value optimized out>, info=<value optimized out>, _op_data=<value optimized out>) at nc4file.c:1900
#8  0x00007f2660661ec3 in H5G_iterate_cb () from libhdf5.so.9
#9  0x00007f2660663ef7 in H5G__link_iterate_table () from libhdf5.so.9
#10 0x00007f266065a1ec in H5G__compact_iterate () from libhdf5.so.9
#11 0x00007f266066b02b in H5G__obj_iterate () from libhdf5.so.9
#12 0x00007f2660663500 in H5G_iterate () from libhdf5.so.9
#13 0x00007f266069ff0b in H5Literate () from libhdf5.so.9
#14 0x00007f26617c6fb9 in nc4_rec_read_metadata (grp=0x2598680) at nc4file.c:2096
#15 0x00007f26617c765b in NC4_open (path=0x7f2655ca6144 "issue261.nc", mode=<value optimized out>, basepe=<value optimized out>, chunksizehintp=<value optimized out>, use_parallel=<value optimized out>, mpidata=<value optimized out>, dispatch=0x7f2661a72320, nc_file=0x25a0df0) at nc4file.c:2261
#16 0x00007f2661773913 in NC_open (path=0x7f2655ca6144 "issue261.nc", cmode=4096, basepe=0, chunksizehintp=0x0, useparallel=0, mpi_info=0x0, ncidp=0x7fff6e1d584c) at dfile.c:1777
#17 0x00007f2661773bb7 in nc_open (path=0x2508930 "", mode=1847404544, ncidp=<value optimized out>) at dfile.c:589
#18 0x00007f2664b54a24 in __pyx_pw_7netCDF4_7Dataset_1__init__ (__pyx_v_self=0x7f2655205d60, __pyx_args=0x7f2655228fd0, __pyx_kwds=<value optimized out>) at netCDF4.c:22619
#19 0x00007f2665c8aebe in type_call (type=<value optimized out>, args=0x7f2655228fd0, kwds=0x0) at Objects/typeobject.c:743
#20 0x00007f2665c24ed8 in PyObject_Call (func=0x7f2664da08e0, arg=0x7f2655228fd0, kw=0x0) at Objects/abstract.c:2529
#21 0x00007f2665cd605c in PyEval_EvalFrameEx (f=0x7f266616a050, throwflag=<value optimized out>) at Python/ceval.c:4251
#22 0x00007f2665cdc6d1 in PyEval_EvalCodeEx (co=0x7f26660e24b0, globals=<value optimized out>, locals=<value optimized out>, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3265
#23 0x00007f2665cdc852 in PyEval_EvalCode (co=0x2508930, globals=0x7fff6e1d2800, locals=0x7fff6e1d27f8) at Python/ceval.c:667
#24 0x00007f2665cfd72a in PyRun_FileExFlags (fp=0x1ff13c0, filename=0x7fff6e1d80a3 "issue261.py", start=<value optimized out>, globals=0x7f2666157168, locals=0x7f2666157168, closeit=1, flags=0x7fff6e1d5d60) at Python/pythonrun.c:1371
#25 0x00007f2665cfda22 in PyRun_SimpleFileExFlags (fp=0x1ff13c0, filename=0x7fff6e1d80a3 "issue261.py", closeit=1, flags=0x7fff6e1d5d60) at Python/pythonrun.c:949
#26 0x00007f2665d134ec in Py_Main (argc=1712685216, argv=0x7fff6e1d5e78) at Modules/main.c:640
#27 0x00007f2664fff1a6 in __libc_start_main () from /lib/libc.so.6
#28 0x0000000000400679 in _start ()
(gdb)

jswhit commented 9 years ago

Using keepweakref=True when opening the Dataset eliminates the segfault for me.

import netCDF4 as nc
for i in xrange(1, 33):
    print(i)
    d = nc.Dataset('issue261.nc',keepweakref=True)

This suggests that the garbage collector is not triggering the Dataset __dealloc__ method, and that some internal data structures inside the HDF5 and/or netcdf libraries overflow when too many files are open. I guess there are two possible solutions:

1) figure out why the dataset is not going out of scope (where is the reference being kept?), fix that so the files do get closed.

2) file a netcdf bug report, since the segfaults should not happen when opening 33 files. This will require reproducing the segfault in a simple C program.

Of course, addressing both of these at the same time is probably a good idea.
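To illustrate the hypothesis in option 1, here is a minimal Python 3 sketch using only the standard library. FakeDataset and FakeVariable are hypothetical stand-ins, not netCDF4 classes, and the sketch assumes CPython's reference counting. It only shows the general mechanism: when a child object holds a strong reference back to its parent, rebinding the name does not finalize the old handle until the cycle collector runs, whereas a weak back-reference lets it be finalized right away.

```python
import gc
import weakref

closed = []  # names of stand-in handles whose finalizer has run


class FakeVariable:
    """Hypothetical child object that points back at its parent handle."""

    def __init__(self, parent, weak):
        # Strong back-reference creates a cycle; weak back-reference does not.
        self._parent = weakref.ref(parent) if weak else parent


class FakeDataset:
    """Hypothetical stand-in for a file handle; NOT the netCDF4 class."""

    def __init__(self, name, keepweakref=False):
        self.name = name
        self.variables = {"v": FakeVariable(self, keepweakref)}

    def __del__(self):
        # Stands in for "close the underlying file".
        closed.append(self.name)


def open_and_drop(count, keepweakref):
    """Rebind one name to `count` fresh handles without closing any.

    Returns (finalized before any cycle collection, finalized after one).
    """
    gc.collect()
    del closed[:]
    gc.disable()  # keep the cycle collector from running during the loop
    try:
        d = None
        for i in range(count):
            d = FakeDataset("ds%d" % i, keepweakref=keepweakref)
        finalized_early = len(closed)
    finally:
        gc.enable()
    del d
    gc.collect()
    return finalized_early, len(closed)


print(open_and_drop(5, keepweakref=False))
print(open_and_drop(5, keepweakref=True))
```

With strong back-references none of the stand-in handles is finalized inside the loop; with weak ones every handle except the one still bound to the name is finalized as soon as it is replaced. Whether this is really what keeps the real Dataset objects alive is exactly the open question above.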

jswhit commented 9 years ago

Of course, using the Python context manager also avoids the segfault (by making sure the file is closed).

import netCDF4 as nc
for i in xrange(1, 51):
    print(i)
    with nc.Dataset('issue261.nc') as f:
        print(f)

I have been unable to reproduce the problem in a simple C program (so far).
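For code that cannot use a with block, the same effect can be had with an explicit close() in a try/finally. The following is only a sketch written for Python 3: open_repeatedly and CountingHandle are hypothetical names introduced here for illustration, and the opener parameter exists so the helper can be exercised without a netCDF file. By default it falls back to netCDF4.Dataset.

```python
class CountingHandle:
    """Hypothetical stand-in handle that tracks how many are open at once."""

    live = 0
    peak = 0

    def __init__(self, path):
        self.path = path
        CountingHandle.live += 1
        CountingHandle.peak = max(CountingHandle.peak, CountingHandle.live)

    def close(self):
        CountingHandle.live -= 1


def open_repeatedly(path, count, opener=None):
    """Open `path` `count` times, closing each handle before the next open."""
    if opener is None:
        # Third-party import, done lazily so the helper also runs without it.
        import netCDF4
        opener = netCDF4.Dataset
    opened = 0
    for _ in range(count):
        ds = opener(path)
        try:
            opened += 1  # per-file work would go here
        finally:
            ds.close()  # runs even if the per-file work raises
    return opened


# Exercise the helper with the stand-in handle instead of a real file.
print(open_repeatedly("issue261.nc", 50, opener=CountingHandle))
print(CountingHandle.peak)
```

With the stand-in, at most one handle is ever open at a time, which is the property the context manager version relies on.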

jswhit commented 9 years ago

The traceback provided by @jdemaria looks similar to one discussed on the h5py list:

https://groups.google.com/forum/#!msg/h5py/3v0oBQ3SVkk/qsCwQnfTxuEJ

jdemaria commented 9 years ago

Hi, thanks for your quick answer! I understand from the h5py discussion that the source of the problem is not in the NetCDF C library but is a threading bug in h5py. Am I wrong?

jswhit commented 9 years ago

That's what it sounds like, but it happens for me even when OMP_NUM_THREADS=1. I may try recompiling hdf5 without threading enabled and see if that makes a difference.

jswhit commented 9 years ago

The segfault occurs even when hdf5 is compiled with the "threadsafe" option.

jswhit commented 9 years ago

It also occurs if the "with nogil" wrapper around the netcdf library calls is removed. So, it does not look to be a thread-related issue.