Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

memory leak when opening netCDF files which contain compound data structures #279

Status: Open. Opened by dierssen 10 years ago

dierssen commented 10 years ago

Hi guys, I have a problem. We are doing measurements, and each measurement results in a netCDF file. I am trying to read these files one by one (open and close). After reading around 2000 files (nc.Dataset(filename, 'r')), the read process gets really slow: it takes about a minute to read a file that normally takes half a second.

I debugged a lot of stuff and pinned the problem down: it only happens when my files contain compound data. The compound data seems to trigger a memory leak, which is quite large in our case. It looks somewhat related to http://netcdf-group.1586084.n2.nabble.com/Re-netcdf-4-open-close-memory-leak-td3155740.html but I am not sure it is the same problem. I do not know whether the memory leak causes the slowdown when reading a file, but I am sure the two must be related.

Anyway, I cannot attach our files, but I wrote a piece of code that reproduces the problem fairly easily (tested on several machines, with the same result everywhere). The code is below; the problem shows up after a few minutes. Can anybody help me with this issue? I know it could also be an issue in the netCDF C library.

Greetings Werner.

import netCDF4 as nc
import numpy as np
import time

# create a test file containing compound-typed variables in several groups
filename = "test.nc"
nc_file = nc.Dataset(filename, 'w')
nc_file.createDimension('nRows', 1024)
nc_file.createDimension('nColumns', 512)
for group_nr in range(10):
    nc_group = nc_file.createGroup("group{}".format(group_nr))
    complex128 = np.dtype([('real',np.float64),('imag',np.float64)])
    complex128_t = nc_group.createCompoundType(complex128,'complex128')
    for nr in range(20):
        # build a numpy structured array (with the numpy dtype) and write it into
        # a variable declared with the corresponding netCDF compound type
        cheb_data = np.ndarray(shape=(1024, 512), dtype=complex128)
        cheb_data['real'] = np.ones((1024, 512)) * 8.0
        cheb_data['imag'] = np.ones((1024, 512)) * 5.0
        var = nc_group.createVariable("data{}".format(nr), complex128_t, dimensions=('nRows', 'nColumns'))
        var[:] = cheb_data
nc_file.close()

# open and close the test file many times, timing each cycle
index = 0

for i in range(12000):
    start_time = int(round(time.time() * 1000))
    ds_root = nc.Dataset(filename, 'r')
    ds_root.close()
    stop_time = int(round(time.time() * 1000))
    index += 1
    print("{}: {}".format(index, stop_time-start_time))
jswhit commented 10 years ago

When I run your test code, I don't see any trend in the time intervals, but I do get a segfault after about 2959 iterations. It's quite possible there is a memory leak in the Python module or the netCDF C library. Compound data types have not yet seen widespread use in the netCDF community, so they are pretty lightly tested.

2958: 72
2959: 54
Traceback (most recent call last):
  File "issue279.py", line 28, in <module>
    ds_root = nc.Dataset(filename, 'r')
  File "netCDF4.pyx", line 1466, in netCDF4.Dataset.__init__ (netCDF4.c:19913)
    raise RuntimeError((<char *>nc_strerror(ierr)).decode('ascii'))
RuntimeError: NetCDF: HDF error

Memory leaks are often pretty hard to debug, so don't expect a quick fix for this one. It would be best if we could reproduce this with a simple C program and then open a ticket in netcdf-c (assuming it's a bug in the library).

dierssen commented 10 years ago

I wrote a netCDF C program and saw the same slowdown trend; however, I did not see the memory leak. The Dataset class actually does much more than just open a netCDF file, so for a realistic comparison I should do the same work in my C program that the Python library does, but unfortunately I do not have time for that. The C code is below.

I forgot to mention: we use Python 2.7.3, numpy 1.7.0 and netCDF4 1.0.8 (netcdf-c 4.3.1.1, HDF5 1.8.10).

For now, we have to switch to h5py because of this performance problem.

#include <stdlib.h>
#include <stdio.h>
#include <netcdf.h>

#define FILE_NAME "test.nc"

int main()
{
    int retval;
    int ncid;
    int grp_ncid;
    int index;
    int rh_id;

    /* matches the layout of the 'complex128' compound type written by the Python script */
    struct s1 {
      double i1;
      double i2;
    };
    /* about 8 MiB; declared static so it is not allocated on the stack */
    static struct s1 compound_data[1024][512];

    /* open the file repeatedly, read one compound variable, and close it again */
    for ( index = 0; index < 8000; index++ ) {

        retval = nc_open( FILE_NAME, NC_NOWRITE, &ncid );
        if ( retval != NC_NOERR ) {
            printf( "nc_open failed at attempt %d: %s\n", index, nc_strerror(retval) );
            return retval;
        }

        (void) printf( "attempt %d\n", index );

        retval = nc_inq_ncid( ncid, "group0", &grp_ncid );
        retval = nc_inq_varid( grp_ncid, "data0", &rh_id );
        retval = nc_get_var( grp_ncid, rh_id, &compound_data[0][0] );

        printf( "i1=%f\n", compound_data[0][0].i1 );

        retval = nc_close( ncid );
    }

    return retval;
}
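
The program above can be built by linking against the netCDF C library, e.g. something like cc repro.c -o repro -lnetcdf (the source file name here is just an example). For comparison with the h5py route mentioned above: netCDF-4 files are HDF5 files, so a minimal open/close loop over the same test file with h5py (a sketch, assuming h5py is installed) would look like this, and its timings can be compared against the netCDF4 loop:

import time
import h5py

filename = "test.nc"  # the file produced by the Python write script

for i in range(12000):
    start_time = time.time()
    f = h5py.File(filename, 'r')  # netCDF-4 files are valid HDF5 files
    f.close()
    print("{}: {:.1f} ms".format(i + 1, (time.time() - start_time) * 1000.0))

If this loop stays fast while the netCDF4 loop slows down or hits the HDF error, that would suggest the problem lies in how the compound types are inspected on open rather than in HDF5 itself.
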
jswhit commented 10 years ago

Thanks for doing this. One last request: would you mind opening a GitHub issue at Unidata/netcdf-c, including a description of the problem and your sample C code? Sorry you won't be able to use netcdf4-python for your project; I hope we can get this fixed so it will meet your needs some time in the future.

dierssen commented 10 years ago

OK, I will do that. I was still thinking about your segfault: we have pretty big machines here with 128 GB of RAM. I can imagine that you have less memory available, which would result in a crash.

jswhit commented 10 years ago

Here's a link to the netcdf-c issue:

https://github.com/Unidata/netcdf-c/issues/73

Thanks for creating that issue, Werner; hopefully one of the Unidata developers will have some insight.

jswhit commented 10 years ago

I'm still getting an 'HDF Error' after 2959 reads, although I don't see any slowdowns.

The 'HDF Error' is coming from the C side, so it's still not clear to me whether there is anything that needs to be fixed on the python side.

dierssen commented 10 years ago

Hm. We have a lot of memory, so I can imagine that your system simply runs out of memory and that my system slows down because of the high memory usage. I guess there is an issue in both the C and the Python library, but it is a tricky one, and I can imagine people don't have time for this. We are now also benchmarking our C netCDF and HDF code; I will let you know if we find something of interest. You could perhaps try to see what happens when you skip reading the compound data structs in your library, and then check whether my code snippet still gives the HDF error.
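
As a rough way to probe that suggestion from the Python side (a sketch only; the Dataset constructor already walks the compound types when the file is opened), one can open and close the file repeatedly while touching only the compound-type metadata and never the variable data:

import netCDF4 as nc

filename = "test.nc"

for i in range(5000):
    ds = nc.Dataset(filename, 'r')
    # count the compound types parsed during open; no variable data is read
    n_types = sum(len(grp.cmptypes) for grp in ds.groups.values())
    ds.close()
    if (i + 1) % 500 == 0:
        print("iteration {}: {} compound types in the file".format(i + 1, n_types))

If the HDF error still shows up, the leak is triggered by the type inspection during open itself, independent of any data reads.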


jswhit commented 10 years ago

I'm not running out of memory: ps reports less than 1 GB used when the process stops with an 'HDF error'. The error occurs even if I don't read any data from the file, just open it and close it.
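
Since the failure appears after a roughly fixed number of opens rather than at a memory threshold, one quick (Linux-specific) check is whether operating-system file descriptors are being released; this sketch counts the descriptors of the current process via /proc/self/fd inside the loop:

import os
import netCDF4 as nc

filename = "test.nc"

def open_fd_count():
    # Linux-specific: number of file descriptors currently open in this process
    return len(os.listdir('/proc/self/fd'))

for i in range(5000):
    ds = nc.Dataset(filename, 'r')
    ds.close()
    if (i + 1) % 500 == 0:
        print("iteration {}: {} open file descriptors".format(i + 1, open_fd_count()))

A flat descriptor count combined with the eventual 'HDF error' would point at resources leaking inside the HDF5/netCDF libraries (for example datatype identifiers) rather than at the operating-system level.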

jarethholt commented 9 years ago

Have there been any updates on this issue? I'm running into it (well, the same HDF5 error at least). I'm reading data from a lot of CSV files and saving to a single netCDF file, so I'm not opening and closing the file in between. Python also never takes up more than 50 MB of memory, and all processes total ~2.5 GB out of 8, so I don't think it's a memory issue. (The final variable should have total size ~160 x 1826 x 14 ~ 4 million floats ~ 250 MB, so it's actually odd that the Python process isn't taking up more memory.)

I can't post all the CSV files and I'm starting on a minimal working example, but here's my script: https://gist.github.com/jarethholt/b7dfd4f5c4980faf1374