Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Bad performance when opening netCDF files with a lot of groups. #239

Open dierssen opened 10 years ago

dierssen commented 10 years ago

Hi Jeffrey, First of all: thanks for your library! We are working on a satellite observation project and work with netCDF. We work with big data files >10GB per file.

I now have the problem that when a file contains a lot of groups, the netCDF library's performance is not good.

I built two test programs, one of which is based on the HDF5 Python library; its performance is much better:

[dierssen@machine test]$ ./groups_h5.py
File opened in write mode (0s)
Groups created (36s)
File closed (36s)
File opened in append mode (37s)
Groups created (37s)
[dierssen@machine test]$ ./groups_netcdf.py
File opened in write mode (0s)
Groups created (138s)
File closed (138s)
File opened in append mode (188s)
Groups created (203s)

I am not sure whether you already know about this issue, but is there any way I could solve it? I prefer to use the netCDF library instead of the HDF5 library.

I hope to hear from you! Below I added the source code for both implementations.

Werner. I made more test programs, and in most cases the HDF5 library was faster, but in the above case the difference was really big.

The netCDF code:

#!/usr/bin/env python
""" Create a netCDF file with 6000 groups and print some timing information."""
from __future__ import print_function
from __future__ import division

from netCDF4 import Dataset
import numpy as np
import numpy.random as rnd
import datetime as dt

def main():
    """ Main function. Called when module is run from the command line."""
    t0=dt.datetime.now()
    nc4_name = 'test_groups.nc'
    fid = Dataset(nc4_name, 'w')
    grp = fid.createGroup('BAND1')

    t1=dt.datetime.now()
    print('File opened in write mode ({}s)'.format((t1-t0).seconds))
    data = rnd.rand(1024, 512)
    for block in range(5000):
        block_name = ('ICID_99999_GROUP_%05d' % block)
        sgrp = grp.createGroup( block_name )
        sgrp.createGroup( 'GEODATA' )
        sgrp.createGroup( 'INSTRUMENT' )
        ssgrp = sgrp.createGroup( 'OBSERVATIONS' )
        sgrp.createDimension('row', size=1024)
        sgrp.createDimension('column', size=512)
        var = ssgrp.createVariable('signal', np.float32, dimensions=('row','column'))
        var[:] = data

    t1=dt.datetime.now()
    print('Groups created ({}s)'.format((t1-t0).seconds))
    fid.close()
    t1=dt.datetime.now()
    print('File closed ({}s)'.format((t1-t0).seconds))

    fid = Dataset( nc4_name, 'a' )
    t1=dt.datetime.now()
    print('File opened in append mode ({}s)'.format((t1-t0).seconds))
    grp = fid.groups['BAND1']
    for block in range(5000, 6000):
        block_name = ('ICID_99999_GROUP_%05d' % block)
        sgrp = grp.createGroup( block_name )
        sgrp.createGroup( 'GEODATA' )
        sgrp.createGroup( 'INSTRUMENT' )
        sgrp.createGroup( 'OBSERVATIONS' )

    t1=dt.datetime.now()
    print('Groups created ({}s)'.format((t1-t0).seconds))
    fid.close()

if __name__ == '__main__':
    main()

The HDF5 code:

#!/usr/bin/env python
""" Create a HDF5 file with 6000 groups and print some timing information."""
from __future__ import print_function
from __future__ import division

import h5py
#import numpy as np
import numpy.random as rnd
import datetime as dt

def main():
    """ Main function. Called when module is run from the command line."""
    t0=dt.datetime.now()

    h5_name = 'test_groups.h5'
    fid = h5py.File( h5_name, 'w' )
    grp = fid.create_group( 'BAND1' )

    t1=dt.datetime.now()
    print('File opened in write mode ({}s)'.format((t1-t0).seconds))
    data = rnd.rand(1024, 512)

    for block in range(5000):
        block_name = ('ICID_99999_GROUP_%05d' % block)
        sgrp = grp.create_group( block_name )
        sgrp.create_group( 'GEODATA' )
        sgrp.create_group( 'INSTRUMENT' )
        ssgrp = sgrp.create_group( 'OBSERVATIONS' )
        #sgrp.create_dimension('row', size=1024)
        #sgrp.create_dimension('column', size=512)
        var = ssgrp.create_dataset('signal', (1024, 512), dtype='f')
        var[:] = data

    t1=dt.datetime.now()
    print('Groups created ({}s)'.format((t1-t0).seconds))
    fid.close()
    t1=dt.datetime.now()
    print('File closed ({}s)'.format((t1-t0).seconds))

    fid = h5py.File( h5_name, 'a' )
    t1=dt.datetime.now()
    print('File opened in append mode ({}s)'.format((t1-t0).seconds))

    grp = fid['BAND1']
    for block in range(5000, 6000):
        block_name = ('ICID_99999_GROUP_%05d' % block)
        sgrp = grp.create_group( block_name )
        sgrp.create_group( 'GEODATA' )
        sgrp.create_group( 'INSTRUMENT' )
        sgrp.create_group( 'OBSERVATIONS' )

    t1=dt.datetime.now()
    print('Groups created ({}s)'.format((t1-t0).seconds))
    fid.close()

if __name__ == '__main__':
    main()
shoyer commented 10 years ago

The obvious question (to me, at least) is why do you need to make so many groups? My guess is that HDF5/NetCDF4 is not optimized for creating lots of groups, and it would be much faster to just write everything to a handful of larger arrays.

In any case, the NetCDF4 file-format is not that complicated, so it should be totally possible to write a wrapper to write the NetCDF4 compatible files via the h5py library. I think that would actually be an interesting alternative to doing everything through the NetCDF-C library.

jswhit commented 10 years ago

The idea of reimplementing a python netcdf4 interface on top of h5py directly (bypassing the netcdf c library) has occurred to me before, but I've never been sufficiently motivated to try it. I agree it would be an interesting project.

dierssen commented 10 years ago

Of course there are alternatives to using many groups (I could use more files), but 5000 groups should not give any problems. I was testing performance, since we will probably need even more groups (100 x 100 measurement angles, one group per angle). Furthermore, I stumbled on a limit of 32768 groups in the netCDF-Python library. Because of these issues, I made a netcdf4 interface based on h5py, which is faster. Still, I wanted to mention this issue. I prefer to use the netCDF library, and I guess there must be a solution to these problems. But since I do have an alternative, the priority for me is low.

jswhit commented 10 years ago

There is no limit on the number of groups imposed at the python level - I guess you mean the C lib is limited to 32768 (NC_MAX_SHORT)? In netcdf.h I see:

/** Maximum for classic library.

In the classic netCDF model there are maximum values for the number of
dimensions in the file (\ref NC_MAX_DIMS), the number of global or per
variable attributes (\ref NC_MAX_ATTRS), the number of variables in the
file (\ref NC_MAX_VARS), and the length of a name (\ref NC_MAX_NAME).

These maximums are enforced by the interface, to facilitate writing
applications and utilities. However, nothing is statically allocated to
these sizes internally.

These maximums are not used for netCDF-4/HDF5 files unless they were
created with the ::NC_CLASSIC_MODEL flag.

As a rule, NC_MAX_VAR_DIMS <= NC_MAX_DIMS. */
/**@{*/
#define NC_MAX_DIMS 1024
#define NC_MAX_ATTRS 8192
#define NC_MAX_VARS 8192
#define NC_MAX_NAME 256
#define NC_MAX_VAR_DIMS 1024 /**< max per variable dimensions */
/**@}*/

but nothing about groups. Even these limits only apply to files created with the classic data model.

dierssen commented 10 years ago

Hm, I still have to learn more about netCDF and its limits (I don't know whether I use the NC_CLASSIC_MODEL flag), but if you replace the value 5000 in my test code with 10000, you will see that the netCDF Python code crashes. This is because each main group contains 3 subgroups, so in total you get 4 x 10000 groups. I found out that the max is 32768. The HDF5 implementation does support more than 32768 groups.

shoyer commented 10 years ago

@dierssen Would you be open to open-sourcing your code? We would love to be able to use and contribute to a NetCDF4 library built on top of h5py. If it would help, I would be very happy to help you polish it up or release it.

I'm still not entirely sure why you need to create 10,000 groups instead of a 100x100x1024x512 array. The latter strikes me as much more flexible. If you have attributes that differ between groups, you could save them in 100x100 arrays.
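The array-based layout suggested above can be sketched in a few lines (an illustration using numpy only, with tiny hypothetical sizes in place of 100 x 100 x 1024 x 512; the variable names are made up for the example):

```python
import numpy as np

# Hypothetical sketch: instead of one group per measurement angle,
# store everything in a single 4-D array indexed by angle.
n_az, n_el, n_row, n_col = 4, 4, 8, 8   # tiny sizes for illustration

signal = np.zeros((n_az, n_el, n_row, n_col), dtype=np.float32)

# Writing the data for one angle is a plain slice assignment ...
signal[2, 3] = np.random.rand(n_row, n_col).astype(np.float32)

# ... which replaces a per-angle group lookup such as
# fid['BAND1/ANGLE_02_03/OBSERVATIONS/signal'][:]
obs = signal[2, 3]

# Per-angle attributes can likewise live in small 2-D arrays:
exposure_time = np.full((n_az, n_el), 1.0e-3)
```

The same pattern works unchanged with a netCDF4 variable created via `createVariable` with four dimensions, since it supports the same slice assignment.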

dierssen commented 10 years ago

@shoyer I only made a layer which writes our structures out to netCDF using h5py, but it is not a real library, so not really useful. Besides that, I would have licensing issues. And the netcdf4-python library works pretty well! Here I use a nice Python feature: a runtime-configurable base class. That way I can select between the netcdf4-python library and the h5py library without too much duplicated code.

jswhit commented 10 years ago

I've made some changes to how Groups are created to remove some of the python overhead - creating new Groups is now about 4 times faster. Can you give my fork a try and see how it works for you (https://github.com/jswhit/netcdf4-python)?

jswhit commented 10 years ago

With the current jswhit/netcdf4-python fork, running your scripts I get:

./groups_h5.py
File opened in write mode (0s)
Groups created (124s)
File closed (128s)
File opened in append mode (128s)
Groups created (128s)

./groups_netcdf.py
File opened in write mode (0s)
Groups created (164s)
File closed (167s)
File opened in append mode (219s)
Groups created (221s)

So the performance is now in the ballpark, for creation at least. For opening, there is a lot more overhead in netcdf due to the need to populate the metadata.

dierssen commented 10 years ago

Great! I will test it on Tuesday. Here in the Netherlands the Easter weekend just started, so I have to prepare a family dinner :-). By the way, do you always have to populate the metadata? Is it not possible to do it when you open a group? I often have files with a lot of groups, while I then only need data of 1 or 2 groups.

jswhit commented 10 years ago

It is possible to implement 'lazy evaluation', in which the metadata would only be retrieved when it is accessed from python. Russ talked about implementing this at the C level also in the other ticket. It's a fair bit of work though, so it won't happen anytime soon.
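The lazy-evaluation idea can be sketched in pure Python: defer reading a group's metadata until it is first accessed, then cache it. This is only an illustrative sketch, not the netcdf4-python internals; `_read_metadata` is a made-up stand-in for the expensive C-level inquiry calls:

```python
class LazyGroup:
    """Sketch of lazy metadata loading: nothing is read at open time."""

    def __init__(self, name):
        self.name = name
        self._metadata = None   # not populated when the group is opened
        self.reads = 0          # count expensive reads, for illustration

    def _read_metadata(self):
        # Stand-in for the expensive nc_inq_* calls that walk the group.
        self.reads += 1
        return {"dimensions": {"row": 1024, "column": 512}}

    @property
    def metadata(self):
        if self._metadata is None:        # first access does the real work
            self._metadata = self._read_metadata()
        return self._metadata             # later accesses hit the cache


# "Opening" 6000 groups costs nothing up front ...
groups = [LazyGroup("ICID_99999_GROUP_%05d" % i) for i in range(6000)]
# ... and only the groups actually touched pay for their metadata.
touched = [groups[0].metadata, groups[42].metadata]
total_reads = sum(g.reads for g in groups)   # 2
```

This matches the use case above: a file with thousands of groups where only one or two are ever read.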

Happy Easter - what's for dinner?

russrew commented 10 years ago

I just tested a pure C version of the same benchmark which has similar performance to the h5py I have from Anaconda:

./groups_nc
File opened in write mode (0.684 s)
Groups created (116 s)
File closed (118 s)
File opened in append mode (120 s)
Groups created (122 s)
File closed (123 s)

./groups_h5.py
File opened in write mode (0s)
Groups created (109s)
File closed (109s)
File opened in append mode (109s)
Groups created (110s)

So I can't duplicate a netCDF C library performance problem, at least with the current github release 4.3.2-rc2. In case you want to compare, here's my C code:

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <netcdf.h>

#define ERR do { \
fflush(stdout);  \
fprintf(stderr, "Unexpected result, %s, line: %d\n", \
    __FILE__, __LINE__);                    \
} while (0)

/* Return realtime clock in seconds from some epoch */
double
seconds() {
struct timespec ts;
int ret = clock_gettime(CLOCK_REALTIME, &ts);
double sec = ts.tv_sec + ts.tv_nsec/1.0e9 ;
return sec;
}

int
main(int argc, char *argv[]) {
double t0, t1;
char *nc4_name = "test_groups.nc";
int fid, grp, block;
#define NROWS (1024)
#define NCOLS (512)
#define NDATA (NROWS*NCOLS)
#define NDIMS (2)
float data[NDATA];
int i;
t0 = seconds();
if(nc_create(nc4_name, NC_NETCDF4, &fid)) ERR;
if(nc_def_grp(fid, "BAND1", &grp)) ERR;
t1 = seconds();
printf("File opened in write mode (%.3g s)\n", t1-t0);
for(i = 0; i < NDATA; i++)
    data[i] = rand();
for (block = 0; block < 5000; block++) {
    char block_name[25];
    int sgrp, ssgrp, rowdim, coldim, dims[2], var;
    sprintf(block_name, "ICID_99999_GROUP_%05d", block);
    if(nc_def_grp(grp, block_name, &sgrp)) ERR;
    if(nc_def_grp(sgrp, "GEODATA", NULL)) ERR;
    if(nc_def_grp(sgrp, "INSTRUMENT", NULL)) ERR;
    if(nc_def_grp(sgrp, "OBSERVATIONS", &ssgrp)) ERR;
    if(nc_def_dim(sgrp, "row", NROWS, &dims[0])) ERR;
    if(nc_def_dim(sgrp, "column", NCOLS, &dims[1])) ERR;
    if(nc_def_var(ssgrp, "signal", NC_FLOAT, NDIMS, dims, &var)) ERR;
    if(nc_put_var_float(ssgrp, var, data)) ERR;
}
t1 = seconds();
printf("Groups created (%.3g s)\n", t1-t0);
if(nc_close(fid)) ERR;
t1 = seconds();
printf("File closed (%.3g s)\n", t1-t0);

if(nc_open(nc4_name, NC_WRITE, &fid)) ERR;
t1 = seconds();
printf("File opened in append mode (%.3g s)\n", t1-t0);
if(nc_inq_ncid(fid, "BAND1", &grp)) ERR;
for (block = 5000; block < 6000; block++) {
    char block_name[25];
    int sgrp, ssgrp, rowdim, coldim, dims[2], var;
    sprintf(block_name, "ICID_99999_GROUP_%05d", block);
    if(nc_def_grp(grp, block_name, &sgrp)) ERR;
    if(nc_def_grp(sgrp, "GEODATA", NULL)) ERR;
    if(nc_def_grp(sgrp, "INSTRUMENT", NULL)) ERR;
    if(nc_def_grp(sgrp, "OBSERVATIONS", NULL)) ERR;
}
t1 = seconds();
printf("Groups created (%.3g s)\n", t1-t0);
if(nc_close(fid)) ERR;
t1 = seconds();
printf("File closed (%.3g s)\n", t1-t0);
return(0);
}
shoyer commented 10 years ago

@dierssen OK, no worries. If I do work on h5py netCDF4 implementation I'll definitely update this issue to let you know.

dierssen commented 10 years ago

Hi Jeff,

I tested the new library. Looks good. Here are my results with the old library (I removed the line which actually writes data to the variables, so all the time is spent setting up the file structure):

[dierssen@server test]$ ./groups_netcdf.py
File opened in write mode (2s)
Groups created (38s)
File closed (44s)
File opened in append mode (92s)
Groups created (107s)

And the results with the new library:

(myVE)[dierssen@server myVE]$ ./groups_netcdf.py
File opened in write mode (2s)
Groups created (16s)
File closed (21s)
File opened in append mode (66s)
Groups created (68s)

I still think that a total time of 70 seconds to create, read, and append a file of only 12 MB could be improved, but I can of course imagine that lazy evaluation takes much more time to implement.

@russrew I would also expect a C program to be faster than a Python program in most cases (h5py compared to C netCDF), but I guess there are a lot of factors which play a role. Here we use netcdf-4.3.1.1.

Ahh and here in the Netherlands we ate stamppot during Easter: http://www.stamppotrecepten.net/. Give it a try, I would say :-D.

Greetings, Werner. Some tests failed while installing; I attached the report:

(myVE)[dierssen@server test]$ python run_all.py
not running tst_unicode3.py ...
.....E......E.......................................
======================================================================
ERROR: runTest (tst_dap.DapTestCase)
testing access of data over http using opendap
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dierssen/virtualenv-1.11.4/myVE/netcdf4-python-master/test/tst_dap.py", line 22, in     runTest
    ncfile = netCDF4.Dataset(URL)
  File "netCDF4.pyx", line 1442, in netCDF4.Dataset.__init__ (netCDF4.c:19901)
RuntimeError: NetCDF: Unknown file format

======================================================================
ERROR: test_select_nc (tst_netcdftime.TestDate2index)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dierssen/virtualenv-1.11.4/myVE/netcdf4-python-master/test/tst_netcdftime.py",     line 342, in test_select_nc
    millisecs = int(date2num(d,unix_epoch,calendar='proleptic_gregorian'))
  File "utils.pyx", line 124, in netCDF4.date2num (netCDF4.c:4013)
  File "utils.pyx", line 13, in netCDF4._dateparse (netCDF4.c:2890)
ImportError: dateutil module required for accuracy < 1 second

----------------------------------------------------------------------
Ran 52 tests in 12.809s

FAILED (errors=2)

netcdf4-python version: 1.0.9
HDF5 lib version:       1.8.10
netcdf lib version:     4.3.1.1
jswhit commented 10 years ago

Werner pointed out that the number of groups that can be created appears to be limited to 32767. The netcdf docs mention a limitation for classic formatted files, but not for HDF5 formatted files. Is this limit expected, or is it an unintended consequence of an unsigned short integer being used somewhere in the library? Here's a modified version of Russ's C program that demonstrates the problem:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <netcdf.h>

#define ERR do { \
fflush(stdout);  \
fprintf(stderr, "Unexpected result, %s, line: %d\n", \
    __FILE__, __LINE__);                    \
} while (0)

int
main(int argc, char *argv[]) {
double t0, t1;
char *nc4_name = "test_groups.nc";
int fid, grp, block;
#define NROWS (1024)
#define NCOLS (512)
#define NDATA (NROWS*NCOLS)
#define NDIMS (2)
float data[NDATA];
int i;
if(nc_create(nc4_name, NC_NETCDF4, &fid)) ERR;
if(nc_def_grp(fid, "BAND1", &grp)) ERR;
printf("File opened");
for(i = 0; i < NDATA; i++)
    data[i] = rand();
for (block = 0; block < 32768; block++) {
    char block_name[25];
    int sgrp, ssgrp, rowdim, coldim, dims[2], var;
    sprintf(block_name, "ICID_99999_GROUP_%05d", block);
    printf("Block %d\n",block);
    if(nc_def_grp(grp, block_name, &sgrp)) ERR;
    if(nc_def_dim(sgrp, "row", NROWS, &dims[0])) ERR;
    /*if(nc_def_grp(sgrp, "GEODATA", NULL)) ERR;
    if(nc_def_grp(sgrp, "INSTRUMENT", NULL)) ERR;
    if(nc_def_grp(sgrp, "OBSERVATIONS", &ssgrp)) ERR;
    if(nc_def_dim(sgrp, "row", NROWS, &dims[0])) ERR;
    if(nc_def_dim(sgrp, "column", NCOLS, &dims[1])) ERR;
    if(nc_def_var(ssgrp, "signal", NC_FLOAT, NDIMS, dims, &var)) ERR;
    if(nc_put_var_float(ssgrp, var, data)) ERR;*/
}
printf("Groups created");
if(nc_close(fid)) ERR;
printf("File closed");
return(0);
}

When it runs, all is well until the 32767th group, and then:

Block 32765
Block 32766
Unexpected result, grpmax.c, line: 34
Block 32767

Line 34 is where the dimension is being created inside the group.

russrew commented 10 years ago

It's an undocumented consequence of the netCDF-4 C implementation (which I didn't write, so won't defend). I'll add some documentation today, so thanks for pointing out the omission.

The C library combines the file ID and group ID into a single integer used as a location ID, to save one argument in most of the functions associated with groups. The implementation allocates only the lower 16 bits for the group ID, and thus also forces a limit of at most 32767 open netCDF files. It looks like internally a signed short is used to hold the next group ID, which means it actually only uses 15 of the available 16 bits. I might try changing this to an unsigned short to get another factor of 2 for number of groups. If that breaks anything, I don't think it would be worth a lot of modifications for the extra factor of 2 ...
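The packing described above can be illustrated with a toy sketch (the real layout lives in the C library's internals; the field widths here merely mirror the description: 16 bits reserved for the group ID, with a signed-short counter using only 15 of them):

```python
# Toy illustration of packing a file ID and group ID into one
# "location ID" integer. Not the actual netCDF C code.
GRP_BITS = 16
GRP_MASK = (1 << GRP_BITS) - 1

def make_locid(file_id, grp_id):
    """Combine file ID and group ID into a single integer."""
    return (file_id << GRP_BITS) | grp_id

def split_locid(locid):
    """Recover (file_id, grp_id) from a combined location ID."""
    return locid >> GRP_BITS, locid & GRP_MASK

fid, gid = split_locid(make_locid(file_id=3, grp_id=12345))

# A signed short counter for the next group ID tops out at 2**15 - 1,
# which is where the observed 32767-group ceiling comes from:
max_groups = 2**15 - 1   # 32767
```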