Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

h5netcdf: a new interface for writing netCDF4 files via h5py #390

Open shoyer opened 9 years ago

shoyer commented 9 years ago

This is not exactly a netCDF4-python issue (so feel free to close), but I thought users of this repo might be interested to test out my latest project, h5netcdf, an alternative interface for reading/writing netCDF4 as HDF5 files directly via h5py.

Feedback would be greatly appreciated!

My initial performance tests suggest that it generally has very similar performance to netCDF4-python, except for multi-threaded writes to a single file, for which it is about twice as fast (I tested against v1.1.6). I haven't tested compression yet.
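
If you want to kick the tires, here is a minimal round-trip sketch. It assumes h5netcdf mirrors the netCDF4-python names (Dataset, createDimension, createVariable), which is the project's stated goal; treat the exact API as illustrative.

    import numpy as np
    from h5netcdf import Dataset  # drop-in for: from netCDF4 import Dataset

    ds = Dataset("example.nc", "w")
    ds.createDimension("x", 1000)
    v = ds.createVariable("data", "f8", ("x",))
    v[:] = np.random.rand(1000)
    ds.close()

    ds = Dataset("example.nc", "r")
    data = ds.variables["data"][:]
    ds.close()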

rsignell-usgs commented 9 years ago

Does it read data from OPeNDAP datasets too?

shoyer commented 9 years ago

@rsignell-usgs Nope, it can only do netCDF4/HDF5 files. But you could try pydap for that...

jswhit commented 9 years ago

Very cool - if you combine this with pupynere you could have a pure Python implementation of the netCDF C library (except for the DAP part)! Just a few things I noticed in 15 minutes of testing.

1) On my system (Mac OS X with HDF5 1.8.14), I have to import h5py/h5netcdf before netCDF4. Otherwise, I get this error:

  File "issue371.py", line 2, in <module>
    from h5netcdf import Dataset
  File "build/bdist.macosx-10.10-x86_64/egg/h5netcdf/__init__.py", line 11, in <module>
  File "build/bdist.macosx-10.10-x86_64/egg/h5netcdf/core.py", line 7, in <module>
  File "/Users/jwhitaker/Library/Python/2.7/lib/python/site-packages/h5py-2.5.0-py2.7-macosx-10.10-x86_64.egg/h5py/__init__.py", line 23, in <module>
    from . import _conv
  File "h5py/h5t.pxd", line 14, in init h5py._conv (/Volumes/Drobo/python/h5py/h5py/_conv.c:6914)
  File "h5py/h5t.pyx", line 139, in init h5py.h5t (/Volumes/Drobo/python/h5py/h5py/h5t.c:20306)
  File "h5py/h5t.pyx", line 73, in h5py.h5t.lockid (/Volumes/Drobo/python/h5py/h5py/h5t.c:2514)
  File "h5py/h5t.pyx", line 42, in h5py.h5t.typewrap (/Volumes/Drobo/python/h5py/h5py/h5t.c:2148)
RuntimeError: Interface initialization failed (Not a datatype object)
Segmentation fault

This is not related to h5netcdf per se; I think it has something to do with the way the HDF5 library is initialized in h5py.

2) Fill values are not implemented. Variables seem to be initialized with zeros by default.

3) As you note, unlimited dims are not yet implemented.

rsignell-usgs commented 9 years ago

Just for clarification: this isn't exactly a pure Python implementation, because it still depends on the HDF5 C library via h5py, right?

shoyer commented 9 years ago

Indeed, calling this pure Python is perhaps overly generous -- it does depend on the HDF5 C library. The implementation of netCDF4 on top of HDF5 is pure Python, though.

h5py already supports both fill values and unlimited dimensions, so that should be pretty easy to hook up.
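
In h5py both features are just arguments at dataset creation time; a quick sketch (the file and dataset names here are made up):

    import h5py

    with h5py.File("demo.h5", "w") as f:
        # fill value is set via the fillvalue keyword
        t = f.create_dataset("t", shape=(10,), dtype="f8", fillvalue=-9999.0)
        # an unlimited dimension corresponds to maxshape=None on that axis
        u = f.create_dataset("u", shape=(0,), maxshape=(None,), dtype="f8")
        u.resize((5,))  # grow along the unlimited axis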

shoyer commented 9 years ago

@jswhit as for your initialization issues, that does seem strange/unfortunate. I haven't encountered that personally on OS X (I'm using h5py and netCDF4 via conda).

jswhit commented 9 years ago

Does h5py support "orthogonal indexing" with booleans and integers?

shoyer commented 9 years ago

@jswhit Yes and no. It doesn't do numpy-style broadcasting indexing, so it's not inconsistent with orthogonal indexing, but it also does not support indexing with more than one array; e.g., v[[0, 1], [0, 1]] will raise an exception: http://h5py.readthedocs.org/en/latest/high/dataset.html#fancy-indexing

netCDF4-python can't really support this sort of indexing efficiently, either, so perhaps this is not such a terrible thing. Also, I'll be able to support orthogonal indexing with xray/h5netcdf using dask as an intermediate layer.
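
Concretely, in h5py (a sketch; the exact exception type may vary between versions):

    import h5py
    import numpy as np

    with h5py.File("demo.h5", "w") as f:
        v = f.create_dataset("v", data=np.arange(16).reshape(4, 4))
        a = v[[0, 1], :]     # one index array per selection: allowed
        b = v[:, [0, 2]]     # also allowed
        # v[[0, 1], [0, 1]]  # two index arrays at once raises an exception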

gamaanderson commented 9 years ago

Good work @shoyer! However, I'm intrigued when you say its speed is similar to netCDF4-python's. In my experience h5py is at least 3 times faster than netCDF4-python. Other alternatives (like Nio and Scientific) are also generally faster; on some occasions Scientific.IO.NetCDF was up to 10 times faster for me. Please don't misunderstand me, I love netCDF4-python, it has some wonderful ideas, but at least on my computer it is really slow.

shoyer commented 9 years ago

@gamaanderson Interesting. I'm sure it depends on lots of specifics of your configuration and workflow. For example, I found that scipy.io.netcdf can be about twice as fast as netCDF4-python for reading netCDF3 files. So far, I've only tested reading and writing entire files at once, consisting of a single big array of floating point numbers without any compression, via the xray interface.

If you'd like to give h5netcdf a try, I would be interested to see the performance numbers on your workflows. I recently added support for fill values, so at least for basic usage (I haven't quite figured out how unlimited dimensions are stored in netCDF4 yet) it should be directly interchangeable with netCDF4-python -- I actually have tests verifying that you can do something like import h5netcdf as netCDF4.
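
The drop-in swap looks like this (a sketch, assuming an existing netCDF4 file named example.nc and that the netCDF4-style names carry over, per the tests mentioned above):

    import h5netcdf as netCDF4  # instead of: import netCDF4

    ds = netCDF4.Dataset("example.nc", "r")
    print(list(ds.variables))
    ds.close()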

gamaanderson commented 9 years ago

Couldn't get it to work. First I got an error at core.py line 38, but I overcame that by replacing size = v.size with size = v.shape[0]. (Note: this happened in the case where v was a dimension variable.) But then I got this error:

    File "h5netcdf/core.py", line 197, in _lookup_dimensions
      for axis, dim in enumerate(self._h5ds.dims):
    AttributeError: 'Dataset' object has no attribute 'dims'

This happened on the first execution of that function.

Anyway, could you say which software versions you are using? I'm using: python-h5py 2.0.1-2+b1, libhdf5-7 1.8.8-9+b1, libnetcdfc7 1:4.1.3-6+b1, netcdf-bin 1:4.1.3-6+b1 (I used nccopy to convert the data to netCDF4).

shoyer commented 9 years ago

@gamaanderson I've been developing h5netcdf on h5py 2.4 and 2.5. You'll definitely need at least h5py 2.1, because that's the first release that included support for dimension scales (which are central to the netCDF4 data model). It looks like dataset.size first appeared in 2.1 as well. I'll add a note about minimum versions...
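
For reference, dimension scales are the HDF5 feature used to represent netCDF dimensions; in h5py they look roughly like this (make_scale is the spelling in newer h5py releases; older versions went through the low-level h5py.h5ds API):

    import h5py
    import numpy as np

    with h5py.File("dims.h5", "w") as f:
        x = f.create_dataset("x", data=np.arange(4.0))
        v = f.create_dataset("v", data=np.zeros(4))
        x.make_scale("x")          # mark "x" as a dimension scale
        v.dims[0].attach_scale(x)  # attach it to v's first axis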

gamaanderson commented 9 years ago

Well, I did get it to run; it reduced the time from 19s to 5s. However, I had more problems than expected:

  1. I could not use getattr(ncobj, k); I had to change it to ncobj.attrs[k].
  2. Whenever I did ncvar[:] I got ValueError: Illegal slicing argument for scalar dataspace, so I needed to change it to ncvar._h5ds.value.
  3. I missed netCDF4.chartostring.
  4. When closing the file I got a strange error: "Can't rename attribute (Record is already in b-tree)". I probably did something wrong with the file, changing something in place that netCDF didn't mind but h5py did. I have not investigated deeply, since this does not change the runtime estimate.

Unfortunately I will not be able to use it in my main projects; it's a cooperative effort and some collaborators insist on using the "official" lib. But it does corroborate my opinion that netCDF4-python does need to improve its performance.

shoyer commented 9 years ago

@gamaanderson This feedback is super helpful, thanks!

  1. This was mostly intentional -- I'm not a big fan of overloading attribute access in this somewhat ambiguous way. But if people are using it, then perhaps it's worth implementing... at the very least it should be documented.
  2. It is arguably a bug in netCDF4-python that you can index a scalar array like ncvar[:]; this isn't allowed with numpy arrays, for example (see the sketch just below this list). The better way to index an array that may be scalar is ncvar[...].
  3. I could add a little routine for this, though utility functions for working with netCDF files are generally outside my goals for this project.
  4. I agree, this is strange :).
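
To illustrate (2) with plain numpy:

    import numpy as np

    s = np.array(1.0)  # 0-dimensional (scalar) array
    s[...]             # fine: Ellipsis indexing works on 0-d arrays
    # s[:]             # IndexError: too many indices for a 0-d array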

I'm pleased to hear that you found h5netcdf so much faster for your use case. If you can't share your benchmark script, could you at least roughly summarize what it does? I would like to be able to reproduce these benchmarks myself...

gamaanderson commented 9 years ago

The project I used for testing is pyart (https://github.com/ARM-DOE/pyart); in particular, I just changed the function read_cfradial in cfradial.py. You may want to install the whole library, but I think it would be quite simple to extract the relevant code from that file.

About number 1, I personally prefer .attrs[] to separate what comes from the file from what is Python, but yes, some people are using direct attributes. About number 3, I also don't think it's important; "".join() works just as well.
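
For example, converting a numpy character array to a string without netCDF4.chartostring (a quick sketch):

    import numpy as np

    chars = np.array(list("abc"), dtype="S1")
    s = b"".join(chars).decode()  # 'abc'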

gamaanderson commented 9 years ago

I should also say that it reads a netCDF file following the CfRadial convention entirely into memory, into a more practical structure.

The CfRadial convention is for meteorological radar data in its original spherical coordinates. A typical file has the following header