Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

netCDF4-python writes string (unicode) attributes as 1-d arrays, not scalars #448


shoyer commented 9 years ago

This code writes a single string attribute to an HDF5 file using netCDF4:

# Python 3.4.3
In [1]: import netCDF4

In [3]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr.nc', 'w')

In [4]: ds.units = 'days since 1900'

In [5]: ds.close()

In [7]: !h5dump /Users/shoyer/Downloads/global-attr.nc
HDF5 "/Users/shoyer/Downloads/global-attr.nc" {
GROUP "/" {
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "days since 1900"
      }
   }
}
}

Here's code to do the same thing with h5py:

In [8]: import h5py

In [9]: f = h5py.File('/Users/shoyer/Downloads/global-attr-h5py.nc')

In [10]: f.attrs['units'] = 'days since 1900'

In [11]: f.close()

In [12]: !h5dump /Users/shoyer/Downloads/global-attr-h5py.nc
HDF5 "/Users/shoyer/Downloads/global-attr-h5py.nc" {
GROUP "/" {
   ATTRIBUTE "units" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "days since 1900"
      }
   }
}
}

As you can see from the h5dump output, netCDF4-python writes the attribute with a "simple" dataspace, which corresponds to a one-element array rather than a scalar: https://www.hdfgroup.org/HDF5/doc/UG/UG_frame12Dataspaces.html

In fact, this is exactly what you get if you view the file created with netCDF4-python using h5py (to netCDF4-python and ncdump, they appear identical):

In [13]: f = h5py.File('/Users/shoyer/Downloads/global-attr.nc')

In [14]: f.attrs['units']
Out[14]: array([b'days since 1900'], dtype=object)
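
To make the contrast concrete, here is a minimal read-back sketch (the path and the values in the comments come from the sessions above, not a new run; netCDF4-python appears to collapse the one-element attribute back to a plain string on read, which is why it and ncdump look fine):

    import h5py
    import netCDF4

    path = '/Users/shoyer/Downloads/global-attr.nc'

    # netCDF4-python hides the mismatch: the one-element string attribute
    # comes back as a plain Python string.
    ds = netCDF4.Dataset(path)
    print(repr(ds.units))            # 'days since 1900'
    ds.close()

    # h5py shows what is actually stored: a 1-element object array.
    f = h5py.File(path, 'r')
    print(repr(f.attrs['units']))    # array([b'days since 1900'], dtype=object)
    f.close()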

I believe netCDF4-python should be writing the attribute as a scalar, similar to what it does if you write bytes (or a str on Python 2):

# python 2.7
In [11]: ds = netCDF4.Dataset('/Users/shoyer/Downloads/global-attr-py27.nc', 'w')

In [12]: ds.bytes_str = 'days since 1900'

In [13]: ds.unicode_str = u'days since 1900'

In [14]: ds.close()

In [15]: !h5dump /Users/shoyer/Downloads/global-attr-py27.nc
HDF5 "/Users/shoyer/Downloads/global-attr-py27.nc" {
GROUP "/" {
   ATTRIBUTE "bytes_str" {
      DATATYPE  H5T_STRING {
         STRSIZE 15;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "days since 1900"
      }
   }
   ATTRIBUTE "unicode_str" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): "days since 1900"
      }
   }
}
}

Given that netCDF4-python is simply using the netCDF-C library's nc_put_att_string function, this may very well be a bug upstream in the netCDF-C library.

jswhit commented 9 years ago

Seems like when nc_put_att_text is used, the result is stored as a scalar in the HDF5 file, but if nc_put_att_string is used (when the string is unicode) a simple dataspace is created. Here's the relevant code snippet in _netCDF4.pyx:

    if value_arr.dtype.char == 'U' and not is_netcdf3:
        # a unicode string, use put_att_string (if NETCDF4 file).
        ierr = nc_put_att_string(grp._grpid, varid, attname, 1, &datstring)
    else:
        ierr = nc_put_att_text(grp._grpid, varid, attname, lenarr, datstring)

I think you are right that this is due to how nc_put_att_string is implemented in the C library. It seems to be designed to write arrays of variable length strings.
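
If that's right, a possible workaround on Python 3 is to pass a bytes value so the attribute takes the nc_put_att_text branch above and is stored as a scalar, fixed-length string. A sketch only, with a made-up output path; it assumes a bytes value ends up with numpy dtype 'S' rather than 'U' and so skips the put_att_string branch:

    import netCDF4

    # hypothetical output path, just for illustration
    ds = netCDF4.Dataset('/tmp/global-attr-workaround.nc', 'w')

    # bytes -> numpy dtype 'S', so per the snippet above this should go
    # through nc_put_att_text and come out as a scalar attribute, like the
    # Python 2 bytes_str example in the original report.
    ds.setncattr('units', b'days since 1900')

    ds.close()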

shoyer commented 9 years ago

Should I open a bug report for the C library, then?

jswhit commented 9 years ago

Sure, wouldn't hurt. At the very least maybe we will find out why they chose to do it that way.