OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

HDF5 support for compound datasets, character string datasets #1348

Open mkoohafkan opened 5 years ago

mkoohafkan commented 5 years ago

Using gdalinfo to list HDF5 subdatasets currently does not support compound or scalar string datasets. This is a limitation, as certain spatial metadata such as labels, projections, etc. may be stored in these types of datasets. I presume that reading these types of datasets is not supported either. However, I find it interesting that gdalinfo is able to return string attributes of HDF5 groups just fine (although it does not return attributes for subdatasets), which suggests GDAL already has some ability to handle strings.

Is the HDF5 driver still being supported? If so, is there any interest or capacity to expand the functionality of the HDF5 driver?

piyushrpt commented 5 years ago

Yes. I would be interested in assisting in expanding the functionality.

The HDF5 driver supports reading one specific kind of compound type: the kind that mimics complex datasets written by h5py, i.e., a structure with two members of the same type.
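As a rough illustration of that rule (this is not GDAL's actual code), the check can be sketched in plain Python with the standard `struct` module. The names `is_complex_like` and `decode_complex` are hypothetical, the compound type is modeled as a simple list of (member name, format code) pairs, and the member names `r` and `i` are assumed to match what h5py writes by default for complex data.

```python
import struct

# Hypothetical model of a compound type: a list of (member name, struct format code).
# "f" = 32-bit float, "d" = 64-bit float, "h" = 16-bit int, "i" = 32-bit int.
NUMERIC_CODES = ("f", "d", "h", "i")


def is_complex_like(members):
    """Return True if the compound has exactly two members of the same numeric type."""
    if len(members) != 2:
        return False
    (_, first_code), (_, second_code) = members
    return first_code == second_code and first_code in NUMERIC_CODES


def decode_complex(buffer, members, count):
    """Decode `count` packed little-endian records into Python complex numbers."""
    if not is_complex_like(members):
        raise ValueError("not a two-member, same-type compound")
    code = members[0][1]
    record = struct.Struct("<" + code * 2)  # real part followed by imaginary part
    data = buffer[: record.size * count]
    return [complex(re, im) for re, im in record.iter_unpack(data)]


# Example: two complex64-style records, with member names assumed to be "r" and "i".
H5PY_STYLE = [("r", "f"), ("i", "f")]
raw = struct.pack("<4f", 1.0, 2.0, 3.0, -4.0)
values = decode_complex(raw, H5PY_STYLE, 2)
```

A compound with three members, or with two members of different types, fails the check. That corresponds to the kind of dataset this issue asks about.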

mkoohafkan commented 5 years ago

Thanks @piyushrpt, I think I see some of the code for supporting complex datasets here.

Thinking about the purpose of GDAL, I'm wondering how to best approach this. The goal would not be to just have general-purpose support for HDF5 files, but I think some wiggle room for pulling data from tables to support the various ways that spatial metadata can be stored in HDF files would be helpful.

piyushrpt commented 5 years ago

@mkoohafkan I agree with you that the implementation can be made more general. Here are a couple of things that I thought about but did not find enough time to implement and issue a PR. Maybe you and others can add to this list of thoughts:

  1. I think the best method might be to have a generic HDF5 driver, much like the "raw" dataset driver, with specializations derived from it. The basic driver would only interpret data types and provide read/write functionality for datasets and attributes.

  2. I see that there is COSMO-SkyMed-specific code baked into the driver. I think this is useful functionality, but it should probably be its own driver rather than be baked into the HDF5 driver. This could be a specialization of the generic HDF5 driver.

  3. I think there are quite a few projects that use CF conventions within HDF5 files (not only netCDF). That might be a good starting point for including spatial metadata. This could be a CF specialization of the HDF5 driver. I believe such data is already supported by ESRI and by software like Panoply.

piyushrpt commented 5 years ago

One option that could be considered for compound datatypes is to return each member as a separate band of a dataset. For example, a dataset that is a 2D array of compound types could be interpreted as a GDAL dataset where each band has a different type. This model gets complicated with 3D arrays. One can access individual members of a compound type as shown in this example: https://support.hdfgroup.org/ftp/HDF5/examples/misc-examples/chgfield.c

For the example in the link above, band 1 would be Int32, band 2 would be Float64 and band 3 would be Float32.
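The band-per-member idea can be sketched without any HDF5 library. The following is a minimal illustration, assuming a tightly packed little-endian record of Int32, Float64 and Float32 (a real HDF5 compound type may carry padding and explicit member offsets); the names `FIELDS`, `RECORD` and `split_into_bands` are hypothetical and not part of GDAL or HDF5.

```python
import struct

# Assumed member layout, following the comment above: Int32, Float64, Float32.
# "<" means little-endian with no padding between members.
FIELDS = [("band1", "i", "Int32"), ("band2", "d", "Float64"), ("band3", "f", "Float32")]
RECORD = struct.Struct("<" + "".join(code for _, code, _ in FIELDS))


def split_into_bands(buffer, rows, cols):
    """Turn a rows x cols array of packed records into one 2D list per member."""
    expected = RECORD.size * rows * cols
    if len(buffer) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(buffer)}")
    bands = [[[None] * cols for _ in range(rows)] for _ in FIELDS]
    for index, record in enumerate(RECORD.iter_unpack(buffer)):
        row, col = divmod(index, cols)  # records are stored in row-major order
        for band, value in zip(bands, record):
            band[row][col] = value
    return bands


# Example: a 2 x 2 "image" of compound pixels.
pixels = [(1, 0.5, 1.5), (2, 2.5, 3.5), (3, 4.5, 5.5), (4, 6.5, 7.5)]
raw = b"".join(RECORD.pack(*pixel) for pixel in pixels)
band1, band2, band3 = split_into_bands(raw, rows=2, cols=2)
```

Each returned band has a single type, which fits the 2D case described above. A 3D array of compounds would need an extra convention for mapping the third dimension and the members onto bands, which is where the model becomes complicated.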

mdsumner commented 1 year ago

linking an old request/query for compound types with a specific interpretation

https://trac.osgeo.org/gdal/ticket/6551