enram / data-repository

Data quality assessment
https://enram.github.io/data-repository/
MIT License

Examine HDF5 format #13

Closed: peterdesmet closed this issue 8 years ago

peterdesmet commented 8 years ago

@bartaelterman @stijnvanhoey

adokter commented 8 years ago

I've uploaded a new example that will closely resemble the operational profile. One thing that is still likely to change is the ordering of the datasetXX subfolders in the hdf5 tree.

The description of the hdf5 file has also been updated.

bartaelterman commented 8 years ago

@adokter I don't seem to find a lot of metadata in these files. Is that correct? For instance dataset1/data1/what seems to be empty, so I'm not sure what the values mean.

adokter commented 8 years ago

You can use the h5dump command to list the structure of the hdf5 file. If you use the HDFView GUI, the what folder will appear empty, because attributes are not shown by default; you have to do a Show properties on the folder.
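
To read those attributes programmatically, a minimal h5py sketch (the file name is the one from the dump below; the attribute names follow the ODIM convention):

import h5py

# The gain/nodata/offset/quantity/undetect metadata live as attributes
# on the "what" group itself, which is why HDFView shows it as empty.
with h5py.File("bejab_vp_20151009T0000Z.h5", "r") as f:
    what = f["dataset1/data1/what"]
    for name, value in what.attrs.items():
        print(name, "=", value)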

Here's what I get for the Jabbeke radar:

adriaan@MacAdriaan:~/git/ODIM-hdf5-test/vp$ h5dump -n 1 bejab_vp_20151009T0000Z.h5
HDF5 "bejab_vp_20151009T0000Z.h5" {
FILE_CONTENTS {
 group      /
 attribute  /Conventions
 group      /dataset1
 group      /dataset1/data1
 dataset    /dataset1/data1/data
 group      /dataset1/data1/what
 attribute  /dataset1/data1/what/gain
 attribute  /dataset1/data1/what/nodata
 attribute  /dataset1/data1/what/offset
 attribute  /dataset1/data1/what/quantity
 attribute  /dataset1/data1/what/undetect
 group      /dataset1/data10
 dataset    /dataset1/data10/data
 group      /dataset1/data10/what
 attribute  /dataset1/data10/what/gain
 attribute  /dataset1/data10/what/nodata
 attribute  /dataset1/data10/what/offset
 attribute  /dataset1/data10/what/quantity
 attribute  /dataset1/data10/what/undetect
 group      /dataset1/data11
 dataset    /dataset1/data11/data
 group      /dataset1/data11/what
 attribute  /dataset1/data11/what/gain
 attribute  /dataset1/data11/what/nodata
 attribute  /dataset1/data11/what/offset
 attribute  /dataset1/data11/what/quantity
 attribute  /dataset1/data11/what/undetect
 group      /dataset1/data12
 dataset    /dataset1/data12/data
 group      /dataset1/data12/what
 attribute  /dataset1/data12/what/gain
 attribute  /dataset1/data12/what/nodata
 attribute  /dataset1/data12/what/offset
 attribute  /dataset1/data12/what/quantity
 attribute  /dataset1/data12/what/undetect
 group      /dataset1/data13
 dataset    /dataset1/data13/data
 group      /dataset1/data13/what
 attribute  /dataset1/data13/what/gain
 attribute  /dataset1/data13/what/nodata
 attribute  /dataset1/data13/what/offset
 attribute  /dataset1/data13/what/quantity
 attribute  /dataset1/data13/what/undetect
 group      /dataset1/data14
 dataset    /dataset1/data14/data
 group      /dataset1/data14/what
 attribute  /dataset1/data14/what/gain
 attribute  /dataset1/data14/what/nodata
 attribute  /dataset1/data14/what/offset
 attribute  /dataset1/data14/what/quantity
 attribute  /dataset1/data14/what/undetect
 group      /dataset1/data15
 dataset    /dataset1/data15/data
 group      /dataset1/data15/what
 attribute  /dataset1/data15/what/gain
 attribute  /dataset1/data15/what/nodata
 attribute  /dataset1/data15/what/offset
 attribute  /dataset1/data15/what/quantity
 attribute  /dataset1/data15/what/undetect
 group      /dataset1/data2
 dataset    /dataset1/data2/data
 group      /dataset1/data2/what
 attribute  /dataset1/data2/what/gain
 attribute  /dataset1/data2/what/nodata
 attribute  /dataset1/data2/what/offset
 attribute  /dataset1/data2/what/quantity
 attribute  /dataset1/data2/what/undetect
 group      /dataset1/data3
 dataset    /dataset1/data3/data
 group      /dataset1/data3/what
 attribute  /dataset1/data3/what/gain
 attribute  /dataset1/data3/what/nodata
 attribute  /dataset1/data3/what/offset
 attribute  /dataset1/data3/what/quantity
 attribute  /dataset1/data3/what/undetect
 group      /dataset1/data4
 dataset    /dataset1/data4/data
 group      /dataset1/data4/what
 attribute  /dataset1/data4/what/gain
 attribute  /dataset1/data4/what/nodata
 attribute  /dataset1/data4/what/offset
 attribute  /dataset1/data4/what/quantity
 attribute  /dataset1/data4/what/undetect
 group      /dataset1/data5
 dataset    /dataset1/data5/data
 group      /dataset1/data5/what
 attribute  /dataset1/data5/what/gain
 attribute  /dataset1/data5/what/nodata
 attribute  /dataset1/data5/what/offset
 attribute  /dataset1/data5/what/quantity
 attribute  /dataset1/data5/what/undetect
 group      /dataset1/data6
 dataset    /dataset1/data6/data
 group      /dataset1/data6/what
 attribute  /dataset1/data6/what/gain
 attribute  /dataset1/data6/what/nodata
 attribute  /dataset1/data6/what/offset
 attribute  /dataset1/data6/what/quantity
 attribute  /dataset1/data6/what/undetect
 group      /dataset1/data7
 dataset    /dataset1/data7/data
 group      /dataset1/data7/what
 attribute  /dataset1/data7/what/gain
 attribute  /dataset1/data7/what/nodata
 attribute  /dataset1/data7/what/offset
 attribute  /dataset1/data7/what/quantity
 attribute  /dataset1/data7/what/undetect
 group      /dataset1/data8
 dataset    /dataset1/data8/data
 group      /dataset1/data8/what
 attribute  /dataset1/data8/what/gain
 attribute  /dataset1/data8/what/nodata
 attribute  /dataset1/data8/what/offset
 attribute  /dataset1/data8/what/quantity
 attribute  /dataset1/data8/what/undetect
 group      /dataset1/data9
 dataset    /dataset1/data9/data
 group      /dataset1/data9/what
 attribute  /dataset1/data9/what/gain
 attribute  /dataset1/data9/what/nodata
 attribute  /dataset1/data9/what/offset
 attribute  /dataset1/data9/what/quantity
 attribute  /dataset1/data9/what/undetect
 group      /how
 attribute  /how/beamwidth
 attribute  /how/clutterMap
 attribute  /how/comment
 attribute  /how/maxazim
 attribute  /how/maxrange
 attribute  /how/minazim
 attribute  /how/minrange
 attribute  /how/rcs_bird
 attribute  /how/sd_vvp_thresh
 attribute  /how/task
 attribute  /how/task_args
 attribute  /how/task_version
 attribute  /how/wavelength
 group      /what
 attribute  /what/date
 attribute  /what/object
 attribute  /what/source
 attribute  /what/time
 attribute  /what/version
 group      /where
 attribute  /where/height
 attribute  /where/interval
 attribute  /where/lat
 attribute  /where/levels
 attribute  /where/lon
 attribute  /where/maxheight
 attribute  /where/minheight
 }
}
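
To interpret those per-dataset attributes: in the ODIM convention the stored array is unpacked as physical value = raw * gain + offset, with nodata and undetect marking special raw codes. A minimal sketch for one quantity (assuming the same file as above):

import h5py
import numpy as np

with h5py.File("bejab_vp_20151009T0000Z.h5", "r") as f:
    grp = f["dataset1/data1"]
    raw = grp["data"][...].astype(float)
    a = grp["what"].attrs
    values = raw * a["gain"] + a["offset"]   # unpack to physical units
    values[raw == a["nodata"]] = np.nan      # no measurement possible
    values[raw == a["undetect"]] = np.nan    # measured, but below detection
    print(a["quantity"], values.ravel()[:5])
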
stijnvanhoey commented 8 years ago

I tested the data format of the hdf5 file in the following notebook: https://github.com/enram/infrastructure/blob/master/hdf5_handling/hdf5_check.ipynb

Metadata can be easily extracted using existing Python packages such as h5py or pytables. Functions to extract the metadata/data are written as test cases in the notebook.
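
As a sketch of such an extraction (not the notebook's code), h5py's visititems can collect every attribute in the tree into a flat dict:

import h5py

def collect_metadata(path):
    # Walk the full hdf5 tree and gather every attribute into a flat dict.
    meta = {}
    with h5py.File(path, "r") as f:
        for key, value in f.attrs.items():  # root attributes, e.g. /Conventions
            meta["/" + key] = value
        def visit(name, obj):
            for key, value in obj.attrs.items():
                meta["/" + name + "/" + key] = value
        f.visititems(visit)
    return meta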

However, I'm wondering why the individual files are all so small; this seems to forgo a key advantage of hdf5, which is designed to handle very large datasets. It connects to the discussion about whether or not to collect the data in a dbase. Because the individual files are so small, a download service or query capability would have to iterate over a lot of files if we only store metadata in a dbase.

I quickly checked, and it could be interesting to think about some aggregation. In the last section of the notebook, an aggregation to daily level is performed, and the compatibility with pandas makes a (daily) query easy, as sketched below. Then again, if we make that effort, we could just as well opt to put the data in a dbase.
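
A sketch of that aggregation idea (not the notebook's code; it assumes the ODIM /what date and time attributes are byte strings and reads only the first quantity):

import glob
import h5py
import pandas as pd

def read_profile(path):
    # One small vp file -> a DataFrame with one timestamp for all levels.
    with h5py.File(path, "r") as f:
        w = f["what"].attrs
        stamp = pd.to_datetime(w["date"].decode() + w["time"].decode(),
                               format="%Y%m%d%H%M%S")
        a = f["dataset1/data1/what"].attrs
        values = f["dataset1/data1/data"][...].astype(float)
        values = values * a["gain"] + a["offset"]
    return pd.DataFrame({"datetime": stamp, "value": values.ravel()})

# Concatenate many small files into one frame indexed by time
frames = [read_profile(p) for p in sorted(glob.glob("vp/bejab_vp_*.h5"))]
day = pd.concat(frames).set_index("datetime")
print(day.loc["2015-10-09"])  # the daily query becomes a single index lookup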

adokter commented 8 years ago

The main reason for using ODIM hdf5 is that it's the standard data exchange format at the meteorological datahubs; we simply need to conform to that specification if we want to integrate the bird product generation in the datahub.

The processing at the datahub is a simple file-in, file-out operation (so the source data is large, but the bird product is very small).

I have nothing against aggregation, but it can't happen at the meteorological datahub; it would have to be implemented by us as an extra step.

stijnvanhoey commented 8 years ago

Thanks for clarifying; I certainly do not want to question the original hdf5 file format. It is an important condition that should be taken into account. For the data products offered to users (download requests, services), there are different options. I'm wondering whether it would be most useful to put all the data and metadata in a dbase for the download service, or only the metadata as suggested in issue #4. In the latter case, a user query would result in collecting data from a high number of small hdf5 files. What would you suggest?
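
For the metadata-only option, a minimal sketch of what such an index could look like (the table layout and file names are hypothetical):

import sqlite3

# Hypothetical index: one row per small hdf5 file, so a user query
# returns file paths that still have to be opened one by one.
con = sqlite3.connect("vp_index.db")
con.execute("CREATE TABLE IF NOT EXISTS profiles (radar TEXT, datetime TEXT, path TEXT)")
con.execute("INSERT INTO profiles VALUES (?, ?, ?)",
            ("bejab", "2015-10-09T00:00", "vp/bejab_vp_20151009T0000Z.h5"))
con.commit()
hits = con.execute("SELECT path FROM profiles WHERE radar = ? AND datetime LIKE ?",
                   ("bejab", "2015-10-09%")).fetchall()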

adokter commented 8 years ago

I think having a dbase with all the data that can be queried would be very handy, but the decision also depends on the feasibility, time, and resources we have at the moment. Arguments against it are:

In an earlier discussion @peterdesmet suggested a directory tree plus a service that shows what's available, which is therefore the more feasible option. But a dbase would be handier, because it's more flexible and you no longer have to deal with a multitude of files (a problem that remains even if you aggregate to days).