aodn / data-services

Scripts which are used to process incoming data in the data ingestion pipeline
GNU General Public License v3.0

Need a unified source of standard netCDF attributes #525

Open mhidas opened 8 years ago

mhidas commented 8 years ago

Many of our processes that create netCDF files require templates to set global (and variable) attributes. Currently this is being done in a variety of different ways, from multiple sources, often setting the same basic attributes (project, acknowledgements, etc.):

There may be others too...

There are two issues here:

  1. We have redundant code doing the same thing in different ways.
  2. We have redundant versions of the same standard global attributes in several locations.

It would be helpful to come up with a solution that removes, or at least minimises, both of these issues.

mhidas commented 8 years ago

In my opinion CDL would be the most appropriate format to use for these templates. These can be very easily converted into a netCDF file with ncgen and then opened as a NetCDF4 Dataset object.
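
For illustration, a minimal sketch of that workflow (file paths are hypothetical), calling ncgen in a subprocess and opening the result with the netCDF4 library:

import subprocess
from netCDF4 import Dataset

# convert the CDL template into a netCDF file
subprocess.check_call(['ncgen', '-o', '/tmp/template.nc', '/tmp/template.cdl'])

# open the generated file as a Dataset object and list its global attributes
with Dataset('/tmp/template.nc') as nc:
    print(nc.ncattrs())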

As for the location, something like data-services/lib/netcdf/templates would make sense. The tricky bit will be how to make this available to the Matlab Toolbox as well...

lbesnard commented 8 years ago

converted into a netCDF file with ncgen

The only issue with this is that we would have to do a subprocess call from Python, which is never an ideal solution.

mhidas commented 6 years ago

Format options for storing netCDF attributes

There are many ways fixed attribute values (such as the license, disclaimer and acknowledgement strings required by the IMOS conventions) could be stored. The idea is to keep these separate from code that creates netCDF files. Here are some options:

CDL (Common Data Language)

Currently, IMOSnetCDF.py in data-services uses text files like this, which are not complete CDL files, but they store global and variable attributes the same way as CDL.
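
For reference, a minimal hypothetical fragment in that style, storing one variable attribute and two global attributes (names and values are illustrative only):

netcdf template {
dimensions:
    TIME = UNLIMITED ;
variables:
    float TEMP(TIME) ;
        TEMP:units = "degrees_Celsius" ;

// global attributes:
    :project = "Integrated Marine Observing System (IMOS)" ;
    :Conventions = "CF-1.6,IMOS-1.4" ;
}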

Pros

conf (Windows INI syntax)

e.g. https://github.com/aodn/data-services/blob/master/SOOP/SOOP_XBT/DELAYED/generate_nc_file_att
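
A hypothetical fragment in that style (section and attribute names are illustrative, not taken from the file above):

[global_attributes]
project = Integrated Marine Observing System (IMOS)
Conventions = CF-1.6,IMOS-1.4

[TEMP]
units = degrees_Celsius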

Pros

json

Pros

IMOS Toolbox

e.g. global attributes: https://github.com/aodn/imos-toolbox/blob/master/NetCDF/template/global_attributes_timeSeries.txt
variable attributes: https://github.com/aodn/imos-toolbox/blob/master/IMOS/imosParameters.txt

Pros

Python

Attribute values can be specified directly in a Python module and simply imported, e.g. https://github.com/aodn/data-services/blob/master/ACORN/current_generator/acorn_constants.py
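
For example, a hypothetical constants module (names and values are illustrative):

# netcdf_attributes.py (hypothetical module)
PROJECT = 'Integrated Marine Observing System (IMOS)'
CONVENTIONS = 'CF-1.6,IMOS-1.4'

TEMP_ATTRIBUTES = {
    'units': 'degrees_Celsius',
    'standard_name': 'sea_water_temperature',
}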

Pros

Other possibilities

mhidas commented 6 years ago

@smancini @lbesnard @ggalibert @bpasquer To be discussed.

ocehugo commented 6 years ago

@mhidas,

I have some code that reads CDL from Python.

It needs some refactoring of the regexes (it's pretty primitive). It doesn't support groups, but there is a battery of tests based on the Unidata website CDLs, and it's pure Python using only the re module.

It works for my needs, but I would like to extend it with more regexes and support for groups.

Let me know if you wanna have a go.

PS: there is also TOML.

lbesnard commented 6 years ago

https://github.com/rockdoc/cdlparser https://github.com/powellb/seapy/blob/e8fcdcc72fe9203e36e357e1fde43270176d54d6/seapy/cdl_parser.py

lbesnard commented 6 years ago

After doing a bit of testing, I'm of the opinion that the https://github.com/rockdoc/cdlparser code could be what we are after. It parses CDL files, and can either generate a netCDF file or keep the netCDF object open in Python.

example:

1) in bash

ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl

2) in python

from cdlparser import CDL3Parser

myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4_CLASSIC')
ncdataset = myparser.parse_file("/tmp/soop.cdl",ncfile="/tmp/soop.nc")
ncdataset.close()

However, the generated netCDF file ends up missing some attributes present in the original CDL (and maybe other things). The code would have to be forked and improved.

bpasquer commented 6 years ago

When I was working on generating files with the netcdf generator two years ago, I remember having issues generating NetCDF4 files using CDL. It's probably been improved since then, so if cdlparser meets our needs as @lbesnard is suggesting (though further tests are needed), using CDL is the best option in my opinion.

ocehugo commented 6 years ago

If you are going to use https://github.com/rockdoc/cdlparser, check whether they have implemented Python 3 support.

I did a PR to support Python 3 some time ago. It didn't get into upstream because my PR was disorganised. My code already worked fine for my case, so I didn't push it forward. I can't remember exactly why I wrote new code... maybe cdlparser was failing on something, or I wasn't aware of it at the time, or I was too optimistic about finishing my code with group support and all the bells and whistles.

Anyway, I don't see a problem with raw Python code either, but JSON is a clear win if you plan to read those things over the wire.

lbesnard commented 6 years ago

Re https://github.com/rockdoc/cdlparser, I was having issues with the generated file missing many attributes and data. This was because I used file_format='NETCDF4_CLASSIC' instead of file_format='NETCDF4'. Apparently a bug in the netCDF library.

So, doing the following (see below) will create a correct netCDF file from CDL. I'll keep on testing with other datasets:

1) in bash

ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl

2) in python

from cdlparser import CDL3Parser

myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4')
ncdataset = myparser.parse_file("/tmp/soop.cdl", ncfile="/tmp/soop.nc")
ncdataset.close()

3) bash, to check the diff between the new netcdf and the cdl file

diff /tmp/soop.cdl <(ncdump /tmp/soop.nc)

lbesnard commented 6 years ago

check if they implemented python3 support.

No, they didn't, but we use Python 2.7 anyway.

mhidas commented 6 years ago

We do want to eventually move to python 3 though! (Support for Python 2 ends in a year and a half, and many packages we use will stop supporting it before then - see e.g. https://python3statement.org/)

ggalibert commented 6 years ago

My preference would go for CDL, especially if we're only hard coding the content of the attributes.

But if we want to be a bit more flexible and allow for external resources to document some attributes (like when the toolbox tries to inject information from a deployment database or a Matlab expression), then JSON might be neater. @jonescc correct me if I'm wrong, but I think this is what you are already doing in gogoduck and the netcdf generator?

jonescc commented 6 years ago

The netcdf generator and gogoduck use their own XML format. GeoServer uses XML for its configuration, and this is where they were located at the time. We had to write support for translating those formats into netCDF attributes, which you would also have to do if you don't use an existing mechanism such as CDL or NcML.

mhidas commented 6 years ago

After chatting to @ocehugo and playing around in Python, I am now only half-convinced that CDL is the way to go.

For just global attributes there's no problem: you can have a valid CDL (and equivalent netCDF) file containing just global attributes. However, we also want to store variable attributes, and to put them in a valid CDL file we need to define the variables themselves, complete with data type and dimensions. We can do that too, though it is a bit of overkill when we just want to specify a few attributes.

More importantly, when creating these templates, we don't actually know what the exact structure of the final file will be, or at least the size of the dimensions. So we can't just read a CDL template straight into a netCDF4.Dataset object (as cdlparser does), add data, then save to netCDF. Before we can add data, we have to get the dimensions right, but once a dimension or variable is defined in a Dataset object, you can't change its structure.
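
A small sketch of that constraint (file path is hypothetical):

from netCDF4 import Dataset

nc = Dataset('/tmp/example.nc', 'w')
nc.createDimension('TIME', 10)                      # size fixed at this point
temp = nc.createVariable('TEMP', 'f8', ('TIME',))
# there is no way to resize 'TIME' or change TEMP's dimensions afterwards;
# the structure has to be right before any data can be added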

Instead, the workflow needs to be something like this:

[template file] => [intermediate Python object]
                => set correct dimension sizes and variable dimensions
                => convert to netCDF4 object
                => add data arrays
                => write file

The obvious [intermediate Python object] would be a dictionary (with nested dicts for each variable), which could either be defined in Python code, or in json.
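
For example, such a template dictionary might look like this (structure and values are hypothetical):

template = {
    'global_attributes': {
        'project': 'Integrated Marine Observing System (IMOS)',
        'title': '',
    },
    'dimensions': {
        'TIME': None,    # size filled in later, once the data are known
    },
    'variables': {
        'TEMP': {
            'type': 'float32',
            'dimensions': ['TIME'],
            'attributes': {'units': 'degrees_Celsius'},
        },
    },
}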

ghost commented 6 years ago

If you need to manipulate it from Python, then it's kind of a moot point with JSON vs. dict... at that point JSON is effectively a serialised dict and a dict is a deserialised JSON object.

we need to defile the variables themselves

We definitely need to avoid defiling things if possible, but that Freudian slip kind of leads onto the next thought I had... you can complement a JSON structure with http://json-schema.org/, as it says on the page it "Describes your existing data format(s)."

Having totally unconstrained JSON is just a big bag of keys and values which is a recipe for bugs. You can't get particularly rich types in JSON, but you can at least avoid totally arbitrary data structures.
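
A minimal sketch of that idea, assuming the jsonschema Python package and a hypothetical template structure:

import json
from jsonschema import validate

# hypothetical schema constraining the top-level structure of a template file
schema = {
    'type': 'object',
    'required': ['global_attributes', 'variables'],
    'properties': {
        'global_attributes': {'type': 'object'},
        'dimensions': {'type': 'object'},
        'variables': {'type': 'object'},
    },
}

with open('template.json') as f:
    template = json.load(f)

validate(instance=template, schema=schema)   # raises ValidationError if the structure is wrong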

ggalibert commented 6 years ago

The other advantage of using JSON or XML, as opposed to CDL, is that you could define the file format (NETCDF3, NETCDF4, etc.), chunking and compression level per variable.

ocehugo commented 6 years ago

@ggalibert, this is the exact reason why I wrote a class-wrapper for writing netCDF4.Datasets (that sit below the cdlreader).

The class, DictDataset, is initialised with three different dictionaries: dimensions, variables and global attributes. It postpones the actual netCDF Dataset creation to its "create" method. You can define variable dimensions as keys in the other dictionary, as well as chunking/compression per variable. It's useful because everything is within a single dictionary structure that can be reused or even summed with other DictDatasets (the class has an "add" method).

@mhidas saw it in action, and it requires only some small fixes/changes depending on how the template is to be defined:

atemplate.py => x = DictDataset(from_file=atemplate.py)
                => set correct dimension sizes and variable dimensions (x.dimensions['x'] = X) #if 'x' in template is different from X
                => convert to netCDF4 object (x.set_file(outputfile),x.create())
                => add data arrays (x.ncobj[varname][:] = var)
                => write file (x.ncobj.sync())
mhidas commented 6 years ago

Basic functionality of new NetCDF writer module

To support the workflow proposed above, we will create a new Python package for writing netCDF files using templates. If @ocehugo is happy to contribute his code, it could be based on his DictDataset class. At the minimum, it will need to implement the basic functionality described below.

Note that reading in or creating the data values is outside the scope of this package. The most convenient way to provide the data would be in numpy arrays or a Pandas dataframe.

Read template

Read one or more template files and return a dictionary-like template object (e.g. DictDataset) representing the file structure and attributes. The template file format will be JSON (optionally readers for other formats could be implemented). e.g.

template = DictDataset(from_file='template.json')

Update template

Update dimensions, variables and attributes in a template object. This should be as simple as adding or updating entries in a dictionary. e.g.

template.dimensions['TIME'] = 100
template.variables['PRES']['units'] = 'dbar'
template.title = 'Test dataset'

Create netCDF object

Create a netCDF4.Dataset object and add the dimensions, variables and attributes specified in the template. e.g.

template.create(filename)   # user-specified file name
template.create()           # auto-generated file name using IMOS conventions

This could actually be called automatically at the start of the add_data method, if it hasn't been explicitly called yet.

Add data

Add values from numpy arrays or a Pandas dataframe into the variables defined in the template. This is already done in several existing bits of data-services/aodndata code, but they could be slightly streamlined by offering a single function to do it. If the column names in a dataframe match the variable names in the template, the code can match them up automatically. Otherwise each individual data array will need to be specified separately. e.g.

template.add_data(dataframe)
template.add_data(TIME=time_values, TEMP=temp_values, PRES=pres_values)

Write file

Close the netCDF object to finish writing the file.
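
e.g. (the method name is an assumption, following the interface sketched above):

template.close()   # hypothetical: flushes and closes the underlying netCDF4.Dataset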

ghost commented 6 years ago

The classmethod alternate constructor pattern would be a good fit to decouple the object from the source format, e.g.

template1 = DictDataset(source_dict)
template2 = DictDataset.from_json(path='template.json')
template3 = DictDataset.from_ini(path='template.ini')
ggalibert commented 6 years ago

Could we still add data by just doing:

template.variables['PRES'] = pres_values

?

ocehugo commented 6 years ago

I'm happy to change the DictDataset code I have.

I think the first thing to raise is some test cases (json/ini/cdl).

@mhidas, can you pull out some (or all) of the AODN templates already in use? I assume we would like to avoid rewriting them from the start (to avoid breaking things) and instead move them slowly to JSON.

The DictDataset at the moment accepts 3 dict inputs (dims, vars, global_att). It is this way because that is how my cdlreader outputs things from a valid CDL.

Some things in my mind now:

  1. Do you guys think it's better to have a single input dict to rule them all!? This would match json...

    template1 = DictDataset(source_dict)

    or

    template1 = DictDataset(dims=d_dims, vars=d_vars, gattr=d_gattr)

  2. Is "append mode" to an already created NetCDF file required in the short term? This would force us to provide a template from a netCDF4.Dataset object:

    template = DictDataset.from_dataset(path='file.nc')

  3. I think delaying the creation of the Dataset until after the add_data method is a better strategy. This would allow dimension specifications to be evaluated later, plus other things. Just store the data as a "value" key (validation would be at create() anyway).

  4. Backends: netCDF4 is assumed from the start. I can't see it going anywhere anytime soon, but could h5netcdf be an option!?

ghost commented 6 years ago

You guys think its better to have a single input dict to rule them all!? This would match json...

Doesn't really matter in terms of the design pattern; both would work depending on what makes sense to you guys. Ideally the regular init could just take native Python dicts to construct the instance, and the from_json/from_dataset etc. methods can then basically be wrappers which retrieve/transform things into dicts to feed into init, e.g.:

import json


class DictDataset(object):
    def __init__(self, d_dims, vars, gattr):
        pass

    @classmethod
    def from_json(cls, path):
        # load the JSON file into a dict
        with open(path) as f:
            template = json.load(f)

        # e.g. this could call out to JSON Schema to make sure the JSON has the expected
        # high-level structure, and could refuse to create the object right here if it wasn't correct
        validate_template(template)

        # instantiate using the regular __init__ method
        return cls(d_dims=template['d_dims'], vars=template['vars'], gattr=template['gattr'])

So if you want to instantiate from a Python context where you have dicts already, you just use the regular init, e.g.

d_dims = dict()
vars = dict()
gattr = dict()

my_dataset = DictDataset(d_dims, vars, gattr)

Or if you want to start from a JSON file you use the classmethod constructor and ultimately end up with the same object:

template.json

{
    "d_dims": {},
    "vars": {},
    "gattr": {}
}
my_dataset = DictDataset.from_json(path='template.json')

In any case, this pattern totally decouples the source format from the Python object, because after initialisation it makes zero difference where the original parameters came from.

It's great for creating different flavours of pizza too: https://realpython.com/instance-class-and-static-methods-demystified/#delicious-pizza-factories-with-classmethod

mhidas commented 6 years ago

template.variables['PRES'] = pres_values

Not quite. template.variables['PRES'] needs to be a dictionary, so it can store the variable's type, dimensions, and attribute values. We could replicate the netCDF4 interface, so you can do this:

template.variables['PRES'][:] = pres_values
mhidas commented 6 years ago

You guys think its better to have a single input dict to rule them all!?

That's what I was thinking, but it probably doesn't really matter. I think most of the code would be reading the template from a file, so it wouldn't actually need to construct the One dict (or call the constructor with three dicts in the other case). I guess having the init() accept three separate dicts makes it easier to create a simple template that specifies e.g. only global attributes:

template = DictDataset(gattr={'title': 'test file', 'author': 'me'})

By the way, I think we should use ordered dicts, so the template can specify the order in which attributes are written to the file.

"append mode" to an already created NetCDF file is required in the the short term?

I don't think that's needed for our main use case. Could be something to add later.

I think delay the creation of the Dataset after add_data method is a better strategy.

Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.
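
A rough sketch of what that inference could look like (the helper and its dimension-naming convention are assumptions, not part of the existing code):

import numpy as np

def infer_variable_spec(values, dim_names):
    """Hypothetical helper: derive a variable's type and dimensions from the data array."""
    values = np.asarray(values)
    return {
        'type': values.dtype,                          # e.g. dtype('float64')
        'dimensions': tuple(dim_names[:values.ndim]),  # assumes one dimension name per array axis
    }

spec = infer_variable_spec(np.arange(100.0), ['TIME'])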

backends

netCDF4 is all we need at this point.

mhidas commented 6 years ago

The new package will be developed here: https://github.com/aodn/aodn-netcdf-tools

ocehugo commented 6 years ago

Just a final follow up before discussions over aodn-netcdf-tools:

I just had a discussion with @mhidas regarding @lwgordonimos's suggestion:

The actual code uses a more implicit/inheritance style with parent/child classes. Given the scope is reduced and not much is to be done by the "from_*" functions, I think the suggestion is good and will be simpler than creating a class (say JsonDataset) that inherits from others.

By the way, I think we should use ordered dicts, so the template can specify the order they are written to file.

I don't foresee any problems with that if we stick with Python 3. AFAIK, dicts in Python 3 preserve insertion order. Anyway, this would be an easy change.

Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.

This is already implemented in a crude way: you set up the output file with A.set_output('/tmp/abc.nc') and then call A.create(). Some consistency checks are done at init time, however. At the time I was thinking not of the template itself, but of writing everything from a smaller set of calls.

Also, filling the ncobj variables is not handled, but you can do it after the create step by:

A.ncobj['PRES'][:] = pres_values