mhidas opened this issue 8 years ago
In my opinion CDL would be the most appropriate format to use for these templates. These can be very easily converted into a netCDF file with ncgen and then opened as a NetCDF4 Dataset object.
As for the location, something like data-services/lib/netcdf/templates would make sense. The tricky bit will be how to make this available to the Matlab Toolbox as well...
converted into a netCDF file with ncgen
The only issue with this is that we would have to do a subprocess call from Python, which is never an ideal solution.
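For what it's worth, a minimal sketch of that subprocess approach (assuming ncgen is on the PATH; the function name is just illustrative):

import subprocess
from netCDF4 import Dataset

def dataset_from_cdl(cdl_path, nc_path):
    # run ncgen to turn the CDL template into a netCDF file...
    subprocess.check_call(['ncgen', '-o', nc_path, cdl_path])
    # ...then reopen it in append mode so data and further attributes can be added
    return Dataset(nc_path, 'a')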
There are many ways fixed attribute values (such as the license, disclaimer and acknowledgement strings required by the IMOS conventions) could be stored. The idea is to keep these separate from the code that creates netCDF files. Here are some options:
Currently, IMOSnetCDF.py in data-services uses text files like this, which are not complete CDL files, but they store global and variable attributes the same way as CDL.
e.g. https://github.com/aodn/data-services/blob/master/SOOP/SOOP_XBT/DELAYED/generate_nc_file_att
e.g. global attributes: https://github.com/aodn/imos-toolbox/blob/master/NetCDF/template/global_attributes_timeSeries.txt variable attributes: https://github.com/aodn/imos-toolbox/blob/master/IMOS/imosParameters.txt
Attribute values can be specified directly in a Python module and simply imported, e.g. https://github.com/aodn/data-services/blob/master/ACORN/current_generator/acorn_constants.py
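As a rough illustration of that last option (the module name and attribute values below are placeholders, not taken from acorn_constants.py):

# attribute_constants.py -- illustrative sketch only; values are placeholders
IMOS_GLOBAL_ATTRIBUTES = {
    'project': 'Integrated Marine Observing System (IMOS)',
    'acknowledgement': '...',
    'disclaimer': '...',
    'license': '...',
}

# in the file-creation code:
# from attribute_constants import IMOS_GLOBAL_ATTRIBUTES
# nc_dataset.setncatts(IMOS_GLOBAL_ATTRIBUTES)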
@smancini @lbesnard @ggalibert @bpasquer To be discussed.
@mhidas,
I have some code that reads CDL from Python.
It needs some refactoring of the regexes (it's pretty primitive). It doesn't support groups, but there is a battery of tests based on the Unidata website CDLs, and it's pure Python using only the re module.
It works for my needs, but I would like to extend it with more regexes and group support.
Let me know if you wanna have a go.
PS: there is also TOML.
After doing a bit of testing, I'm of the opinion that the https://github.com/rockdoc/cdlparser code could be what we are after. It parses CDL files and either generates a netCDF file or keeps the netCDF object open in Python.
example:
1) in bash
ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl
2) in python
from cdlparser import CDL3Parser
myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4_CLASSIC')
ncdataset = myparser.parse_file("/tmp/soop.cdl",ncfile="/tmp/soop.nc")
ncdataset.close()
However, the generated netCDF file ends up missing some attributes (and maybe other things) that are present in the original CDL. The code would have to be forked and improved.
When I was working on generating files with the netcdfgenerator 2 years ago, I remember having issues generating NetCDF4 files using CDL. It's probably been improved since then, so if the cdlparser meets our needs as @lbesnard is suggesting (though further tests are needed), using CDL is the best option in my opinion.
If you are going to use https://github.com/rockdoc/cdlparser, check if they implemented Python 3 support.
I did a PR to add Python 3 support some time ago. It didn't get into upstream because my PR was disorganized. My code already worked fine for my case, so I didn't push forward. I can't remember exactly why I wrote new code... maybe cdlparser was failing on something, or I wasn't aware of it at the time, or I was too optimistic about finishing my code with group support and all the bells and whistles.
Anyway, I don't see a problem with raw Python code either, but JSON is a clear win if you plan to read these things over the wire.
Re https://github.com/rockdoc/cdlparser: I was having issues with the generated file missing many attributes and data. This was because I used file_format='NETCDF4_CLASSIC' instead of file_format='NETCDF4'. Apparently a bug in the netCDF library.
So doing the following (see below) will create a correct NetCDF file from a CDL file. I'll keep testing with other datasets:
1) in bash
ncdump http://thredds.aodn.org.au/thredds/dodsC/IMOS/SOOP/SOOP-SST/9HA2479_Pacific-Sun/2010/IMOS_SOOP-SST_MT_20101212T120000Z_9HA2479_FV01_C-20120528T071954Z.nc > /tmp/soop.cdl
2) in python
from cdlparser import CDL3Parser
myparser = CDL3Parser(close_on_completion=False, file_format='NETCDF4')
ncdataset = myparser.parse_file("/tmp/soop.cdl", ncfile="/tmp/soop.nc")
ncdataset.close()
3) bash, to check the diff between the new netcdf and the cdl file
diff /tmp/soop.cdl <(ncdump /tmp/soop.nc)
check if they implemented Python 3 support.
No, they didn't, but we use Python 2.7 anyway.
We do want to eventually move to python 3 though! (Support for Python 2 ends in a year and a half, and many packages we use will stop supporting it before then - see e.g. https://python3statement.org/)
My preference would go for CDL, especially if we're only hard coding the content of the attributes.
But if we want to be a bit more flexible and allow for external resources to document some attributes (like when the toolbox tries to inject information from a deployment database or a Matlab expression) then JSON might be neater. @jonescc correct me if I'm wrong but I think this is what you are already doing in gogoduck and netcdf generator?
The netcdf generator and gogoduck use their own xml format. Geoserver uses xml for its configuration and this is where they were located at the time. We had to write support for translating those formats into netcdf attributes which you would also have to do if you don't use an existing mechanism such as cdl or ncml.
After chatting to @ocehugo and playing around in Python, I am now only half-convinced that CDL is the way to go.
For just global attributes there's no problem: you can have a valid CDL (and equivalent netCDF) file containing just global attributes. However, we also want to store variable attributes, and to put them in a valid CDL file we need to define the variables themselves, complete with data type and dimensions. We can do that too, though it's a bit of overkill when we just want to specify a few attributes.
More importantly, when creating these templates we don't actually know what the exact structure of the final file will be, or at least the size of the dimensions. So we can't just read a CDL template straight into a netCDF4.Dataset object (as cdlparser does), add data, then save to netCDF. Before we can add data, we have to get the dimensions right, but once a dimension or variable is defined in a Dataset object, you can't change its structure.
Instead, the workflow needs to be something like this:
[template file] => [intermediate Python object]
=> set correct dimension sizes and variable dimensions
=> convert to netCDF4 object
=> add data arrays
=> write file
The obvious [intermediate Python object] would be a dictionary (with nested dicts for each variable), which could either be defined in Python code, or in json.
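For illustration, such a template might look something like this, whether written as a Python literal or the equivalent JSON (all names and values below are just examples):

template = {
    'dimensions': {'TIME': None},  # size left unset until we know the data
    'variables': {
        'TEMP': {
            'datatype': 'float32',
            'dimensions': ['TIME'],
            'attributes': {'standard_name': 'sea_water_temperature',
                           'units': 'degrees_Celsius'},
        },
    },
    'global_attributes': {'project': 'Integrated Marine Observing System (IMOS)'},
}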
If you need to manipulate it from Python, then it's kind of a moot point with JSON vs. dict... at that point JSON is effectively a serialised dict and a dict is a deserialised JSON object.
we need to defile the variables themselves
We definitely need to avoid defiling things if possible, but that Freudian slip kind of leads onto the next thought I had... you can complement a JSON structure with http://json-schema.org/, as it says on the page it "Describes your existing data format(s)."
Having totally unconstrained JSON is just a big bag of keys and values which is a recipe for bugs. You can't get particularly rich types in JSON, but you can at least avoid totally arbitrary data structures.
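As a rough sketch of what that could look like (assuming the jsonschema package and a top-level layout like the dictionary sketched above; the schema here is only illustrative):

import json
from jsonschema import validate

TEMPLATE_SCHEMA = {
    "type": "object",
    "required": ["dimensions", "variables", "global_attributes"],
    "properties": {
        "dimensions": {"type": "object"},
        "variables": {"type": "object"},
        "global_attributes": {"type": "object"},
    },
}

with open('template.json') as f:
    template = json.load(f)
validate(instance=template, schema=TEMPLATE_SCHEMA)  # raises ValidationError if the structure is wrong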
The other advantage of using JSON or XML as opposed to CDL is that you could define the file format (NETCDF3, NETCDF4, etc.), as well as chunking and compression level per variable.
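For example, each per-variable entry could carry storage settings alongside the attributes, mirroring the keyword arguments of netCDF4's createVariable (illustrative keys only):

variables = {
    'TEMP': {
        'datatype': 'float32',
        'dimensions': ['TIME'],
        'zlib': True,        # compress this variable
        'complevel': 5,
        'chunksizes': [1000],
        'attributes': {'units': 'degrees_Celsius'},
    },
}
file_settings = {'format': 'NETCDF4'}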
@ggalibert, this is the exact reason why I wrote a class-wrapper for writing netCDF4.Datasets (that sit below the cdlreader).
The class, DictDataset, is initialized with three different dictionaries: dimensions, variables and global attributes. It postpones the actual netCDF Dataset creation to its create method. You can define a variable's dimensions by referencing keys of the dimensions dictionary, as well as chunking/compression per variable. It's useful because everything is within a single dictionary structure that can be reused or even summed with other DictDatasets (the class has an add method).
@mhidas saw it in action, and it requires only some small fixes/changes depending on how the template is to be defined:
atemplate.py => x = DictDataset(from_file=atemplate.py)
=> set correct dimension sizes and variable dimensions (x.dimensions['x'] = X) #if 'x' in template is different from X
=> convert to netCDF4 object (x.set_file(outputfile),x.create())
=> add data arrays (x.ncobj[varname][:] = var)
=> write file (x.ncobj.sync())
To support the workflow proposed above, we will create a new Python package for writing netCDF files using templates. If @ocehugo is happy to contribute his code, it could be based on his DictDataset class. At a minimum, it will need to implement the basic functionality described below.
Note that reading in or creating the data values is outside the scope of this package. The most convenient way to provide the data would be in numpy arrays or a Pandas dataframe.
Read one or more template files and return a dictionary-like template object (e.g. DictDataset) representing the file structure and attributes. The template file format will be JSON (optionally readers for other formats could be implemented).
e.g.
template = DictDataset(from_file='template.json')
Update dimensions, variables and attributes in a template object. This should be as simple as adding or updating entries in a dictionary. e.g.
template.dimensions['TIME'] = 100
template.variables['PRES']['units'] = 'dbar'
template.title = 'Test dataset'
Create a netCDF4.Dataset object and add the dimensions, variables and attributes specified in the template.
e.g.
template.create(filename) # user-specified file name
template.create() # auto-generated file name using IMOS conventions
This could actually be called automatically at the start of the add_data method, if it hasn't been explicitly called yet.
Add values from numpy arrays or a Pandas dataframe into the variables defined in the template. This is already done in several existing bits of data-services/aodndata code, but they could be slightly streamlined by offering a single function to do it. If the column names in a dataframe match the variables in the template, the code can match them up automatically; otherwise each individual data array will need to be specified separately. e.g.
template.add_data(dataframe)
template.add_data(TIME=time_values, TEMP=temp_values, PRES=pres_values)
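A rough sketch of the dataframe matching, assuming the netCDF4.Dataset has already been created from the template (the helper name is just illustrative):

import numpy as np

def add_data_from_dataframe(dataset, dataframe):
    # copy each dataframe column into the netCDF variable with the same name;
    # columns with no matching variable are simply skipped here
    for name in dataframe.columns:
        if name in dataset.variables:
            dataset.variables[name][:] = np.asarray(dataframe[name])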
Close the netCDF object to finish writing the file.
The classmethod alternate constructor pattern would be a good fit to decouple the object from the source format, e.g.
template1 = DictDataset(source_dict)
template2 = DictDataset.from_json(path='template.json')
template3 = DictDataset.from_ini(path='template.ini')
Could we still add data by just doing the following?
template.variables['PRES'] = pres_values
I'm happy to change the DictDataset code I have.
I think the first thing to raise is some test cases (JSON/INI/CDL).
@mhidas, can you pull out some (or all) of the AODN templates already in use? I assume we would like to avoid rewriting them from the start (to avoid breaking things) and move them to JSON gradually.
The DictDataset at the moment accepts three dict inputs (dims, vars, global_att). It's this way because that's how my cdlreader outputs things from a valid CDL.
Some things on my mind now:
You guys think it's better to have a single input dict to rule them all!? This would match JSON...
template1 = DictDataset(source_dict)
or
template1 = DictDataset(dims=d_dims,vars=d_vars,gattr=d_gattr)
"append mode" to an already created NetCDF file is required in the the short term? This would force us to provide a template from a netCDF4.Dataset object:
template = DictDataset.from_dataset(path='file.nc')
I think delaying the creation of the Dataset until after the add_data method is a better strategy. This would allow dimension specifications to be evaluated later, among other things. Just store the data as a "value" key (validation would be at create() anyway).
backends: netCDF4 is assumed from the start. I can't see it going away anytime soon, but h5netcdf could be an option!?
You guys think it's better to have a single input dict to rule them all!? This would match JSON...
It doesn't really matter in terms of the design pattern; both would work depending on what makes sense to you guys. Ideally the regular __init__ could just take native Python dicts to construct the instance, and the from_dict/from_dataset etc. methods can then basically be wrappers which retrieve/transform things into dicts to feed into __init__, e.g.:
import json

class DictDataset(object):
    def __init__(self, d_dims, vars, gattr):
        pass

    @classmethod
    def from_json(cls, path):
        # load the JSON file into a dict
        with open(path) as f:
            template = json.load(f)
        # e.g. this could call out to JSONschema to make sure the JSON has the expected
        # high level structure, and could refuse to create the object right here if it wasn't correct
        validate_template(template)
        # instantiate using the regular __init__ method
        return cls(d_dims=template['d_dims'], vars=template['vars'], gattr=template['gattr'])
So if you want to instantiate from a Python context where you have dicts already, you just use the regular init, e.g.
d_dims = dict()
vars = dict()
gattr = dict()
my_dataset = DictDataset(d_dims, vars, gattr)
Or if you want to start from a JSON file you use the classmethod constructor and ultimately end up with the same object:
template.json
{
"d_dims": {},
"vars": {},
"gattr": {}
}
my_dataset = DictDataset.from_json(path='template.json')
In any case, this pattern totally decouples the source format from the Python object, because after initialisation it makes zero difference where the original parameters came from.
It's great for creating different flavours of pizza too: https://realpython.com/instance-class-and-static-methods-demystified/#delicious-pizza-factories-with-classmethod
template.variables['PRES'] = pres_values
Not quite. template.variables['PRES'] needs to be a dictionary, so it can store the variable's type, dimensions, and attribute values. We could replicate the netCDF4 interface, so you can do this:
template.variables['PRES'][:] = pres_values
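One way to replicate that interface before the real netCDF4 variable exists would be a small dict subclass that intercepts slice assignment and stashes the values until create() is called (just a sketch, following the "value" key idea mentioned earlier):

class TemplateVariable(dict):
    # a dict of variable attributes that also accepts netCDF4-style slice assignment
    def __setitem__(self, key, value):
        if isinstance(key, slice):
            # template.variables['PRES'][:] = pres_values
            dict.__setitem__(self, 'value', value)
        else:
            # ordinary attribute, e.g. var['units'] = 'dbar'
            dict.__setitem__(self, key, value)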
You guys think it's better to have a single input dict to rule them all!?
That's what I was thinking, but it probably doesn't really matter. I think most of the code would be reading the template from a file, so it wouldn't actually need to construct the One dict (or call the constructor with three dicts in the other case). I guess having the init() accept three separate dicts makes it easier to create a simple template that specifies e.g. only global attributes:
template = DictDataset(gattr={'title': 'test file', 'author': 'me'})
By the way, I think we should use ordered dicts, so the template can specify the order they are written to file.
"append mode" to an already created NetCDF file is required in the the short term?
I don't think that's needed for our main use case. Could be something to add later.
I think delaying the creation of the Dataset until after the add_data method is a better strategy.
Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.
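Something along those lines, perhaps (purely a sketch; it assumes the template object exposes dimensions and variables dicts, and that the dimension names are supplied by the caller):

import numpy as np

def infer_variable(template, name, values, dimensions):
    # fill in a variable's dtype and the dimension sizes from the data array itself
    values = np.asarray(values)
    var = template.variables.setdefault(name, {})
    var.setdefault('datatype', values.dtype)
    var.setdefault('dimensions', list(dimensions))
    for dim_name, size in zip(dimensions, values.shape):
        template.dimensions.setdefault(dim_name, size)
    return var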
backends
netCDF4 is all we need at this point.
The new package will be developed here: https://github.com/aodn/aodn-netcdf-tools
Just a final follow-up before discussions move over to aodn-netcdf-tools:
I just had a discussion with @mhidas regarding @lwgordonimos's suggestion:
The actual code uses a more implicit/inheritance style with parent/child classes. Given that the scope is reduced and not much is to be done by the "from_*" functions, I think the suggestion is good and will be simpler than creating a class (say JsonDataset) that inherits from others.
By the way, I think we should use ordered dicts, so the template can specify the order they are written to file.
I don't foresee any problems with that if we stick with Python 3. AFAIK dicts in Python 3 preserve insertion order. Anyway, this would be an easy change.
Yeah that could work. I guess it could then automatically set the data type and dimensions of the variable based on the array provided.
This is already implemented in a crude way: you set up the output file with A.set_output('/tmp/abc.nc') and then call A.create(). Some consistency checks are done at init time, however. At the time I was thinking not about the template itself, but about writing everything from a smaller set of calls.
Also, filling the ncobj variables is not handled, but you can do it after the create step by:
A.ncobj['PRES'][:] = pres_values
Many of our processes that create netCDF files require templates to set global (and variable) attributes. Currently this is being done in a variety of different ways, from multiple sources, often setting the same basic attributes (project, acknowledgements, etc.):
There may be others too...
There are two issues here:
It would be helpful to come up with a solution that removes, or at least minimises, both of these issues.