JuliaGeo / GRIBDatasets.jl

A high level interface to GRIB encoded files.
MIT License
12 stars 5 forks source link

Harmonize API with NCDatasets.jl #8

Closed tcarion closed 1 year ago

tcarion commented 1 year ago

see #7

This might break some part of the API:

tcarion commented 1 year ago

I'm trying to implement the CommonDataModel.jl interface for GRIBDatasets, and I'm facing one issue because of my (non-very satisfying) implementation of Dimensions.

The AbstractDataset interface expects the ds.dim field to be a Dict like structure. I made it by extending Base.getindex and Base.keys on the GRIBDatasets.Dimensions type (which actually is a NTuple of Dimension).

The problem comes when we want to iterate on this ds.dim field. Extending Base.iterate for returning the name-length pair for each dimension, I break all the code where I want to iterate on the actual objects (the Dimension's of the tuple). If I don't extend Base.iterate, it breaks the CommonDataModel.show_dim method here, as it expects to iterate on a key-value pair.

One workaround would be to iterate on pairs(d) instead in CommonDataModel. This would make ds.dim not really a Dict-like object anymore, but it would make the internal implementation of dimensions more flexible. ~Also, I don't think it would break anything for NCDatasets (I didn't check though)~.

What would be your opinion on this @Alexander-Barth ? If you're ok, I will make the PR for CommonDataModel :-)

[UPDATE] I realize this would cause an issue in the write method of NCDatasets since it iterates on ds.dim. What about creating a getter dim method in CommonDataModel that returns a proper Dictionnary? I think it is also a good practice in Julia to define getter methods for accessing structure fields.

Alexander-Barth commented 1 year ago

Indeed, I agree that it is better to work with getter and setter methods. In NetCDF, we have these objects: dimensions, variables, attributes and groups. It would be nice to have a consistent API.

We could have either dims(ds) (or singular dim(ds)) to get a dict-like object. Or dimnames(ds) (already exists in NCDatasets) returning an iterable of all dimension names and a function dim(ds,dimname) (does not already exists) returning the length of the dimensions (and defDim(ds,dimname,dimlength) for writing to a NetCDF file)

For more complicated objects like variable (following the CF convention where fill value is replace by Missing for example) it is useful to have function call cfvariable(ds,varname,other_parameters) to pass additional parameters.

Would that second approach (dimnames/dim) work for you as well?

Se we could have these functions for all NetCDF objects:

get names/key get value write / set value
Dimensions dimnames dim defDim
Attributes attribnames attrib defAttrib
Variables varnames variable defVar
Groups groupnames group defGroup

For read-only datasets, you do not need to worry about the last column. For variables variable(ds,varname) returns the unscaled raw data and cfvariable(ds,varname,...) is the scaled data (and fill value is replace by Missing). ds[varname] is the same as cfvariable(ds,varname) with the default options.

In CommonDataModel we can easily handle attributes so that we do not need to anything for attributes. Another point to discuss would maybe how to handle the data scaling (add_offset/scale_factor) and fill values. Currently I handle this in NCDatasets (but not at all in TIFFDatasets.jl). Would it make sense to move this code to CommonDataModel ?

tcarion commented 1 year ago

This common interface looks good to me! I know it's a detail, but I can't decide about the plural vs singular form for dimensions! I see two options:

  1. dims(ds) to get the dict-like object and dim(ds, dimname) for getting the length.
  2. dim(ds) to get the dict-like object and dim(ds, dimname) for getting the length (same method for both).

I have a slight preference for 1., since the distinct method names give a hint about their distinct output types. What do you think?

About add_offset/scale_factor, it is something proper to NetCDF format, right? So maybe it should stay in NCDatasets? Or we can move it to CommonDataModel and set it by default to 0/1?

Alexander-Barth commented 1 year ago

OK, lets take option 1. Do you agree that we define dimnames(ds) and dim(ds,name) and similar in GRIBDatasets and NCDatasets and dims(ds) in a generic way in CommonDataModel?

About add_offset/scale_factor, it is something proper to NetCDF format, right? So maybe it should stay in NCDatasets? Or we can move it to CommonDataModel and set it by default to 0/1?

Actually add_offset/scalefactor/FillValue are also used in GeoTiff files (but under a slightly different names). There are also used in Zarr (https://github.com/cf-convention/cf-conventions/issues/422 ) . I would propose to move this into CommonDataModel (with the default no transformation; the transformation will only be done if the attributes are defined).

tcarion commented 1 year ago

Yes that's perfect for me!

Alexander-Barth commented 1 year ago

I just updated CommonDataModel by implementing the different getter function. So we no longer assume that there are particular fields like ds.dim/ds.attrib/ds.group.

I just tried out this PR with the updated CommonDataModel and the show methods does already work :-)

julia> ds = GRIBDatasets.Dataset(filename)
Dataset: /home/abarth/.julia/dev/GRIBDatasets/src/../test/sample-data/era5-levels-members.grib
Group: /

Variables
  longitude   (120)
    Datatype:    Float64
    Dimensions:  longitude
    Attributes:

  latitude   (61)
    Datatype:    Float64
    Dimensions:  latitude
    Attributes:

  number   (10)
    Datatype:    Int64
    Dimensions:  number
    Attributes:

  valid_time   (4)
    Datatype:    Int64
    Dimensions:  valid_time
    Attributes:

  level   (2)
    Datatype:    Int64
    Dimensions:  level

  z   (120 × 61 × 10 × 4 × 2)
    Datatype:    Union{Missing, Float64}
    Dimensions:  longitude × latitude × number × valid_time × level
    Attributes:
[...]

For NCDatasets.write I (currently) assume to have access to the raw data (i.e. without elements with the value missing, but the underlying fill value instead). Is it possible to read the raw values in GRIBDatasets ? If this is too complicated, I can also handle the special case for already transformed data (with fill value) in NCDatasets.write.

tcarion commented 1 year ago

That's nice, I'm a fan of the CommonDataModel.show method! I'll just make a few fixes (e.g. the attributes don't show) and it will be perfect.

It's currently not possible to get the raw value, but it shouldn't be too complicated to implement. I'm currently making some changes in the code, I can add this feature also. You mentioned cfvariable and variable above, do you intend to define them in CommonDataModel?

Alexander-Barth commented 1 year ago

I'll just make a few fixes (e.g. the attributes don't show) and it will be perfect.

This is great ! :-)

You mentioned cfvariable and variable above, do you intend to define them in CommonDataModel?

Yes, I think that would be ideal so that I can use the same code between TIFFDatasets and NCDatasets. I will make a try on my side, to see how far it goes. In any case, you would always have the possibility to override this kind of functionality.

tcarion commented 1 year ago

I implemented CFVariable for GRIBDatasets. Meanwhile I saw you moved it to CommonDataModel, but that's ok, it will be easy to adapt it!

Alexander-Barth commented 1 year ago

Oh, yes. Sorry if I was not clear. The idea is to have it at one place centrally to avoid code duplication. I yesterday gave a try TIFFDatasets (using CFVariable) and after that the conversion GeoTIFF worked fine.

Without optimizing the translation, the runtime seem to be a similar to the gdal_translate binary:

 @time NCDatasets.write("tmp.nc",TIFFDataset(fname))
# 0.079186 seconds (933.11 k allocations: 26.873 MiB)
$ time gdal_translate -of netCDF /tmp/jl_bQTKrmdYTs output.nc 
Input file size is 256, 256
0...10...20...30...40...50...60...70...80...90...100 - done.
real    0m0.127s

Of course this does include julia's precompilation, and gdal_translate can do much much more :-)

So essentially, one needs to the define a function variable(ds,varname) which return the un-scaled (with fill value included) data, for example:

https://github.com/Alexander-Barth/TIFFDatasets.jl/blob/536d752a7ec8dd3a7d0ce4a1fbee201f66469196/src/TIFFDatasets.jl#L213

Then, this function will be called by cfvariable(ds,varname) (defined in CommonDataModel) when you sub-index the dataset:

https://github.com/Alexander-Barth/TIFFDatasets.jl/blob/536d752a7ec8dd3a7d0ce4a1fbee201f66469196/src/TIFFDatasets.jl#L262

cfvariable creates a CFVariable object wrapping a Variable object. It also inspect the attribute like _FillValue, missing_value, add_offset, scale_factor, and for time variable units and calendar to make some transformation of the variable (returning directly julia's DateTime for time variable). The attribute missing_value is a bit redundant with _FillValue but still standardized by the CF convention. (The only subtle difference it that missing_value can be a list of values while _FillValue can only be a single value). I think in your case, you have typically only fill value, right? So would would be sufficient to set the attribute _FillValue (or missing_value)

tcarion commented 1 year ago

Thanks for the explanations, I think I grabbed the idea! I'll try to implement that in GRIBDatasets tomorrow. I see only one issue ahead. My idea was that, in GRIBDatasets, the Variable has some attributes included in the original GRIB metadata (in the case the user has an interest), while CFVariable has only the attributes related to the CF conventions. In CommonDataModel, the CFVariable attrib are necessarily the same as the related Variable. Is it possible to add a kwarg in the method to have control on the CFVariable attributes? Something like:

function cfvariable(ds,
                    varname;
                    _v = variable(ds,varname),
                    attrib = _v.attrib
                     [...]
                    )
    [...]
    return CFVariable{rettype,ndims(v),typeof(v),typeof(_v.attrib),typeof(storage_attrib)}(
        v,_attrib,storage_attrib)
end

I can make to PR to CommonDataModel if you're OK :)

tcarion commented 1 year ago

It was actually pretty simple to adapt :-) That's really nice, it magically solved the problem of getting the time variable coordinate as DateTime!

tcarion commented 1 year ago

It's starting to look good!

Dataset:  .../.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib
Group: /

Dimensions
   lon = 120
   lat = 61
   isobaricInhPa = 2
   number = 10
   valid_time = 4

Variables
     (120)
    Datatype:    Float64
    Dimensions:  lon
    Attributes:
     units                = degrees_east
     long_name            = longitude
     standard_name        = longitude

...
     (2)
    Datatype:    Int64
    Dimensions:  isobaricInhPa
    Attributes:
     units                = hPa
     stored_direction     = decreasing
     long_name            = pressure
     axis                 = Z
     standard_name        = air_pressure
     positive             = down

...
     (4)
    Datatype:    DateTime
    Dimensions:  valid_time
    Attributes:
     units                = seconds since 1970-01-01T00:00:00
     calendar             = proleptic_gregorian
     long_name            = time
     standard_name        = time

     (120 × 61 × 2 × 10 × 4)
    Datatype:    Union{Missing, Float64}
    Dimensions:  lon × lat × isobaricInhPa × number × valid_time
    Attributes:
     units                = m**2 s**-2
     long_name            = Geopotential
     standard_name        = geopotential
...

I see only one minor issue. The GRIB format provides a missing value even if there is no missing in the data (which netCDF doesn't do, right?). That makes any non-coordinate variable of eltype Union{Missing, T} in every case. I don't think it's too much of a deal, but do you see any simple workaround?

Alexander-Barth commented 1 year ago

My idea was that, in GRIBDatasets, the Variable has some attributes included in the original GRIB metadata (in the case the user has an interest), while CFVariable has only the attributes related to the CF conventions. In CommonDataModel, the CFVariable attrib are necessarily the same as the related Variable.

Thank you for this proposed change. I just committed it:

https://github.com/JuliaGeo/CommonDataModel.jl/commit/a19619f8a679734da55ef9840ad8856a09788b33

(But having all attributes from the GRIB metadata and the attributes from the CF convention in the same list would actually correspond to what I typically see in most NetCDF files where we have also a mix of standard attribute and non-standard attributes :-) Also it does not prevent you to pass validation.)

Great that it recognized the time variable automatically :-)

I see only one minor issue. The GRIB format provides a missing value even if there is no missing in the data (which netCDF doesn't do, right?). That makes any non-coordinate variable of eltype Union{Missing, T} in every case. I don't think it's too much of a deal, but do you see any simple workaround?

Indeed, in netCDF, the FillValue_ attribute can be set or not. In the later case, we know that the eltype can be e.g. just Float64. If I understand correctly, for grib files you only know if the fill value is used or not after reading all the data?

tcarion commented 1 year ago

Ok, I see. The thing is that there can be lots of attributes from the GRIB file, and it reduces the readability when showing all the dataset (see below). But if you prefer to keep it the previous way, I'm OK with that (and I can also prune into the GRIB attributes that are kept because some are actually useless and repetitive).

z (120 × 61 × 2 × 10 × 4)
  Datatype:    Float64
  Dimensions:  lon × lat × isobaricInhPa × number × valid_time
  Attributes:
   jDirectionIncrementInDegrees = 3.0
   dataType             = an
   stepUnits            = 1
   jPointsAreConsecutive = 0
   name                 = Geopotential
   jScansPositively     = 0
   latitudeOfLastGridPointInDegrees = -90.0
   iScansNegatively     = 0
   numberOfPoints       = 7320
   missingValue         = 9999
   gridType             = regular_ll
   gridDefinitionDescription = Latitude/Longitude Grid
   NV                   = 0
   iDirectionIncrementInDegrees = 3.0
   typeOfLevel          = isobaricInhPa
   Nx                   = 120
   latitudeOfFirstGridPointInDegrees = 90.0
   Ny                   = 61
   shortName            = z
   stepType             = instant
   units                = m**2 s**-2
   totalNumber          = 10
   paramId              = 129
   longitudeOfFirstGridPointInDegrees = 0.0
   cfName               = geopotential
   longitudeOfLastGridPointInDegrees = 357.0
   cfVarName            = z

If we keep it this way, the attribnames and attrib methods should work on the attrib field of the cfvar instead of cfvar.var. Same for show, it shouldn't delegate to cfvar.var. I made the changes here: https://github.com/JuliaGeo/CommonDataModel.jl/pull/4

Alexander-Barth commented 1 year ago

Ah yes; there are indeed a lot of attributes :-) I would suggest to choose the approach that works best for you. Reducing the number of attributes in CFVariable, as you did, makes indeed perfect sense.

If we keep it this way, the attribnames and attrib methods should work on the attrib field of the cfvar instead of cfvar.var

Yes, I agree.

Same for show, it shouldn't delegate to cfvar.var

I just noticed the same issue this morning :-) Thanks for the PR (it is merged now).

I would like to do soon a new release of CommonDataModel. Let me know if there is anything that should be done first.

tcarion commented 1 year ago

Please go ahead, everything seems to work from my side! I'll merge this to main as soon as the new release of CommonDataModel is ready

Alexander-Barth commented 1 year ago

I see only one minor issue. The GRIB format provides a missing value even if there is no missing in the data (which netCDF doesn't do, right?). That makes any non-coordinate variable of eltype Union{Missing, T} in every case. I don't think it's too much of a deal, but do you see any simple workaround?

A possibility for the user would be to ignore the fill value by using cfvariable(ds,varname,fillvalue = nothing) (instead of ds[varname])

tcarion commented 1 year ago

OK, I will have to mention that in the docs somewhere, thank you!