Closed tcarion closed 1 year ago
I'm trying to implement the CommonDataModel.jl interface for GRIBDatasets, and I'm facing one issue because of my (non-very satisfying) implementation of Dimensions
.
The AbstractDataset
interface expects the ds.dim
field to be a Dict like structure. I made it by extending Base.getindex
and Base.keys
on the GRIBDatasets.Dimensions
type (which actually is a NTuple
of Dimension
).
The problem comes when we want to iterate on this ds.dim
field. Extending Base.iterate
for returning the name-length pair for each dimension, I break all the code where I want to iterate on the actual objects (the Dimension
's of the tuple). If I don't extend Base.iterate
, it breaks the CommonDataModel.show_dim
method here, as it expects to iterate on a key-value pair.
One workaround would be to iterate on pairs(d)
instead in CommonDataModel. This would make ds.dim
not really a Dict-like object anymore, but it would make the internal implementation of dimensions more flexible. ~Also, I don't think it would break anything for NCDatasets (I didn't check though)~.
What would be your opinion on this @Alexander-Barth ? If you're ok, I will make the PR for CommonDataModel :-)
[UPDATE]
I realize this would cause an issue in the write
method of NCDatasets since it iterates on ds.dim
. What about creating a getter dim
method in CommonDataModel that returns a proper Dictionnary? I think it is also a good practice in Julia to define getter methods for accessing structure fields.
Indeed, I agree that it is better to work with getter and setter methods. In NetCDF, we have these objects: dimensions, variables, attributes and groups. It would be nice to have a consistent API.
We could have either dims(ds)
(or singular dim(ds)
) to get a dict-like object. Or dimnames(ds)
(already exists in NCDatasets) returning an iterable of all dimension names and a function dim(ds,dimname)
(does not already exists) returning the length of the dimensions (and defDim(ds,dimname,dimlength)
for writing to a NetCDF file)
For more complicated objects like variable (following the CF convention where fill value is replace by Missing for example) it is useful to have function call cfvariable(ds,varname,other_parameters)
to pass additional parameters.
Would that second approach (dimnames/dim) work for you as well?
Se we could have these functions for all NetCDF objects:
get names/key | get value | write / set value | |
---|---|---|---|
Dimensions | dimnames |
dim |
defDim |
Attributes | attribnames |
attrib |
defAttrib |
Variables | varnames |
variable |
defVar |
Groups | groupnames |
group |
defGroup |
For read-only datasets, you do not need to worry about the last column.
For variables variable(ds,varname)
returns the unscaled raw data and cfvariable(ds,varname,...)
is the scaled data (and fill value is replace by Missing). ds[varname]
is the same as cfvariable(ds,varname)
with the default options.
In CommonDataModel we can easily handle attributes so that we do not need to anything for attributes. Another point to discuss would maybe how to handle the data scaling (add_offset/scale_factor) and fill values. Currently I handle this in NCDatasets (but not at all in TIFFDatasets.jl). Would it make sense to move this code to CommonDataModel ?
This common interface looks good to me! I know it's a detail, but I can't decide about the plural vs singular form for dimensions! I see two options:
dims(ds)
to get the dict-like object and dim(ds, dimname)
for getting the length.dim(ds)
to get the dict-like object and dim(ds, dimname)
for getting the length (same method for both).I have a slight preference for 1., since the distinct method names give a hint about their distinct output types. What do you think?
About add_offset/scale_factor, it is something proper to NetCDF format, right? So maybe it should stay in NCDatasets? Or we can move it to CommonDataModel and set it by default to 0/1?
OK, lets take option 1.
Do you agree that we define dimnames(ds)
and dim(ds,name)
and similar in GRIBDatasets and NCDatasets and dims(ds)
in a generic way in CommonDataModel?
About add_offset/scale_factor, it is something proper to NetCDF format, right? So maybe it should stay in NCDatasets? Or we can move it to CommonDataModel and set it by default to 0/1?
Actually add_offset/scalefactor/FillValue are also used in GeoTiff files (but under a slightly different names). There are also used in Zarr (https://github.com/cf-convention/cf-conventions/issues/422 ) . I would propose to move this into CommonDataModel (with the default no transformation; the transformation will only be done if the attributes are defined).
Yes that's perfect for me!
I just updated CommonDataModel by implementing the different getter function. So we no longer assume that there are particular fields like ds.dim
/ds.attrib
/ds.group
.
I just tried out this PR with the updated CommonDataModel and the show methods does already work :-)
julia> ds = GRIBDatasets.Dataset(filename)
Dataset: /home/abarth/.julia/dev/GRIBDatasets/src/../test/sample-data/era5-levels-members.grib
Group: /
Variables
longitude (120)
Datatype: Float64
Dimensions: longitude
Attributes:
latitude (61)
Datatype: Float64
Dimensions: latitude
Attributes:
number (10)
Datatype: Int64
Dimensions: number
Attributes:
valid_time (4)
Datatype: Int64
Dimensions: valid_time
Attributes:
level (2)
Datatype: Int64
Dimensions: level
z (120 × 61 × 10 × 4 × 2)
Datatype: Union{Missing, Float64}
Dimensions: longitude × latitude × number × valid_time × level
Attributes:
[...]
For NCDatasets.write
I (currently) assume to have access to the raw data (i.e. without elements with the value missing, but the underlying fill value instead). Is it possible to read the raw values in GRIBDatasets ? If this is too complicated, I can also handle the special case for already transformed data (with fill value) in NCDatasets.write
.
That's nice, I'm a fan of the CommonDataModel.show
method!
I'll just make a few fixes (e.g. the attributes don't show) and it will be perfect.
It's currently not possible to get the raw value, but it shouldn't be too complicated to implement. I'm currently making some changes in the code, I can add this feature also.
You mentioned cfvariable
and variable
above, do you intend to define them in CommonDataModel?
I'll just make a few fixes (e.g. the attributes don't show) and it will be perfect.
This is great ! :-)
You mentioned cfvariable and variable above, do you intend to define them in CommonDataModel?
Yes, I think that would be ideal so that I can use the same code between TIFFDatasets and NCDatasets. I will make a try on my side, to see how far it goes. In any case, you would always have the possibility to override this kind of functionality.
I implemented CFVariable for GRIBDatasets. Meanwhile I saw you moved it to CommonDataModel, but that's ok, it will be easy to adapt it!
Oh, yes. Sorry if I was not clear. The idea is to have it at one place centrally to avoid code duplication. I yesterday gave a try TIFFDatasets (using CFVariable) and after that the conversion GeoTIFF worked fine.
Without optimizing the translation, the runtime seem to be a similar to the gdal_translate
binary:
@time NCDatasets.write("tmp.nc",TIFFDataset(fname))
# 0.079186 seconds (933.11 k allocations: 26.873 MiB)
$ time gdal_translate -of netCDF /tmp/jl_bQTKrmdYTs output.nc
Input file size is 256, 256
0...10...20...30...40...50...60...70...80...90...100 - done.
real 0m0.127s
Of course this does include julia's precompilation, and gdal_translate
can do much much more :-)
So essentially, one needs to the define a function variable(ds,varname)
which return the un-scaled (with fill value included) data, for example:
Then, this function will be called by cfvariable(ds,varname)
(defined in CommonDataModel) when you sub-index the dataset:
cfvariable
creates a CFVariable object wrapping a Variable object. It also inspect the attribute like _FillValue
, missing_value
, add_offset
, scale_factor
, and for time variable units
and calendar
to make some transformation of the variable (returning directly julia's DateTime for time variable). The attribute missing_value
is a bit redundant with _FillValue
but still standardized by the CF convention. (The only subtle difference it that missing_value
can be a list of values while _FillValue
can only be a single value). I think in your case, you have typically only fill value, right? So would would be sufficient to set the attribute _FillValue
(or missing_value
)
Thanks for the explanations, I think I grabbed the idea! I'll try to implement that in GRIBDatasets tomorrow. I see only one issue ahead.
My idea was that, in GRIBDatasets, the Variable
has some attributes included in the original GRIB metadata (in the case the user has an interest), while CFVariable
has only the attributes related to the CF conventions. In CommonDataModel, the CFVariable
attrib are necessarily the same as the related Variable
. Is it possible to add a kwarg in the method to have control on the CFVariable
attributes? Something like:
function cfvariable(ds,
varname;
_v = variable(ds,varname),
attrib = _v.attrib
[...]
)
[...]
return CFVariable{rettype,ndims(v),typeof(v),typeof(_v.attrib),typeof(storage_attrib)}(
v,_attrib,storage_attrib)
end
I can make to PR to CommonDataModel if you're OK :)
It was actually pretty simple to adapt :-) That's really nice, it magically solved the problem of getting the time variable coordinate as DateTime!
It's starting to look good!
Dataset: .../.julia/dev/GRIBDatasets/test/sample-data/era5-levels-members.grib
Group: /
Dimensions
lon = 120
lat = 61
isobaricInhPa = 2
number = 10
valid_time = 4
Variables
(120)
Datatype: Float64
Dimensions: lon
Attributes:
units = degrees_east
long_name = longitude
standard_name = longitude
...
(2)
Datatype: Int64
Dimensions: isobaricInhPa
Attributes:
units = hPa
stored_direction = decreasing
long_name = pressure
axis = Z
standard_name = air_pressure
positive = down
...
(4)
Datatype: DateTime
Dimensions: valid_time
Attributes:
units = seconds since 1970-01-01T00:00:00
calendar = proleptic_gregorian
long_name = time
standard_name = time
(120 × 61 × 2 × 10 × 4)
Datatype: Union{Missing, Float64}
Dimensions: lon × lat × isobaricInhPa × number × valid_time
Attributes:
units = m**2 s**-2
long_name = Geopotential
standard_name = geopotential
...
I see only one minor issue. The GRIB format provides a missing value even if there is no missing in the data (which netCDF doesn't do, right?). That makes any non-coordinate variable of eltype
Union{Missing, T}
in every case. I don't think it's too much of a deal, but do you see any simple workaround?
My idea was that, in GRIBDatasets, the
Variable
has some attributes included in the original GRIB metadata (in the case the user has an interest), whileCFVariable
has only the attributes related to the CF conventions. In CommonDataModel, theCFVariable
attrib are necessarily the same as the relatedVariable
.
Thank you for this proposed change. I just committed it:
https://github.com/JuliaGeo/CommonDataModel.jl/commit/a19619f8a679734da55ef9840ad8856a09788b33
(But having all attributes from the GRIB metadata and the attributes from the CF convention in the same list would actually correspond to what I typically see in most NetCDF files where we have also a mix of standard attribute and non-standard attributes :-) Also it does not prevent you to pass validation.)
Great that it recognized the time variable automatically :-)
I see only one minor issue. The GRIB format provides a missing value even if there is no missing in the data (which netCDF doesn't do, right?). That makes any non-coordinate variable of eltype Union{Missing, T} in every case. I don't think it's too much of a deal, but do you see any simple workaround?
Indeed, in netCDF, the FillValue_ attribute can be set or not. In the later case, we know that the eltype
can be e.g. just Float64
. If I understand correctly, for grib files you only know if the fill value is used or not after reading all the data?
Ok, I see. The thing is that there can be lots of attributes from the GRIB file, and it reduces the readability when showing all the dataset (see below). But if you prefer to keep it the previous way, I'm OK with that (and I can also prune into the GRIB attributes that are kept because some are actually useless and repetitive).
z (120 × 61 × 2 × 10 × 4)
Datatype: Float64
Dimensions: lon × lat × isobaricInhPa × number × valid_time
Attributes:
jDirectionIncrementInDegrees = 3.0
dataType = an
stepUnits = 1
jPointsAreConsecutive = 0
name = Geopotential
jScansPositively = 0
latitudeOfLastGridPointInDegrees = -90.0
iScansNegatively = 0
numberOfPoints = 7320
missingValue = 9999
gridType = regular_ll
gridDefinitionDescription = Latitude/Longitude Grid
NV = 0
iDirectionIncrementInDegrees = 3.0
typeOfLevel = isobaricInhPa
Nx = 120
latitudeOfFirstGridPointInDegrees = 90.0
Ny = 61
shortName = z
stepType = instant
units = m**2 s**-2
totalNumber = 10
paramId = 129
longitudeOfFirstGridPointInDegrees = 0.0
cfName = geopotential
longitudeOfLastGridPointInDegrees = 357.0
cfVarName = z
If we keep it this way, the attribnames
and attrib
methods should work on the attrib
field of the cfvar
instead of cfvar.var
. Same for show
, it shouldn't delegate to cfvar.var
. I made the changes here: https://github.com/JuliaGeo/CommonDataModel.jl/pull/4
Ah yes; there are indeed a lot of attributes :-) I would suggest to choose the approach that works best for you. Reducing the number of attributes in CFVariable, as you did, makes indeed perfect sense.
If we keep it this way, the attribnames and attrib methods should work on the attrib field of the cfvar instead of cfvar.var
Yes, I agree.
Same for show, it shouldn't delegate to cfvar.var
I just noticed the same issue this morning :-) Thanks for the PR (it is merged now).
I would like to do soon a new release of CommonDataModel. Let me know if there is anything that should be done first.
Please go ahead, everything seems to work from my side! I'll merge this to main as soon as the new release of CommonDataModel is ready
I see only one minor issue. The GRIB format provides a missing value even if there is no missing in the data (which netCDF doesn't do, right?). That makes any non-coordinate variable of eltype Union{Missing, T} in every case. I don't think it's too much of a deal, but do you see any simple workaround?
A possibility for the user would be to ignore the fill value by using cfvariable(ds,varname,fillvalue = nothing)
(instead of ds[varname]
)
OK, I will have to mention that in the docs somewhere, thank you!
see #7
This might break some part of the API:
dims
inGRIBDataset
's andVariable
's was renamed todim
ds.dim[dimname]
no longer return theDimension
object, but the length of the dimension, as in NCDatasets.jl