NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library
http://ncas-cms.github.io/cf-python
MIT License

Unexpected behaviour comparing two views of the same data: pp and netcdf #775

Open bnlawrence opened 3 months ago

bnlawrence commented 3 months ago

Use case: I read some PP data and look at what I have. I then write the same data out to netCDF and read it back in. I expect the two lists of CF fields to look identical, but they do not.

ff = cf.read('myfile.pp')
ff
[<CF Field: geopotential_height(time(40), air_pressure(9), latitude(1921), longitude(2560)) m>,
 <CF Field: id%UM_m01s30i301_vn1106(time(40), air_pressure(6), latitude(1921), longitude(2560))>,
 <CF Field: id%UM_m01s30i407_vn1106(time(40), latitude(1920), longitude(2560))>,
 <CF Field: id%UM_m01s30i408_vn1106(time(40), latitude(1920), longitude(2560))>]

Compare with the same operation after writing that list of fields out to a netCDF file:

ff=cf.read('myfile.nc')
ff
[<CF Field: geopotential_height(time(40), air_pressure(9), latitude(1921), longitude(2560)) m>,
 <CF Field: long_name=HEAVYSIDE FN ON P LEV/UV GRID(time(40), air_pressure(6), latitude(1921), longitude(2560))>,
 <CF Field: long_name=TOTAL MOISTURE FLUX U  RHO GRID(time(40), latitude(1920), longitude(2560))>,
 <CF Field: long_name=TOTAL MOISTURE FLUX V  RHO GRID(time(40), latitude(1920), longitude(2560))>]

This is cf.__version__ = 3.16.2

From my point of view the file format should not affect the logical view of the contents. I understand there may be some historical reasons for this behaviour, but maybe they should be reviewed.

davidhassell commented 3 months ago

Hi Bryan,

The CF logical contents of the Fields are the same (the Fields read from PP do have long_names) - only the repr view differs.

A bit of context here - when you read from PP files, the Fields have their id attribute set (https://ncas-cms.github.io/cf-python/attribute/cf.Field.id.html). This is because we need to unambiguously define the PP fields for aggregation, and also because not all PP fields will have a standard_name or long_name, so in the absence of netCDF variable names we need a mechanism to unambiguously identify them. The repr function has a hierarchy of identities from which it chooses what to display. This hierarchy goes standard_name, id, long_name, netCDF variable name. The first of these to be set gets displayed, so in the PP case, when there is no standard_name, the id gets shown, because it is definitive (unlike the general long_name). When reading from netCDF files, no id is set because there is no obvious value to set it to, and no use for it.
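The hierarchy described above can be sketched in plain Python. This is an illustration only, not cf-python's actual implementation: the function name `repr_identity` and its parameters are hypothetical, the `id%` and `long_name=` prefixes are copied from the output shown earlier in this thread, and the `ncvar%` prefix is an assumption.

```python
def repr_identity(standard_name=None, field_id=None, long_name=None, ncvar=None):
    """Return the first identity that is set, in the order
    standard_name, id, long_name, netCDF variable name.

    Hypothetical helper for illustration; not part of cf-python.
    """
    if standard_name is not None:
        return standard_name
    if field_id is not None:
        return f"id%{field_id}"
    if long_name is not None:
        return f"long_name={long_name}"
    if ncvar is not None:
        # Assumed prefix for the netCDF variable name fallback
        return f"ncvar%{ncvar}"
    return ""


# Field read from PP: no standard_name, but both id and long_name are set
print(repr_identity(field_id="UM_m01s30i301_vn1106",
                    long_name="HEAVYSIDE FN ON P LEV/UV GRID"))

# The same field read back from netCDF: no id, so long_name is shown
print(repr_identity(long_name="HEAVYSIDE FN ON P LEV/UV GRID"))
```

Under these assumptions the two calls give different strings for the same logical content, which matches the difference reported at the top of the thread.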

Options:

  1. Do nothing.
  2. Remove the id attribute from fields read from PP (after aggregation).
  3. Change the repr identity hierarchy.

All three have pros and cons - I suspect one size does not fit all, here :)

bnlawrence commented 3 months ago

OK, that makes sense, but in cf-python we have got the same logical content in both cases, so I expected to see the same thing. How horrible do you think the outcome would be if you went with option 3? I guess I'm wondering about the pros and cons (maybe this is a cf version 4 change, if at all).

sadielbartholomew commented 3 months ago

Thanks for the clarification of the context, David. I for one (two) was not aware of that.

> All three have pros and cons - I suspect one size does not fit all, here :)

Indeed, I think the best solution would be to make it configurable, so that a user such as Bryan who wants identical representations of the contents can get that, while others can choose to keep the id attribute on fields read from PP, as appropriate in each case.

So for me the decision is how to best support that with the API, and what the default behaviour would be, assuming we're happy to do the work to enable it.
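As a rough illustration of the opt-in variant of option 2, a user-side helper could drop the id attribute after reading. This is a sketch only: `strip_ids` is a hypothetical name, the stand-in objects below are not cf Fields, and it assumes that `id` can be deleted like an ordinary attribute, which should be checked against the cf.Field.id documentation linked earlier.

```python
from types import SimpleNamespace


def strip_ids(fields):
    """Delete the `id` attribute from each field-like object that has one.

    Hypothetical helper for illustration; not part of cf-python.
    """
    for f in fields:
        if hasattr(f, "id"):
            del f.id
    return fields


# Stand-in for a field read from PP (a real cf.Field is not used here)
pp_like = SimpleNamespace(id="UM_m01s30i407_vn1106",
                          long_name="TOTAL MOISTURE FLUX U  RHO GRID")

strip_ids([pp_like])
print(hasattr(pp_like, "id"))  # the id is gone; long_name is untouched
```

With the id removed, an identity hierarchy like the one David describes would fall through to long_name, so fields read from PP and from netCDF would be displayed the same way. Whether this should be the default, a `cf.read` option, or a separate call is the API question raised above.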