Data which is averaged over some area should have cell methods, and not just rely on names

The following code finds all the files with a standard_name of surface_temperature and a long_name of OPEN SEA SURFACE TEMP AFTER TIMESTEP. (This is from the one-file-per-field output, but I don't think that's relevant to the problem.)

$files = index['surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP']
$print(files)
['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc', 
   '1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc', 
   '1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc', 
   '1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc', 
   '1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc', 
   '3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc']
$flds = index.get_fields('surface_temperature:OPEN SEA SURFACE TEMP AFTER TIMESTEP',
    'u-cn134-1fpf/19500101T0000Z/')

The get_fields call then aggregates those files into the two CF fields that are really in play:

$print(flds)
[<CF Field: surface_temperature(time(241), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]

These two fields are:

Field: surface_temperature (ncvar%m01s00i507_2)
-----------------------------------------------
Data            : surface_temperature(time(8), latitude(324), longitude(432)) K
Cell methods    : time(8): point within days time(8): mean over days
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(8)) = [1950-01-16 00:00:00, ..., 1950-01-16 21:00:00] 360_day

Field: surface_temperature (ncvar%m01s00i507_10)
------------------------------------------------
Data            : surface_temperature(time(241), latitude(324), longitude(432)) K
Cell methods    : time(241): mean (interval: 900 s)
Dimension coords: latitude(324) = [-89.72222137451172, ..., 89.72222137451172] degrees_north
                : longitude(432) = [0.4166666567325592, ..., 359.5833435058594] degrees_east
Auxiliary coords: time(time(241)) = [1950-01-01 01:30:00, ..., 1950-01-30 22:30:00] 360_day

In both cases there should also be a cell method describing the area averaging, conforming to the relevant part of the CF conventions (which express this in the form "area: mean where <area type>"). All long names should be checked for such averaging and the appropriate cell methods used.

STASH_to_CF.txt might already provide this, or could be modified to do so. The entries around m01s00i507 are:

0!506!surface_temperature!K!where_land!
0!507!surface_temperature!K!where_open_sea!
0!508!surface_temperature!K!where_sea_ice!
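
As a rough sketch of how the qualifier in these records could be picked up programmatically, the snippet below splits the three lines shown above on "!". It only handles the five fields visible here; the function name parse_stash_line and the dictionary keys are made up for illustration, and the real STASH_to_CF.txt may carry further columns. Turning a qualifier such as where_open_sea into a valid CF area type is a separate step that would need checking against the CF area type table.

```python
def parse_stash_line(line):
    """Split one '!'-delimited record into named parts (hypothetical helper)."""
    # Only the first five fields shown in the issue are used here.
    section, item, standard_name, units, qualifier = line.split("!")[:5]
    return {
        # Rebuild the STASH code in the m01sSSiIII form used in the file names.
        "stash": f"m01s{int(section):02d}i{int(item):03d}",
        "standard_name": standard_name,
        "units": units,
        "qualifier": qualifier,
    }


lines = [
    "0!506!surface_temperature!K!where_land!",
    "0!507!surface_temperature!K!where_open_sea!",
    "0!508!surface_temperature!K!where_sea_ice!",
]

by_stash = {rec["stash"]: rec for rec in map(parse_stash_line, lines)}
print(by_stash["m01s00i507"]["qualifier"])
```
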

Hi - this is a feature! It's because the time coordinates are encoded as auxiliary coordinates, rather than dimension coordinates.

The solution is to fix the netCDF files, which is already on my list :)

The aggregation rules are different for axes without dimension coordinates, because the dimension coordinates have more restrictions (e.g. dimension coordinates must be strictly monotonically [in|de]creasing).
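
To make the monotonicity restriction concrete, here is a small stand-alone check. It is not cf-python's own implementation, just an illustration of the rule; the function name and the example hour values are assumptions chosen for the sketch.

```python
def is_strictly_monotonic(values):
    """Return True if values are strictly increasing or strictly decreasing."""
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    return increasing or decreasing


# Hours of the day in sorted order: acceptable as a dimension coordinate.
sorted_hours = [0, 3, 6, 9, 12, 15, 18, 21]

# The same hours in the order the files happen to be listed: not acceptable
# as a dimension coordinate, although an auxiliary coordinate could hold them.
listed_hours = [0, 12, 15, 18, 21, 3, 6, 9]

print(is_strictly_monotonic(sorted_hours), is_strictly_monotonic(listed_hours))
```
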

When I move the time coordinates from auxiliary coordinates to dimension coordinates:

>>> import cf
>>> f = cf.read(['1m_0h__m01s00i507_2_195001-195001.nc', '1m_12h__m01s00i507_6_195001-195001.nc',
...       '1m_15h__m01s00i507_7_195001-195001.nc', '1m_18h__m01s00i507_8_195001-195001.nc',
...       '1m_21h__m01s00i507_9_195001-195001.nc', '1m_3h__m01s00i507_3_195001-195001.nc',
...       '1m_6h__m01s00i507_4_195001-195001.nc', '1m_9h__m01s00i507_5_195001-195001.nc',
...       '1m__m01s00i507_195001-195001.nc', '3h__m01s00i507_10_19500101-19500110.nc',
...       '3h__m01s00i507_10_19500111-19500120.nc', '3h__m01s00i507_10_19500121-19500130.nc'],
...      aggregate=False)
...
>>> for i in f:
...     axis_t = i.domain_axis('T', key=True)
...     aux_t = i.del_construct('T')
...     i.set_construct(cf.DimensionCoordinate(source=aux_t), axes=axis_t)
...

and then aggregate, I get the expected result:

>>> cf.aggregate(f, verbose=2)
Unaggregatable 'surface_temperature' fields have been output: 'time' dimension coordinate ranges overlap: [869400.0, 1722600.0], [1296000.0, 1296000.0]
[<CF Field: surface_temperature(time(1), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(80), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(80), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(80), latitude(324), longitude(432)) K>,
 <CF Field: surface_temperature(time(8), latitude(324), longitude(432)) K>]

These are perhaps not the three fields you might have imagined: when faced with an unresolvable ambiguity, cf.aggregate can't do anything along that axis. The ambiguity here is that one field (the monthly mean) has a time range that overlaps those of the shorter-period fields.
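
The overlap reported in the message above can be reproduced with a simple interval test. This is only a sketch of the idea, not how cf-python decides it internally; treating touching end points as an overlap is an assumption made here.

```python
def ranges_overlap(a, b):
    """Return True if the closed intervals a and b share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = sorted(a), sorted(b)
    return a_lo <= b_hi and b_lo <= a_hi


# The two time ranges quoted in the aggregation message.
short_period = (869400.0, 1722600.0)
monthly_mean = (1296000.0, 1296000.0)

print(ranges_overlap(short_period, monthly_mean))
```

Because the single monthly-mean time falls inside the range of one short-period field, there is no unambiguous way to order the two along the time axis.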

The next solution is to not include the monthly mean in the same aggregation command.

I've just remembered that you can use the equal keyword of cf.aggregate to separate fields by property. E.g. to group fields which have common values of their interval_write and interval_operation properties you could do:

>>> f = cf.read('*.nc', aggregate={'equal': ['interval_write', 'interval_operation']})

This will then correctly aggregate the monthly means and daily means in one aggregation call.
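
The effect of grouping on those properties can be pictured with plain dictionaries standing in for fields. This only mimics the partitioning step; the property values and ids below are invented for illustration and are not taken from the actual files.

```python
from collections import defaultdict


def group_by_properties(fields, names):
    """Partition field-like dicts by the values of the named properties."""
    groups = defaultdict(list)
    for field in fields:
        key = tuple(field.get(name) for name in names)
        groups[key].append(field["id"])
    return dict(groups)


# Hypothetical stand-ins for fields; the property values are assumptions.
fields = [
    {"id": "3h_a", "interval_write": "3 h", "interval_operation": "900 s"},
    {"id": "3h_b", "interval_write": "3 h", "interval_operation": "900 s"},
    {"id": "1m", "interval_write": "1 month", "interval_operation": "3 h"},
]

groups = group_by_properties(fields, ["interval_write", "interval_operation"])
for key, ids in groups.items():
    print(key, ids)
```

Only fields that land in the same group are candidates for being aggregated together, so the monthly mean no longer competes with the short-period fields along the time axis.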

That said, there is a trivial little bug in the code at the moment that stops this working - but I've fixed it for the next release of cf-python (i.e. very soon).
