Use of "where" in cell_methods

taylor13 commented 2 years ago

What are the correct cell_methods specifications for the following four cases for characterizing the "age_of_sea_ice" [Let E represent age, A the grid cell area, and s the fraction of the area covered by sea ice. Let i be the grid-cell index and n be the time-sample index for N samples.]:

1. We want to compute an area-mean (over several grid cells) of the sea ice age in a grid cell, weighted by area covered by sea ice:

   sum(over cells) [ s_i * A_i * E_i ] / sum(over cells) [ s_i * A_i ]

   Should cell_methods be "area: mean where sea ice"? Is the weighting by sea-ice area (s_i * A_i) assumed or does a comment need to be included?

2. We want to compute a time-mean of the sea ice age in a single grid cell, weighted equally across all time-samples:

   sum(over time samples) [ delta_n * E_n ] / sum(over time samples) [ delta_n ]

   where delta_n is a function set to 1 if s_n > 0 and set to 0 if s_n = 0.

   Should cell_methods be "time: mean where sea ice"? How should omission of the sea-ice free samples be indicated? Is the "where" directive reserved for use only for spatial dimensions?

3. We want to compute a time-mean of the sea ice age in a single grid cell, weighted by the area of sea ice found in that grid cell at each time sampled:

   sum(over time samples) [ s_n * E_n ] / sum(over time samples) [ s_n ]

   Should cell_methods be "time: mean where sea ice"? Is the weighting by sea-ice area (s_n) assumed or does a comment need to be included? Is the "where" directive reserved for use only for spatial dimensions?

4. We want to compute a time-mean of the area-means (computed in 1 above), weighting each area-mean by the area occupied by sea ice in that area:

   sum (over i & n) [ s_i,n * A_i * E_i,n ] / sum (over i & n) [ s_i,n * A_i ]

   Should cell_methods be "area: time: mean where sea ice" or "area: mean where sea ice time: mean" or "area: mean where sea ice time: mean where sea ice" or "area: mean time: mean where sea ice" or what? Is the weighting by sea-ice area (s_i,n * A_i) assumed or does a comment need to be included?
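To make the four formulas concrete, here is a minimal Python sketch. The sample data (two grid cells, three time samples) and the function names are hypothetical and chosen only for illustration; a placeholder age of 0.0 is stored where there is no sea ice, and a mean is reported as missing (None) when its weights sum to zero.

```python
# Hypothetical sample data: 2 grid cells (index i), 3 time samples (index n).
A = [100.0, 200.0]            # cell areas A_i
s = [[0.5, 0.0, 1.0],         # sea-ice fraction s_i,n for cell 0
     [0.25, 0.5, 0.0]]        # sea-ice fraction s_i,n for cell 1
E = [[2.0, 0.0, 4.0],         # sea-ice age E_i,n (placeholder 0.0 where no ice)
     [1.0, 3.0, 0.0]]
N = 3                         # number of time samples


def weighted_mean(values, weights):
    """Return sum(w * v) / sum(w), or None (missing) if the weights sum to zero."""
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * v for w, v in zip(weights, values)) / total


def case1(n):
    """Area-mean at time sample n, weighted by sea-ice area s_i * A_i."""
    cells = range(len(A))
    return weighted_mean([E[i][n] for i in cells],
                         [s[i][n] * A[i] for i in cells])


def case2(i):
    """Time-mean in cell i, with equal weight for every sample that has s_n > 0."""
    delta = [1.0 if s_n > 0 else 0.0 for s_n in s[i]]
    return weighted_mean(E[i], delta)


def case3(i):
    """Time-mean in cell i, weighted by the sea-ice fraction s_n."""
    return weighted_mean(E[i], s[i])


def case4():
    """Joint area-time mean, weighted by s_i,n * A_i."""
    index = [(i, n) for i in range(len(A)) for n in range(N)]
    return weighted_mean([E[i][n] for i, n in index],
                         [s[i][n] * A[i] for i, n in index])
```

With this data, cases 2 and 3 give different answers for the same cell, which is why the weighting needs to be recorded somewhere.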

davidhassell commented 2 years ago

Hi Karl,

Interesting examples! In general, I think that non-standardised comments are all we currently have at our disposal for cases such as these, but I may have missed a trick.

In that light, here are some suggestions for your four cases (but I make no claim that these are the best options):

In Case 1, as far as I understand it, whether or not a calculation (such as a mean) was weighted is, by default, unspecified. Even though using "where" might suggest that each contributing element represents a different area, this is true in general for cells taken in their entirety. So to indicate that weights were used we would indeed need "area: mean where sea ice (weighted by sea-ice area)", or any other suitably informative text in the brackets.

In case 2, I think that "where" refers to a portion of the grid cell defined by the "name", and "name" has to refer to spatially defined cells (because the only valid "where"s are area_types). So I wonder if the best we can currently do is another comment: "time: mean (calculated only for times with some sea ice)"

In case 3, similarly to 2: "time: mean (weighted by sea ice area at each time)"

In case 4, "area: mean where sea ice (weighted by sea-ice area) time: mean"

The obvious question is "Do we want/need a more standardised feature to express information about weights?". Thinking ...

taylor13 commented 2 years ago

Thanks @davidhassell . I think your parenthetical descriptions make it clear how the means have been calculated except in case 4. To specify how the time-means are weighted (as you did in distinguishing case 3 from case 2), I think one would need: "area: mean where sea_ice (weighted by sea-ice area) time: mean (weighted by sea-ice area at each time)" or perhaps "area: time: mean where sea_ice (weighted by sea-ice area)"

Regarding your last remark, it might be worth thinking about adding a construct similar to but more specific than "where areatype". Perhaps something like:

           <name>: <method> weighted_by <standard_name>

For example, for a variable E (and notation as in https://github.com/cf-convention/discuss/issues/173#issue-1340910247):

area: mean weighted_by sea_ice_area

would be calculated using the formula:

sum(over cells) [ s_i * A_i * E_i ] / sum(over cells) [ s_i * A_i ]

or

area: time: mean weighted_by surface_snow_amount

would be calculated using the formula:

sum (over i & n) [ a_i,n * A_i * E_i,n ] / sum (over i & n) [ a_i,n * A_i ]

where a_i,n is the amount of snow (mass per unit area) found in grid cell i and time sample n.

Not all standard_names that would be commonly needed exist (e.g., land_area is not currently a valid standard_name). So as not to add too many standard names, perhaps the following would be better:

         cell_methods = "<name>: <method> <method of weighting> <type>"

where <type> would serve a purpose similar to the area_type appearing as part of a "where" directive in a cell_methods, and <method of weighting> would be replaced with any one of the following:

area_weighted_by
volume_weighted_by
mass_weighted_by
amount_weighted_by
thickness_weighted_by
weighted_by

For the last option listed (stand-alone weighted_by), <type> would be required to be a standard_name. For all other options <type> could have any value allowed for by the "type" that is part of a "where" directive (see CF conventions section 7.3.3).

The following examples illustrate the variety of weighting accommodated by this more general approach:

area: mean area_weighted_by land
area: time: mean weighted_by atmosphere_absorption_optical_thickness_due_to_ambient_aerosol_particles might be applied to a variable with standard name asymmetry_factor_of_ambient_aerosol_particles.
area: mean weighted_by sea_ice_area and area: mean area_weighted_by sea_ice would be equivalent
area: time: mean amount_weighted_by snow and area: time: mean weighted_by surface_snow_amount would be equivalent
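As a rough sketch of how a reader might split such a string into its parts, the snippet below parses the proposed "<name>: <method> <method of weighting> <type>" form with a regular expression. The function name and the returned dictionary layout are hypothetical, and the weighting keywords are only the ones proposed in this comment; none of this is part of the CF conventions.

```python
import re

# Keywords proposed above; they are a suggestion, not existing CF syntax.
WEIGHTING_KEYWORDS = (
    "area_weighted_by", "volume_weighted_by", "mass_weighted_by",
    "amount_weighted_by", "thickness_weighted_by", "weighted_by",
)

PATTERN = re.compile(
    r"^(?P<names>(?:\w+:\s+)+)"                       # one or more "name: " prefixes
    r"(?P<method>\w+)\s+"                             # statistical method, e.g. "mean"
    r"(?P<weighting>" + "|".join(WEIGHTING_KEYWORDS) + r")\s+"
    r"(?P<type>\w+)$"                                 # area type or standard name
)


def parse_weighted_cell_method(text):
    """Split one proposed cell_methods entry into names, method, weighting and type."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError("not in the proposed weighted form: %r" % text)
    return {
        "names": re.findall(r"(\w+):", match.group("names")),
        "method": match.group("method"),
        "weighting": match.group("weighting"),
        "type": match.group("type"),
    }


parsed = parse_weighted_cell_method("area: time: mean weighted_by surface_snow_amount")
```

A real implementation would also have to check that the type is a valid area type or, for stand-alone weighted_by, a valid standard name; that check is omitted here.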

The above examples are different weightings used in producing CMIP6 variables, but the cell_methods assigned to those variables do not in some cases adequately indicate the weighting. So there is a real need to do something. Whether it needs to be done in a standardized way or through a parenthetical comment is what we should first decide.

davidhassell commented 2 years ago

Hi Karl,

I agree that for case 4, area: time: mean where sea_ice (weighted by sea-ice area) would be OK.

On the broader question, I like your ideas, and wonder if it would be good to not roll the "where <type>" into the weights description, i.e.

cell_methods = "<name>: <method> [where <type>] [weighted_by <weight_type>]"

where <weight_type> is a CV of length, area, volume, mass, amount, thickness, duration.

If where is also set then it would act as a modifier to <weight_type>. E.g. for area: mean weighted_by area the "area" refers to the whole cell; and for area: mean where sea_ice weighted_by area, the "area" refers only to the area of the cell covered with sea ice.

This seems to fit in with existing use quite nicely, and removes the need for new standard names.

E.g.:

area: mean weighted_by area: Spatial mean weighted by area of each cell: sum (over i) [A_i * E_i ] / sum (over i) [A_i]
area: mean where sea_ice weighted_by area: Spatial mean where sea_ice weighted by area of each cell's sea_ice area: sum(over i) [ s_i * A_i * E_i ] / sum(over i) [ s_i * A_i ]
area: time: mean where sea_ice weighted_by area: Temporal and spatial mean where sea_ice, weighted by area of each (T,Y,X) cell's sea_ice area: sum (over i & n) [ s_i,n * A_i * E_i,n ] / sum (over i & n) [ s_i,n * A_i ]
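The way "where" would modify "weighted_by area" can be sketched as follows. This is only an illustration of the proposal: the function names and the sample numbers are hypothetical.

```python
def cell_weights(areas, where_fraction=None):
    """Weights for 'weighted_by area'.

    Without 'where', the weight is the whole cell area A_i.
    With 'where <type>', it is the part of the cell covered by that type, s_i * A_i.
    """
    if where_fraction is None:
        return list(areas)
    return [f * a for f, a in zip(where_fraction, areas)]


def weighted_mean(values, weights):
    """Return sum(w * v) / sum(w), or None (missing) if the weights sum to zero."""
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical two-cell region.
A = [100.0, 300.0]   # cell areas A_i
E = [2.0, 4.0]       # quantity being averaged, E_i
s = [0.75, 0.25]     # sea-ice fraction s_i

whole_cell = weighted_mean(E, cell_weights(A))       # "area: mean weighted_by area"
sea_ice_only = weighted_mean(E, cell_weights(A, s))  # "area: mean where sea_ice weighted_by area"
```

The same values give different means under the two specifications, because the second cell is large but mostly ice-free.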

taylor13 commented 2 years ago

My first reaction is that your approach, @davidhassell , is better. Thanks for thinking of it. Will give it some more thought and consider the implications for the "over type2" modifier that might also be included in a cell_methods.

JonathanGregory commented 2 years ago

Dear Karl @taylor13 and @davidhassell

Thanks for these questions and the discussion. Are these all actual use-cases, or is this anticipating a need? I would like to suggest that we already have the syntax for these cases, if we clarify or generalise the interpretation a bit.

The text of section 7.3.3 on "Statistics applying to portions of cells" says

> A cell_methods attribute with a string of the form "mean where type1 over type2" indicates the mean is calculated by summing over the type1 portion of the cell and dividing by the area of the type2 portion. ... If "over type2" is omitted, the mean is calculated by summing over the type1 portion of the cell and dividing by the area of this portion.

This syntax with "over" is thus a generalisation of "mean where type", which could also be expressed as "mean where type over type". It is calculated by summing over the type portion of the cell and dividing by the area of the type portion. Perhaps we ought to have said "integrating" rather than "summing". It must mean "integrating" because nothing else would make sense. If we are going to sum quantity X over an area and divide by an area, and we want the quotient to have the units of X, the "sum" must have units of X times units of area i.e. it's an area-integral. Hence, I infer that "mean where type" is the area-weighted mean over type. I believe that is what we had in mind and how it's been interpreted up to now, but if I'm right it could be clarified in the text.
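A small sketch of this reading of "mean where type1 over type2", with the "sum" taken as an area-integral. The function name, the argument layout and the numbers are assumptions made for illustration; the quantity X is taken to be defined only in the type1 portion of each cell.

```python
def mean_where_over(values, areas, frac_type1, frac_type2=None):
    """Integrate X over the type1 portion and divide by the area of the type2 portion.

    values:     X in the type1 portion of each cell
    areas:      cell areas
    frac_type1: fraction of each cell occupied by type1
    frac_type2: fraction occupied by type2; defaults to type1 ("over type2" omitted)
    """
    if frac_type2 is None:
        frac_type2 = frac_type1
    integral = sum(f1 * a * v for v, a, f1 in zip(values, areas, frac_type1))
    divisor = sum(f2 * a for a, f2 in zip(areas, frac_type2))
    return None if divisor == 0 else integral / divisor


# Hypothetical region of two cells.
areas = [100.0, 100.0]
sea_ice = [0.5, 0.25]    # fraction of each cell covered by sea ice
sea = [1.0, 0.5]         # fraction of each cell covered by sea
X = [3.0, 6.0]           # quantity defined on the sea-ice portion

mean_where_sea_ice = mean_where_over(X, areas, sea_ice)               # "mean where sea_ice"
mean_where_sea_ice_over_sea = mean_where_over(X, areas, sea_ice, sea) # "mean where sea_ice over sea"
```

Because the numerator is an area-integral, both results keep the units of X, which is the point of the argument above.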

Therefore

> We want to compute an area-mean (over several grid cells) of the sea ice age in a grid cell, weighted by area covered by sea ice: sum(over cells) [ s_i * A_i * E_i ] / sum(over cells) [ s_i * A_i ]

is "area: mean where sea_ice". This is consistent with the text above, except that the text speaks of "cells". We ought to rephrase it somehow e.g. with "region", in case the area-mean is aggregating more than one cell, as in Karl's use-case.

For case 2, the current text of 7.3.3 is too restrictive at the outset in saying "the statistical method indicated by cell_methods is assumed to have been evaluated over the entire horizontal area of the cell." It would be more accurate if it said "over the entire extent of the cell in the dimensions involved." This first statement should apply to any dimensions. In particular "time: mean" is assumed to indicate a mean evaluated over the entire extent of the cell in time. I think it's obvious that this is the intended meaning. For precipitation_flux in kg m-2 s-1, the "time: mean" is the precipitation_amount in kg m-2 divided by the duration of the cell in s.
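The precipitation example can be written out as a short sketch: the time-mean over the entire extent of the cell is the accumulated amount divided by the full duration, dry spells included. The function name, the sub-interval layout and the numbers are hypothetical.

```python
def time_mean_flux(bounds, flux):
    """Time-mean over the whole cell: accumulated amount divided by total duration.

    bounds: (start, end) times in s of the sub-intervals that make up the cell
    flux:   precipitation_flux in kg m-2 s-1 during each sub-interval
    """
    durations = [end - start for start, end in bounds]
    amount = sum(f * dt for f, dt in zip(flux, durations))  # precipitation_amount, kg m-2
    return amount / sum(durations)                          # kg m-2 s-1


# Hypothetical one-hour cell: ten minutes of rain followed by fifty dry minutes.
bounds = [(0.0, 600.0), (600.0, 3600.0)]
flux = [0.006, 0.0]
mean_flux = time_mean_flux(bounds, flux)
```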

The remainder of 7.3.3 talks only about area but I think it would be unproblematic to allow it to apply to other dimensions as well, if we allow area_types to be interpreted as meaning "where or when this type exists". For example, it might be useful for precipitation_flux to have a time: mean where precipitation, if precipitation was an area-type (it isn't at the moment), indicating the precipitation flux meaned over the portion of time when it was not zero. If we allow this interpretation,

> We want to compute a time-mean of the sea ice age in a single grid cell, weighted equally across all time-samples: sum(over time samples) [ delta_n * E_n ] / sum(over time samples) [ delta_n ] where delta_n is a function set to 1 if s_n > 0 and set to 0 if s_n = 0.

is "area: mean where sea-ice time: mean where sea-ice". First we compute the quantity in the sea-ice portion of the cell, which I suppose might give missing data when there is no sea ice in the cell, then we compute the time-mean of the epochs when there is sea ice present.

We can express both

> We want to compute a time-mean of the sea ice age in a single grid cell, weighted by the area of sea ice found in that grid cell at each time sampled: sum(over time samples) [ s_n * E_n ] / sum(over time samples) [ s_n ]

> We want to compute a time-mean of the area-means (computed in 1 above), weighting each area-mean by the area occupied by sea ice in that area. sum (over i & n) [ s_i,n * A_i * E_i,n ] / sum (over i & n) [ s_i,n * A_i ]

as "area: time: mean where sea-ice". The difference between this and the previous case is that the mean is done over both dimensions at once. We compute the double integral ∫∫ X H(X) dA dt over area and time, and divide it by ∫∫ H(X) dA dt, where H(X) is the function that is 1 if the type exists and 0 if it does not. The numerator has units of X times square metres times seconds, the denominator has units of square metres times seconds, so the quotient has units of X as required. With the double integral, the time-epochs are weighted according to the area of sea-ice at each time, instead of equally weighted in the time-mean. Case 4 is just the same.
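The difference between the two specifications can be sketched numerically. In the snippet below, sequential_mean follows "area: mean where sea-ice time: mean where sea-ice" (area-mean first, then an equally weighted mean over the times that have sea ice), and joint_mean follows "area: time: mean where sea-ice" (one mean over both dimensions, as a discrete version of the double integral). The data, the function names and the use of None for missing values are assumptions for illustration.

```python
# Hypothetical data: 2 cells (index i), 3 time samples (index n).
A = [100.0, 300.0]                          # cell areas A_i
s = [[0.25, 0.75, 0.0], [0.0, 0.25, 0.0]]   # sea-ice fraction s_i,n
X = [[1.0, 2.0, None], [None, 4.0, None]]   # quantity in the sea-ice portion (None = no ice)
N = 3


def area_mean_where(n):
    """Area-mean over the sea-ice portion at time n; None if there is no sea ice."""
    icy = [i for i in range(len(A)) if s[i][n] > 0]
    den = sum(s[i][n] * A[i] for i in icy)
    if den == 0:
        return None
    return sum(s[i][n] * A[i] * X[i][n] for i in icy) / den


def sequential_mean():
    """Equally weighted time-mean of the area-means, skipping times without sea ice."""
    means = [m for m in (area_mean_where(n) for n in range(N)) if m is not None]
    return sum(means) / len(means) if means else None


def joint_mean():
    """Single mean over area and time, weighted by sea-ice area s_i,n * A_i."""
    icy = [(i, n) for i in range(len(A)) for n in range(N) if s[i][n] > 0]
    den = sum(s[i][n] * A[i] for i, n in icy)
    if den == 0:
        return None
    return sum(s[i][n] * A[i] * X[i][n] for i, n in icy) / den
```

In this example the second time sample has six times the sea-ice area of the first, so the joint mean is pulled towards its value while the sequential mean is not.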

Karl's use-case is for age_of_sea_ice. This is a quantity which has no meaning except in sea-ice areas. For a quantity like that, I am not sure we really need "where sea-ice" in case 1, do we? Perhaps we should clarify the convention in this respect. We certainly do need it for any quantity which could exist anywhere e.g. surface_temperature.

Best wishes

Jonathan

davidhassell commented 2 years ago

Dear Jonathan,

This is very interesting! This "... where ... over ..." text has, I presume erroneously, been formatted as an example description rather than main-body text since CF-1.6 - and that's the excuse I'm giving for the fact that I don't recall reading it :) Do you agree that it should be re-instated? If so, I'll raise over at https://github.com/cf-convention/cf-conventions/issues.

Case 1: "area: mean where sea_ice"

Works for me.

> For case 2, the current text of 7.3.3 is too restrictive at the outset in saying "the statistical method indicated by cell_methods is assumed to have been evaluated over the entire horizontal area of the cell." It would be more accurate if it said "over the entire extent of the cell in the dimensions involved." This first statement should apply to any dimensions.

I agree

> The remainder of 7.3.3 talks only about area but I think it would be unproblematic to allow it to apply to other dimensions as well, if we allow area_types to be interpreted as meaning "where or when this type exists". For example, it might be useful for precipitation_flux to have a time: mean where precipitation, if precipitation was an area-type (it isn't at the moment), indicating the precipitation flux meaned over the portion of time when it was not zero.

I agree

Case 2: "area: mean where sea-ice time: mean where sea-ice".

Works for me.

Cases 3 and 4: "area: time: mean where sea-ice"

Works for me.

JonathanGregory commented 2 years ago

Dear @davidhassell

Yes, I agree with you that the text after Ex 7.7 should be "unindented". It is main text, not part of the example. I hadn't noticed. That is a defect which we should correct. The other points on which we agree are perhaps enhancements, or arguably also defects because the intent of the convention is not clear.

Best wishes

Jonathan

taylor13 commented 2 years ago

Simply being more explicit (and eliminating misinterpretation) as to what the "where" and "over" directives mean, and generalizing them to cover non-spatial dimensions may be all that is necessary. First, to answer some questions raised:

Raising this issue was motivated by re-examining the CMIP6 output specifications, so it is an existing "use-case". It seemed to me that someone preparing CMIP6 output must have had trouble deciding exactly how to compute reported values with the current guidance provided by the CF standards document. I think the cases originally enumerated above, if clarified, would make interpretation straightforward for CMIP6 output (perhaps with a few exceptions). Note that it is not only the variable, "age of sea ice", that needs to be clarified in CMIP6, but this was used as an example.

Jonathan asked: "For a quantity like that [i.e., one that is only defined where the area_type exists], I am not sure we really need 'where sea-ice' in case 1, do we?" Yes, I think that if we make clear how one should calculate the statistic in this case, then the where might in some cases become unnecessary. Consider a calculation of the mean_age_of_snow (F) on sea ice when cell_methods is specified as "where sea_ice". Isn't there a danger that a user would calculate this as:

mean = sum(over cells) [ s_i * A_i * F_i ] / sum(over cells) [ s_i * A_i ] where A is the cell area and s is the sea-ice fraction? If so, this would provide an underestimate of the snow age in regions where it actually exists.

What is wanted is:

mean = sum(over cells) [ s_i * A_i * H(s_i) * F_i ] / sum(over cells) [ s_i * A_i * H(s_i) ] where H(s_i) is 1 if snow exists in the cell and 0 if it does not. Also the mean would be undefined (missing) when the denominator is zero.

I think this should be made clear in the standards document.
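A minimal numerical sketch of the two calculations, assuming that a cell with sea ice but no snow would otherwise contribute a snow age of zero. The function name, the zero_fill flag and the data are hypothetical.

```python
def snow_age_mean(F, s, A, snow_present, zero_fill):
    """Mean snow age on sea ice, weighted by sea-ice area s_i * A_i.

    zero_fill=True  counts snow-free cells with an age of zero (the feared reading).
    zero_fill=False omits snow-free cells, i.e. applies the factor H.
    """
    num = 0.0
    den = 0.0
    for F_i, s_i, A_i, has_snow in zip(F, s, A, snow_present):
        weight = s_i * A_i
        if has_snow:
            num += weight * F_i
            den += weight
        elif zero_fill:
            den += weight          # value treated as zero, weight still counted
    return None if den == 0 else num / den   # missing when the denominator is zero


# Hypothetical region: both cells are half covered by sea ice, only the first has snow.
A = [100.0, 100.0]
s = [0.5, 0.5]
F = [10.0, None]
snow_present = [True, False]

underestimate = snow_age_mean(F, s, A, snow_present, zero_fill=True)
wanted = snow_age_mean(F, s, A, snow_present, zero_fill=False)
```

Here the zero-filled calculation gives half the age of the snow that actually exists.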

JonathanGregory commented 2 years ago

Dear Karl @taylor13

Yes, I agree, this should be clarified. The danger you mention arises because the data-writer is unclear what to assume for a quantity which is only defined for a certain area type in those areas where it's not defined. The data-writer might simply omit them from the mean, as if they were missing data, which is what you want. On the other hand, they would get an underestimate if they assume a value of zero. For age of snow on sea ice I don't think it would make sense to assume zero where there is no sea ice, but it might be done. For depth of snow on sea ice it would arguably be reasonable to assume zero where no sea ice. Certainly we need to be clearer about this.

We could insert a clarification as a new paragraph before Example 7.7. For instance,

> Sometimes quantities are meaningful only for particular portions of the gridcell. For example, the quantity with standard name of age_of_sea_ice is defined only where there is sea ice. When a statistic has been calculated over a gridcell of which only part has sea ice, it may be unclear whether the non-sea-ice part has been somehow included. To clarify that the statistic applies only to the relevant part of the gridcell, it is recommended to include a "where type or typevar" specification even for quantities which are defined only for that particular area type.

Would that be sufficient and clear?

Cheers

Jonathan

taylor13 commented 2 years ago

Dear Jonathan,

Yes, I think the suggested text would be very helpful, and the recommendation should be followed by anyone who wants to guard against data being misinterpreted.

I remembered why I thought we might need to add a new qualifier ("weighted_by") to the cell_methods: to distinguish between datasets already written where the weighting may be ambiguous and datasets that will be written under the new, more explicit rules we're now considering for cell_methods. I think in the past, a mean, for example, could have been written with each mean computed from equally weighted samples (rather than weighted by area, as we now propose should be done). A data user won't know (without looking at the conventions attribute, if one is provided) whether "area: mean" implies unambiguously "weighted by area" or not, even though under our present proposal it should by default mean "area-weighted".

Moreover, what if under the new scheme we don't want samples to be area-weighted? Consider a very sparse observational network used to sample some quantity like precipitation rate, where the measurements are known to be statistically independent. Suppose these measurements are reported on a grid (of cells of unequal area). To estimate the mean value for the region, one would likely simply weight each sample equally, without regard to the area of the cell. I think under the current wording of the convention, one would permit this, and a careful data writer would include a cell_methods = "area: mean (with each observational site weighted equally)". This would differ from a mean computed from a full-coverage simulated field of precipitation where the mean might more accurately be calculated with cell_methods = "area: mean", which under the new rules would be unambiguously interpreted as an area-weighted mean. Does the "clarification of cell_methods" we're discussing provide for a mean that is not area-weighted?

There are other weightings possible (such as those described in https://github.com/cf-convention/discuss/issues/173#issuecomment-1218350059). In particular suppose we have a 3-d field reported on an atmospheric grid with altitude as the vertical coordinate. How should a mass-weighted mean be indicated by the cell_methods? For example in computing the mean water vapor mixing ratio, each sample should be weighted by the mass of air in the cell. Should this be indicated in a parenthetical statement or should we indicate it in a more standard way?
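To show how much the choice of weights can matter, here is a short sketch comparing equal weighting with area weighting for the sparse-network case, and with mass weighting for the mixing-ratio case. All numbers and names are hypothetical.

```python
def mean(values, weights=None):
    """Weighted mean; with weights=None every sample counts equally."""
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * v for w, v in zip(weights, values)) / total


# Sparse observations reported on cells of unequal area.
precip = [2.0, 6.0]          # precipitation rate per cell (arbitrary units)
cell_area = [100.0, 300.0]   # cell areas
equal_weighted = mean(precip)              # each observational site weighted equally
area_weighted = mean(precip, cell_area)    # weighted by cell area

# Water vapor mixing ratio in two cells holding different masses of air.
mixing_ratio = [0.010, 0.002]   # kg of vapor per kg of air
air_mass = [8.0, 2.0]           # mass of air in each cell (arbitrary units)
mass_weighted = mean(mixing_ratio, air_mass)
```

None of these three results can be told apart from a bare "area: mean" or "mean" in cell_methods, which is the ambiguity being discussed.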

cheers, Karl

taylor13 commented 2 years ago

By the way, I support extending "where" to mean "where or when".

JonathanGregory commented 2 years ago

Dear Karl

I agree that it is not clear in cell methods whether weighting of any kind has been applied. This doesn't apply just to means and areal statistics, but is a general point. At the moment, weighting is mentioned only in passing, in sect 7.3.2, where we say, "For instance, an area-weighted mean over latitude could be indicated as lat: mean (area-weighted)". Also in 7.2 (about cell measures), we remark "For instance, in computing the mean of several cell values, it is often appropriate to weight the values by area."

I don't think that we should introduce any new assumption about weighting, but maybe we should make a statement about it near the start of 7.3, and refer to 7.3.2 for the syntax of recording a comment about the weighting. We could recommend that such a comment is included if it might be important information for the user of the data. What guidance would you give?

Best wishes

Jonathan

davidhassell commented 2 years ago

Hello,

I find it confusing that it is unspecified whether or not area-weighting was applied for area: mean, but area: mean where sea_ice is assumed to be area weighted. Have I got that right?

I also support extending "where" to mean "where or when".

JonathanGregory commented 2 years ago

> I find it confusing that it is unspecified whether or not area-weighting was applied for area: mean, but area: mean where sea_ice is assumed to be area weighted. Have I got that right?

I think you have got it right, and I agree it's confusing. I believe this reflects the unstated assumption we've always made that means are area-weighted, which we ought to state. Nonetheless alternatives are possible. For example, if several grid-cells are included in the region, area: mean where sea_ice could mean an unweighted average of the value in the sea-ice area of each of the cells. I think it would be reasonable to say that in calculating means or any other statistic where it's relevant it should be assumed that weighting was applied according to the extent of the cell in the affected dimensions, and by area for the area keyword. If it is particularly important to clarify this point, especially if some other sort of weighting was applied, it can be described in a comment.
