Change "time" value for averaged quantities to midpoint of averaging period

strandwg commented 4 years ago

A longstanding quirk of CAM is that fields saved as averages have the "time" coordinate value of the end of the averaging period. A January monthly average has a timestamp of 1 February, for example. Changing the "time" coordinate value to represent the middle of the averaging period assists our users in understanding CAM output and doing proper analysis.
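
To make the off-by-a-month effect concrete, here is a minimal sketch using only Python's standard `datetime` module. The January 1850 bounds are illustrative values, not read from an actual CAM file, and the standard library uses the Gregorian calendar rather than CAM's 365-day calendar (the two agree for this particular month).

```python
from datetime import datetime

# Illustrative averaging period for a January monthly mean (assumed values).
start = datetime(1850, 1, 1)
end = datetime(1850, 2, 1)

# Behavior described above: "time" is the end of the averaging period.
end_stamp = end

# Requested behavior: "time" is the midpoint of the averaging period.
mid_stamp = start + (end - start) / 2

print(end_stamp.isoformat())  # 1850-02-01T00:00:00
print(mid_stamp.isoformat())  # 1850-01-16T12:00:00
```

A tool that only reads the month from the timestamp would label the first value as February and the second as January.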

Fields saved as instantaneous values don't need to be changed at all.

Having one time axis for instantaneous fields and one for averaged fields would allow both kinds of output to reside in the same file, but netCDF-3 doesn't allow for more than one UNLIMITED dimension in a single file. netCDF-4 allows multiple UNLIMITED dimensions; however, pnetcdf is built on netCDF-3, so multiple time axes aren't possible.

gold2718 commented 4 years ago

I have several questions about this issue that can help us process it.

1. Why is a center time considered more correct? Does this apply to MIN and MAX fields as well, even if that is not when the minimum or maximum value was captured?
2. If a center time is desired, the average of the time_bnds variable for that time frame is the correct value so the data is already on the file. Why is this not a solution?
3. Why is another dimension needed?
   - If we want a variable that describes the middle of a time-averaged field, why can't we just define a new variable, center_time (or similar name)? Both time and center_time will be defined using the time dimension, as does time_bnds now.

brian-eaton commented 4 years ago

I take offense (not really:) to the suggestion that CAM's time coordinate is quirky. It is completely in conformance with the CF Metadata conventions!

If the timestamp of a time interval is desired to be at the interval midpoint, then the average of the interval endpoints is a value, say '15.5 days since 1850-01-01 00:00:00', which may then be converted to a calendar representation, or not, if say the time coordinate is used for plotting relative values along a time axis. Typically the thing that causes the biggest headache has been converting to a calendar representation. That's because standard tools for doing this kind of conversion don't recognize a 365-day year. I haven't kept up with the postprocessing toolchains, but surely by now this is no longer a problem? Bottom line: I'm agreeing with Steve's comment: Why is this not a solution?
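
As a rough illustration of the two steps described above (average the interval endpoints, then optionally convert to a calendar date in a 365-day year), here is a minimal sketch in plain Python. The helper names `midpoint` and `noleap_to_calendar` are hypothetical, and the January bounds of 0 and 31 days are assumed for the example.

```python
# Days per month in a 365-day ("noleap") calendar.
DAYS_PER_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def midpoint(bounds):
    """Average the two endpoints of one averaging interval."""
    lower, upper = bounds
    return 0.5 * (lower + upper)


def noleap_to_calendar(days_since, ref_year=1850):
    """Convert 'days since <ref_year>-01-01 00:00:00' to (year, month, day, hour),
    assuming every year has exactly 365 days. Hypothetical helper."""
    whole_days = int(days_since // 1)
    hour = (days_since - whole_days) * 24.0
    year = ref_year + whole_days // 365
    day_of_year = whole_days % 365  # 0-based day within the year
    month = 1
    for n in DAYS_PER_MONTH:
        if day_of_year < n:
            break
        day_of_year -= n
        month += 1
    return year, month, day_of_year + 1, hour


january_bounds = (0.0, 31.0)  # assumed bounds for January 1850
mid = midpoint(january_bounds)
print(mid)                      # 15.5
print(noleap_to_calendar(mid))  # (1850, 1, 16, 12.0)
```

Because the conversion never consults a Gregorian calendar, day 59 of any year maps to 1 March, including years that would otherwise be leap years.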

I also agree with Steve's comment on the unlimited dimension. There only needs to be one unlimited dimension and it should be used for any alternate time axis, e.g., center_time. The only issue here is that center_time(time) is not a coordinate variable (in the NetCDF conventions sense) and so connecting it to the time_bnds from which it is derived would in fact be "quirky", that is, there's no convention for doing that. So it would be a CAM-specific thingy that our postprocessing tools could be hardcoded to recognize, but nothing else would recognize it. I don't recommend this solution.

strandwg commented 4 years ago

1) CMIP3/CMIP5/CMIP6 have all required that the "time" value for averaged quantities be the midpoint of the period, with the "time_bnds" values representing the endpoints of the averaging period. Many of our users are familiar with MIPn data, and expect that CAM output is the same. I've gotten many emails over the years asking about this exact issue.

Fields that are MIN and MAX should have their time value represent the period over which the extrema has been found, analogous to an average. I don't think CAM has ever kept track of which timestep an extreme was found.

2) Averaging "time_bnds" is what's done for the translation from CESM output to MIP-compliant format, and I know at least a few users who've written their own software to do that average and realign their data to the new time axis. CAM can do that natively and thus prevent the issue.

3) If averaged and instantaneous fields were forced to be in different output streams, then an additional time dimension wouldn't be needed, but CAM allows outputting both instantaneous and averaged fields in the same stream. The "time" dimension is overloaded in a sense by using the same single axis to attempt to serve both kinds of fields.

brian-eaton commented 4 years ago

There is a lot of postprocessing that needs to be done to produce files that are in compliance with CMIP. And there is no issue in that case about putting both instantaneous and time averaged fields into a single file (as far as I know).

I think having CAM produce CMIP compliant output directly would be a substantial task which is far better left to postprocessing. If CAM isn't going to produce CMIP compliant output is there really a significant gain in going to the partway solution of splitting CAM's instantaneous and time interval based output into separate streams? What we're doing now is simpler.

I agree with your point about MIN and MAX fields. CAM has never kept track of the timestep where the extreme value occurs and the mid-point time is no more correct than the endpoint in that case.

If the argument is made that the mid-point time coordinate is scientifically more correct than the endpoint then that could be a valid reason for adding the complication of separate output streams for instantaneous and time interval based output. I think that's really the discussion that the scientists need to have. From an SE perspective there is no real difficulty in doing this. Just a bit of work to separate out the currently mixed fields to create separate streams, and adding some checking code in CAM to prevent mixed output in the future.

strandwg commented 4 years ago

Basically, CAM is overloading "time" by using one axis to represent both averaged and instantaneous values. Since CESM uses pnetcdf, and pnetcdf is built on netCDF-3, CESM is allowed only one unlimited dimension in a file. If CESM could use netCDF-4, then additional unlimited time axes could be defined - one for instantaneous values, one for averaged values, in the same file.

I'm asking that CAM (and CESM as a whole) write MIP compliant output directly, since CESM now dumps out so much data that keeping copies (CESM output and CMIP compliant format) of the same data is no longer practical and makes managing CESM data much more difficult.

brian-eaton commented 4 years ago

Let's not mix up this issue with direct output in MIP format. That's a much larger task than dealing with the time coordinate. I think you should open another issue if you want to keep that on the radar. My initial thought is that the ongoing exercise by CAM to upgrade its postprocessing toolchain should address this issue. If our tools dealt with the MIP format then there would be no reason to save the original history files once the conversion was done. Obviously though not all CAM history output would need to be converted to MIP format and ideally the new tools would deal with either format.

strandwg commented 4 years ago

Yes, MIP format is another issue. Sorry to inject it here.

gold2718 commented 4 years ago

> Fields that are MIN and MAX should have their time value represent the period over which the extrema has been found, analogous to an average.

I'm not even sure how that would work, given that the minimum or maximum value is per grid box, which means that the capture could represent hundreds of different times in a one-month sample (up to (number of grid cells * number of levels) for longer intervals). What am I missing?

brian-eaton commented 4 years ago

Good point. Clearly the time coordinate being at the end of the interval is as good a choice as any in this case. The midpoint would also be as good a choice as any if we end up resolving this issue by making that change for the sake of the time averaged variables.

strandwg commented 4 years ago

> > Fields that are MIN and MAX should have their time value represent the period over which the extrema has been found, analogous to an average.

> I'm not even sure how that would work given that the minimum or maximum value is per grid box which means that the capture could represent hundreds of different times in a one-month sample (up to (number of grid cells * number of levels) for longer intervals). What am I missing?

Nothing - you're correct. A min or max over a month is over many timesteps - thousands in a month.

strandwg commented 4 years ago

I've been thinking and I believe using a meaningful value for "time" for a temporal average is better than using the instantaneous value for "time" - even for instantaneous output.

I've not often seen instantaneous output in the monthly-mean stream, but if there is, obviously all that's wanted is just a value, and it doesn't matter when it happened.

Actually, the same thing applies to daily output, 6-hourly output, 3-hourly output, etc. What's really wanted is the preservation of variability, which is important for fields like precipitation and a few others. The exact value for "time" doesn't matter, and any value is as good as any other, and just so there are 365 values per year, one day apart, for daily data, 365*4, 6 hours apart, for 6-hourly data, etc. "time_bnds" has no meaning.

For averages, however, the value for "time" does matter. By convention (and by de facto standard in MIPs), the value for "time" of an average is the midpoint of the times, with "time_bnds" representing those endpoints. Makes perfect sense given the limitations of netCDF's representation of spans of time.

By using the instantaneous value for "time" for average data (especially monthly-mean output), we just confuse users and lose meaning to our data.

Unless I've overlooked something, and I'm entirely wrong.

gold2718 commented 4 years ago

@strandwg, Thanks for the discussion, we will run this by the AMP scientists to see if keeping mixed fields (instantaneous fields in an averaged file) is really important to anyone or if we should try and phase them out.

strandwg commented 4 years ago

@gold2718, Thanks.

swrneale commented 4 years ago

I've always felt that variables were described correctly in non-monthly-mean files, i.e., 'time' describes the instantaneous time step for an instantaneous field. For a time-averaged variable, the metadata of 'time_averaged' attached to the variable tells you the averaging period is in time_bnds.

Monthly averages are messy because the file naming is more informative for the time averaging obviously. I have just always resisted the center time thinking for the time stamp. 31-day months give you 12Z on the 16th day, which just feels clunky to me.

brian-eaton commented 4 years ago

> I've been thinking and I believe using a meaningful value for "time" for a temporal average is better than using the instantaneous value for "time" - even for instantaneous output.

> I've not often seen instantaneous output in the monthly-mean stream, but if there is, obviously all that's wanted is just a value, and it doesn't matter when it happened.

> Actually, the same thing applies to daily output, 6-hourly output, 3-hourly output, etc. What's really wanted is the preservation of variability, which is important for fields like precipitation and a few others. The exact value for "time" doesn't matter, and any value is as good as any other, and just so there are 365 values per year, one day apart, for daily data, 365*4, 6 hours apart, for 6-hourly data, etc. "time_bnds" has no meaning.

> For averages, however, the value for "time" does matter. By convention (and by de facto standard in MIPs), the value for "time" of an average is the midpoint of the times, with "time_bnds" representing those endpoints. Makes perfect sense given the limitations of netCDF's representation of spans of time.

> By using the instantaneous value for "time" for average data (especially monthly-mean output), we just confuse users and lose meaning to our data.

> Unless I've overlooked something, and I'm entirely wrong.

I don't agree.

Consider an example of yearly averaged TS (surface temperature) data. Let's say time_bnds=[0,365] (days since 1850-01-01) and time=182.5. The annual average field doesn't look like any actual state of TS, so why is using a time coordinate of July 1 more correct than Dec 31 (MIP convention aside)? In fact the only relevant time information in this example comes from the bounds, which tell you it's an annual average for 1850.

Now consider instantaneous once yearly output of TS. Again let time_bnds=[0,365] and time=182.5. You claim it doesn't matter what the instantaneous time value is, but now we have a field which is clearly NH winter labeled with the time July 1.

I'm afraid we can't solve the problem of confused users, but the time coordinate convention used by CAM does not sacrifice any meaning of the data.

My conclusion is that the only reasonable way to move the time coordinate value to the interval midpoint is to not allow instantaneous fields in files that contain interval based fields.

billsacks commented 4 years ago

> I've not often seen instantaneous output in the monthly-mean stream, but if there is, obviously all that's wanted is just a value, and it doesn't matter when it happened.

@strandwg are you suggesting outputting instantaneous values at the central time of the given output file rather than the end time? That's something I was wondering about, but I wasn't sure whether or not it would be scientifically acceptable. I couldn't tell if that's what you're suggesting.

brian-eaton commented 4 years ago

The file naming issue that Rich brought up is separate from the time coordinate issue. We could with a bit of work adjust the filenames to contain dates that we think are appropriate.

strandwg commented 4 years ago

> > I've been thinking and I believe using a meaningful value for "time" for a temporal average is better than using the instantaneous value for "time" - even for instantaneous output. I've not often seen instantaneous output in the monthly-mean stream, but if there is, obviously all that's wanted is just a value, and it doesn't matter when it happened. Actually, the same thing applies to daily output, 6-hourly output, 3-hourly output, etc. What's really wanted is the preservation of variability, which is important for fields like precipitation and a few others. The exact value for "time" doesn't matter, and any value is as good as any other, and just so there are 365 values per year, one day apart, for daily data, 365*4, 6 hours apart, for 6-hourly data, etc. "time_bnds" has no meaning. For averages, however, the value for "time" does matter. By convention (and by de facto standard in MIPs), the value for "time" of an average is the midpoint of the times, with "time_bnds" representing those endpoints. Makes perfect sense given the limitations of netCDF's representation of spans of time. By using the instantaneous value for "time" for average data (especially monthly-mean output), we just confuse users and lose meaning to our data. Unless I've overlooked something, and I'm entirely wrong.

> I don't agree.

> Consider an example of yearly averaged TS (surface temperature) data. Let's say time_bnds=[0,365] (days since 1850-01-01) and time=182.5. The annual average field doesn't look like any actual state of TS, so why is using a time coordinate of July 1 more correct than Dec 31 (MIP convention aside)? In fact the only relevant time information in this example comes from the bounds which tells you it's an annual average for 1850.

Only if we adhere to a convention (or requirement?) that for averaged fields, the value of "time" represents the midpoint of the averaged period does my argument hold up. In any case, it doesn't make sense to me to have a value for "time" for an averaged field to be anything other than the midpoint of the averaged period.

> Now consider instantaneous once yearly output of TS. Again let time_bnds=[0,365] and time=182.5. You claim it doesn't matter what the instantaneous time value is, but now we have a field which is clearly NH winter labeled with the time July 1.

Good point. In the limit (one instantaneous sample over a year), the "time" value is important. For shorter output periods, "time" isn't as important. Usually instantaneous output uses short time intervals - 6-hourly or shorter.

> I'm afraid we can't solve the problem of confused users, but the time coordinate convention used by CAM does not sacrifice any meaning of the data.

Given that many users over a long time have been puzzled by monthly average time values, I think there is a way to adjust the time convention to make it more understandable.

> My conclusion is that the only reasonable way to move the time coordinate value to the interval midpoint is to not allow instantaneous fields in files that contain interval based fields.

That's one solution.

sherimickelson commented 4 years ago

Going forward, we're seeing more intercomparison projects. At the same time, several modeling centers are adopting the standardizations defined for MIPs in order to meet requirements to publish to ESGF and to make it easier to compare data with other centers. As far as I can tell, we are the outlier in how we use the time dimension, and it makes our data more difficult to use. Yes, if you are aware of the way we overload time you can figure out the correct time and adjust the data, but we are the only center that requires these steps. Also, this requires multiple copies of the data, and with disk space at a premium, this practice will become harder to justify in the future.

The way we use time makes it more challenging to do analysis. This is because standalone tools and Python libraries, like xarray, just use the time values; they do not check for a cell_methods attribute, then look at the bounds attribute on time, and then average the time bounds variable. Another difficulty is that the time bounds variable is named something different for each component: CAM uses time_bnds, CLM and CICE both use time_bounds, and POP uses time_bound. This creates several extra steps to evaluate our data. If our data were "correct" we wouldn't have to do these steps, and this is what we're in favor of: making our data easier to use, both for us and for the community.
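
The extra steps listed above (check cell_methods, follow the bounds attribute on time, average the bounds) could be sketched roughly as follows. This is not xarray code: the dataset is a plain dictionary stand-in, the function `time_values_for` and the variable `TS_inst` are hypothetical, and it is an assumption of this sketch that instantaneous fields carry no `time:` entry in cell_methods.

```python
def time_values_for(dataset, var_name, time_name="time"):
    """Return the time values appropriate for one variable.

    Interval-based fields (those with a 'time:' entry in cell_methods) get
    the midpoint of each bounds pair; everything else gets the stored time.
    """
    time_var = dataset[time_name]
    cell_methods = dataset[var_name]["attrs"].get("cell_methods", "")
    # The 'bounds' attribute names the bounds variable, so the lookup does
    # not depend on whether it is called time_bnds, time_bounds, or time_bound.
    bounds_name = time_var["attrs"].get("bounds")
    if "time:" in cell_methods and bounds_name in dataset:
        return [0.5 * (lo + hi) for lo, hi in dataset[bounds_name]["data"]]
    return list(time_var["data"])


def make_dataset(bounds_name):
    """Build a toy two-month dataset (times in days since the start)."""
    return {
        "time": {"attrs": {"bounds": bounds_name}, "data": [31.0, 59.0]},
        bounds_name: {"attrs": {}, "data": [(0.0, 31.0), (31.0, 59.0)]},
        "TS": {"attrs": {"cell_methods": "time: mean"}, "data": [288.0, 288.5]},
        "TS_inst": {"attrs": {}, "data": [287.0, 289.0]},
    }


for name in ("time_bnds", "time_bounds", "time_bound"):
    ds = make_dataset(name)
    print(name, time_values_for(ds, "TS"), time_values_for(ds, "TS_inst"))
```

Following the bounds attribute rather than hardcoding a variable name is what lets the same function handle all three component naming choices.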

If we want to avoid confusion and make analysis easier, the time dimension should reflect the correct time values that the data represents. The only way to do this correctly with netcdf3 is to have averaged fields in a different file than the instantaneous values. Would it be possible to create multiple variants of an output stream? For example, h0.avg and h0.inst (and I'm not sure what to do for min and max).

brian-eaton commented 4 years ago

My takeaway from the AMP meeting this morning is that there isn't any pushback on separating the instantaneous fields into different streams from those for fields based on time intervals.

I am a bit surprised at what seems to be the difficulty of implementing software tools that understand the CF Metadata conventions. The reason the components can use different names for the time bounds variable is that the conventions allow it. As you point out the bounds attribute points to that variable. I don't see why we can't build tools that understand CF even if xarray doesn't. I don't know xarray, but I assume it at least understands the basic things like NetCDF coordinate variables and attributes. If that's the case it shouldn't be hard to build a layer on top of that to interpret the CF semantics for those attributes. Building smart tools would make them useful to folks beyond the group analyzing MIP conforming datasets.

sherimickelson commented 4 years ago

@brian-eaton The problem is that we all have our own tools that handle this, and we impose the burden of doing the same on the community and anyone else who wants to use our data. Yes, it can be done, but this makes our data less usable out-of-the-box. Since we're the only modeling center that uses time in this manner, software tools and language libraries don't make an exception just for us, and we are always forcing our users to create workarounds.

brian-eaton commented 4 years ago

I'm missing something. The CF Metadata Convention is widely used. Our files conform to that standard. If you build smart tools other people can use them too. Any software that knows how to interpret CF is supporting a large user community, not just making an exception for us.

strandwg commented 4 years ago

CESM follows the CF metadata conventions in a limited sense - I don't believe any data have the "standard_name" attribute, just "long_name", for example.

phillips-ad commented 4 years ago

I'm a bit late to this thread, but just to add my voice to the conversation: I wouldn't say that the time coordinate used for monthly component output is incorrect, and I agree it conforms to CF conventions. I see why CAM's time coordinate for monthly data was set up the way it is, as it allows instantaneous/averaged data to have the exact same time value. Makes sense. The issue is two-fold:

1 - The monthly time convention is confusing for users who read in the data for the first time. At a recent section meeting almost everyone there stated that this issue caused them problems when they started to use CESM data (myself included), and one person admitted to having one publication where the data used/plotted was a month off. (This is not the first time I have heard of that happening.) A university professor said it trips up most of her students when she starts teaching them how to analyze CESM data. When I go over this topic at the CESM Tutorial the common refrain of students is: Why is the time (when translated) off by a month? In the end, by having the monthly time coordinate be at the end of the averaging period, we are making our data harder to use (correctly). Yes, everything is in the file to assist the user in deducing the correct time (via attributes), but the user shouldn't have to go beyond the time variable, nor do they expect to, as we're the only modeling institution that sets the monthly time this way.

2 - As Sheri has said, I have never heard of a tool that looks beyond the time coordinate variable when translating the time, and developers will not make an exception for the one modeling center that sets their time this way. When folks read in CESM monthly timeseries data in NCL, IDL, Python, etc., and convert the time, it will come back as 00Z of the following month. Yes, you can convert the time, but I guess my point is that the users shouldn't have to.

The solution of separating instantaneous and average/min/max outputs into 2 different streams could be the way to go. Although as Gary has stated, as it doesn't matter when the value is taken, instantaneous values could be taken mid-month, and one stream could be kept. The key is for time to be centered within the averaging period for averaged variables.

kmpaul commented 4 years ago

I'm late to this issue, too, and to make my statements seem to have even less credibility, I'm not a CAM developer or user! However, I do work on analysis and post-processing tooling, including the Pangeo Python software stack @sherimickelson mentioned above.

I think that this issue is more about a lack of standardization across model output (i.e., all models, as well as across components in the same model), and it is less about whether CAM's time coordinate is wrong. (I think we should just assume it is right.) Conventions, like CF, are not standards. Conventions allow different model developers lots of freedom to decide how much metadata they want to add to their file, while standards require a given set (no more, no less) of metadata in each file. MIP "requirements" are more like standards than anything that currently exists.

My perspective, as a tools developer, tells me that more and more people want to compare output from different models, as well as analyze complex interactions between components of the same model. As @phillips-ad points out, CAM's time coordinate creates headaches for many people. Some of those headaches are due to analysis tools not fully implementing the CF conventions, and some of those headaches come from the fact that CAM's time coordinate is just different from what many users expect. ...And I believe that that expectation can be seen as a de facto standard. I also believe that conforming to that standard will only make CESM/CAM data more useful to more people in the long run.

As for how to implement it, I leave that to the experts. And I hope my interjection isn't seen as off-topic, inflammatory, or generally distracting. I think this is actually an important issue to resolve. Probably in part because I'm one of the people who gets to deal with users' complaints about "Why is the time (when translated) off by a month?", as @phillips-ad quotes. 😄

gold2718 commented 4 years ago

@phillips-ad & @kmpaul, I do not think you are late to the conversation, I think it is just getting started. Any changes we are contemplating are likely to be targeted at CESM2.3 (unless the co-chairs think it needs to be in a CESM2.2 maintenance release). I think it is important to take the time to try and get this right.

brian-eaton commented 4 years ago

I did misspeak above referring to CF as a standard. Standards tend to be heavyweight and rigid things, the CF convention is lightweight and allows modelling groups lots of flexibility for local conventions, for example in choosing variable names. I think we've hit a good balance by using postprocessing to convert CAM history files to MIP conforming files. Producing output for MIPs is not the only thing CESM is used for, and adopting the MIP "standard" for all CAM output would not be the most convenient for many CAM users. That said I think we've arrived at a reasonable compromise for CAM to use separate output streams for instantaneous variables which allows the time coordinate to be set to midpoints for interval output. Also, since AMP is in active discussion about postprocessing tools to replace the current AMWG diagnostics package, this is an excellent time to think about tools that are flexible enough to process either MIP or CAM history files. That way any history files that were converted to MIP format could be deleted.

strandwg commented 4 years ago

@brian-eaton, I think we're reaching a consensus in that instantaneous fields and averaged fields will be forced to be in different output streams and can thus have different time axes.

So, how would we differentiate between the instantaneous stream and averaged streams? Leave the "cam.h0", "cam.h1", etc. alone and use a different convention? Perhaps "cam.hi" for instantaneous output?

billsacks commented 4 years ago

For what it's worth – and it may be worth something if we care about consistency between components: In CLM, I think we'll end up having multiple history streams that contain instantaneous variables – so it wouldn't work to just have a single clm.hi. Furthermore, from a preliminary analysis, it appears that, in common use cases, some of these instantaneous streams will NOT have average counterparts (i.e., there will be particular sets of time frequency and subgrid averaging options for which we ONLY have instantaneous variables, not time-averaged variables). So in my view it makes sense to think about these as completely separate streams with consecutive numbering. Global metadata in the file could indicate whether the file contains time-averaged or instantaneous quantities.

brian-eaton commented 4 years ago

@billsacks, I agree. CAM currently allows 12 independent output streams. I would anticipate just doubling that number to start and using the same numbering scheme we currently do, with h0 being the monthly average output. But that of course is TBD. We decided at the AMP meeting this morning that a crack team would assemble to start working out a refactored design.

billsacks commented 4 years ago

It would probably be good if either @ekluzek or I could be included in the high-level discussions that touch on user interface: I think we'll want CLM to remain consistent with CAM if possible, so it would be good to ensure that any decisions made on the user interface side are going to work for CLM, too.

klindsay28 commented 4 years ago

In addition to user specification of output fields via fincl, CAM adds a handful of fields to all history files.

I'm pretty sure that some of these, e.g., co2vmr, ch4vmr, and sol_tsi, are instantaneous values from the timestep when the file is written.

So an effort to separate instantaneous quantities into separate streams and set the time coordinate to time interval midpoints in non-instantaneous streams should consider what to do for these fields.

billsacks commented 4 years ago

Our current (still tentative) thinking in CTSM (https://github.com/ESCOMP/CTSM/issues/1059) is this: Currently, there are two ways to specify the averaging characteristics of history fields: (1) the field-specific :A, :M, :X and :I, and (2) the file-specific hist_avgflag_pertape. We're thinking of simply dropping support for :I, so that only mechanism (2) could be used to create instantaneous diagnostics. Along with this, we would:

- Change the time values on a non-instantaneous file to specify the center, rather than end, of the averaging period
- Figure out what to do with the very small number of fields that are currently instantaneous by default
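
For illustration only, here is a minimal Python sketch of what separating field specifications by their averaging flag might look like. It is not CTSM or CAM code (both are Fortran); the function names, the example field names, and the choice of `A` as the default flag when none is given are all assumptions made for this sketch.

```python
VALID_FLAGS = {"A", "M", "X", "I"}  # the field-specific suffixes named above


def split_field_spec(spec, default_flag="A"):
    """Split 'NAME:FLAG' into (name, flag); default_flag is an assumption."""
    name, sep, flag = spec.partition(":")
    flag = flag if sep else default_flag
    if flag not in VALID_FLAGS:
        raise ValueError(f"unknown averaging flag in {spec!r}")
    return name, flag


def partition_stream(specs):
    """Return (instantaneous_names, interval_names) for one stream."""
    instantaneous, interval = [], []
    for spec in specs:
        name, flag = split_field_spec(spec)
        (instantaneous if flag == "I" else interval).append(name)
    return instantaneous, interval


def check_not_mixed(specs):
    """Raise if a stream mixes instantaneous and interval-based fields."""
    instantaneous, interval = partition_stream(specs)
    if instantaneous and interval:
        raise ValueError(
            f"mixed stream: instantaneous {instantaneous}, interval {interval}"
        )


stream = ["TS:A", "PS:I", "PRECT:X"]  # illustrative field list
print(partition_stream(stream))  # (['PS'], ['TS', 'PRECT'])
```

Calling `check_not_mixed(stream)` on the list above raises `ValueError`, which is the kind of guard that would keep midpoint and instantaneous time values out of the same file.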

gold2718 commented 4 years ago

In CAM, we also can specify the field handling in the addfld call and in the add_default call so we can have a mess even without ':I' :(

phillips-ad commented 2 years ago

As @gold2718 brought up in https://github.com/ESCOMP/CAM/issues/554#issuecomment-1089370672, the date and datesec variables will need to be aligned with the new centered time for the :A :M :X field streams.

dabail10 commented 1 year ago

Love this discussion. I am now just looking into this for CICE. What about the filenames themselves? Currently they contain a date string YYYY-MM-DD-SSSSS. This will have to change if the time axis changes. I guess, just playing devil's advocate here, if we have to postprocess to create single variable time series, why can we not modify the time axis at that point?

strandwg commented 1 year ago

Which filenames have YYYY-MM-DD-SSSSS? Are you talking about CICE or CAM?

The information needed to write out time at the midpoint of the interval is at hand in the model. Postprocessing is a separate step, and moving the change there would require the transposition code to do calculations, which it can't: it slices and reassembles slabs and makes no changes to data values.

dabail10 commented 1 year ago

Sorry, I wasn't clear. The monthly files are only YYYY-MM, the daily are YYYY-MM-DD and so on. However, it means you have to convert seconds since into the YYYY, MM, etc. Just another step. POP currently sets the daily files as YYYY-MM-01 at the beginning of the averaging period. So, are you saying we don't have to change the filenames?

phillips-ad commented 1 year ago

@dabail10 that is a great question. My thought is that as long as the file name reflects a time that is within the averaging period of a timestep on the file, then no, the file names do not need to change. While it would be nice if every component was consistent in terms of how they handle file naming, I do not believe it is a high priority to make it so. I'd be curious to hear others' thoughts on this though.

slevis-lmwg commented 1 year ago

CTSM update: I have started working on the land equivalent of this issue and have opened this PR. Thus far I have only worked on separating the instantaneous fields from all other history fields.

ESCOMP / CAM

Change "time" value for averaged quantities to midpoint of averaging period #159