At the moment, the internal structure is built around being able to wrap the provided observation datetimes into a dates x time of day
view, which is useful for then taking daily averages along the time of day axis. Padding out the observations to restore complete days would preserve that, but would mean that the values would have to be similarly padded - at first glance that seems like a confusing thing to implement. Having a one-dimensional array of sampled indices and then a daily grouping variable seems cleaner?
The approach here, using `np.ufunc.reduceat`, looks like it could work well: https://stackoverflow.com/questions/50950231/group-by-with-numpy-mean
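For reference, a minimal sketch of that group-by-mean pattern using `np.add.reduceat`, assuming the observations are already sorted by day (the arrays here are invented for illustration):

```python
import numpy as np

# Toy data: five subdaily observations with a sorted day label for each one.
values = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
day = np.array([1, 1, 2, 2, 2])

# Index of the first observation of each day (change points in the day labels).
starts = np.r_[0, np.flatnonzero(np.diff(day)) + 1]

# Sum each day's slice with reduceat, then divide by the observation counts.
sums = np.add.reduceat(values, starts)
counts = np.diff(np.r_[starts, len(values)])
daily_means = sums / counts  # array([3.5, 6.0])
```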
In more detail:
This class provides a way to map observations at subdaily scales (e.g. hourly) onto a representative daily value (`get_daily_means`) and then interpolate those representative values back onto the subdaily scale (`fill_daily_to_subdaily`).
The user creates the class instance by providing a 1D array of datetimes that must be strictly increasing and evenly spaced, with a spacing in seconds that is a factor of 86400 (24 * 60 * 60) - which is just another way of saying that the observation times should be consistent from day to day.
At present, those datetimes are expected to start on the first observation time of a day and end on the last, so that they can be neatly wrapped onto a 2D array with date and time axes:
one_d = np.array(
["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00",
"2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]
)
two_d = np.array(
[["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00"],
["2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]]
)
time_axis = np.array(["06:00", "12:00", "18:00", "24:00"])
date_axis = np.array(["2000-01-01", "2000-01-02"])
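For complete days like this, the wrapping is effectively just a reshape (a sketch, not the actual implementation):

```python
import numpy as np

# The complete-days sequence from above wraps cleanly onto (date, time of day).
one_d = np.array(
    ["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00",
     "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]
)
two_d = one_d.reshape(2, 4)  # 2 dates x 4 observation times per day
```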
The problem arises when the datetime sequence does not cover complete days, for example:
one_d = np.array(
["2000-01-01 18:00", "2000-01-01 24:00", "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00"]
)
Note
The values being sampled to daily values or filled can have more than one dimension, but dimension 0 is assumed to be the datetime axis.
Solutions (not an exhaustive list!)
1. Drop the use of 2D indexing - this probably makes the fill step easier, but means the indices for sampling daily means would be the length of the datetime sequence rather than a nicely wrapped slice through an array axis.
2. Pad the datetimes and then the values being sampled and filled, to allow the current indexing to be used with incomplete days.
About option 1, is the main issue to be solved taking the average, or are there other places where it is really handy to have the two axes?
About option 2, you mentioned above that cropping to full days (instead of padding out) would be "irritating" - is that because of the loss of data, or are there other reasons why this is a bad option? Also, where in the code are the values sampled? Do we need to actually pad them out, or could we add an offset parameter to the function call, which would tell us where the real values start, so that we can implement alternative handling for the missing values?
The main convenience of the current method is that you can have a short index along the daily time axis and conveniently take an average of sampled indices on that axis to collapse to a representative value per day. But it isn't really useful outside this method/class, I think.
The irritation at the moment is that users are forced to crop inputs to whole days. If you're comparing two sites at different longitudes, you often need to correct UTC time to local time, and then having to crop each site to full days is an extra step. There is separately the irritation that there can be data loss: if you are trying to sample an hour around noon and your time series starts at 9 am, then you can perfectly reasonably sample that first day, but cropping to whole days would discard it.
The workflow for the function is: the `get_daily_means` method takes the values to be sampled, and is where any padding would need to take place. At the moment this uses logical indices along the daily time axis, so the input data would need padding along that axis to the start and end times. Doing the cropping for the users might be the easiest option. However, what I was trying to get my head around is: do we really have to physically alter the arrays, or could they stay as they are and the handling be altered? E.g. by providing something like `day_start_at`, `day_end_at` and `datetime_delta` as input arguments (which could potentially even be computed), and then checking the offset on the first and last day and handling those as special cases. I might be overthinking this, though.
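Something along these lines might work for computing those offsets from the datetimes themselves rather than padding the arrays. This is an illustrative sketch, not existing pyrealm code, and it assumes the daily observation slots are aligned to midnight (unlike the 06:00-24:00 example above):

```python
import numpy as np

# Toy sequence: 6-hourly observations starting part way through the first day.
datetimes = np.arange(
    np.datetime64("2000-01-01 18:00"),
    np.datetime64("2000-01-03 06:00"),
    np.timedelta64(6, "h"),
)
spacing = datetimes[1] - datetimes[0]
n_per_day = int(np.timedelta64(1, "D") / spacing)

# Number of observation slots missing before the first and after the last value.
first_midnight = datetimes[0].astype("datetime64[D]").astype(datetimes.dtype)
n_missing_start = int((datetimes[0] - first_midnight) / spacing)
n_missing_end = -(len(datetimes) + n_missing_start) % n_per_day
# n_missing_start = 3 and n_missing_end = 3, giving 12 slots over 3 days.
```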
We don't really want to crop to full days (as is compulsory at the moment) because there is that issue of unnecessary data loss. I think the only way to avoid altering the input arrays is if the indexing is along the whole sequence of datetimes - you could then identify which observed times are in the window for each day and then group by a day value (as in that link in the comment above). So something like:
one_d = np.array(
[ "2000-01-01 18:00", "2000-01-01 24:00", "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00"]
)
in_window = np.array([False, False, False, True, False])
day = np.array([1,1,2,2,2])
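A hedged sketch of using those two arrays to get one representative value per day, here via `np.unique` and `np.bincount` rather than `reduceat` (the values are invented, and the window only picks out noon on the second day):

```python
import numpy as np

# One value per observation in the 1D datetime sequence above.
values = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
in_window = np.array([False, False, False, True, False])
day = np.array([1, 1, 2, 2, 2])

# Keep only observations inside the daily window, then average them per day.
day_in_window = day[in_window]
unique_days, inverse = np.unique(day_in_window, return_inverse=True)
daily_means = (
    np.bincount(inverse, weights=values[in_window]) / np.bincount(inverse)
)  # one mean per day that has any observations in the window
```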
With padding, you would pad the values by two at the start (missing 06:00 and 12:00) and one at the end (missing 24:00), and then you have 8 values that can be wrapped into a daily time dimension of length 4 and two days:
In [18]: x = np.array([[1,2],[3,4],[5,6], [7,8], [9, 10]], dtype='float32')
In [19]: np.pad(x, [[2, 1], [0,0]], constant_values=np.nan)
Out[19]:
array([[nan, nan],
[nan, nan],
[ 1., 2.],
[ 3., 4.],
[ 5., 6.],
[ 7., 8.],
[ 9., 10.],
[nan, nan]], dtype=float32)
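Continuing that example, the padded array could then be wrapped and averaged with something like the following (a sketch assuming two days of four observation times, not the actual pyrealm code):

```python
import numpy as np

# Five subdaily observations of two variables, padded to 8 rows as shown above.
x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype="float32")
padded = np.pad(x, [[2, 1], [0, 0]], constant_values=np.nan)

# Wrap onto (date, time of day, other dims) and average over the time axis,
# ignoring the NaN padding on the incomplete first and last days.
wrapped = padded.reshape(2, 4, *padded.shape[1:])
daily_means = np.nanmean(wrapped, axis=1)  # array([[2., 3.], [7., 8.]])
```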
I think that second solution is easier to get to from the current code, but don't know what the performance hit is like compared to calculating the indices and grouping by day.
@davidorme , is it likely that the data will be collected at smaller than hourly intervals, and/or at intervals which are not multiples of hours? The code handles the spacing/timedelta as seconds, but it would be a bit easier in some places to work with hours.
There is definitely half-hourly data and I vaguely remember people talking about data with 15 minute intervals. IIRC, the critical issue is that `np.datetime64` and `np.timedelta64` are both stored as integers under the hood, so you can't store fractional hours. That is a deal breaker because it means you could only accurately handle data that happens to be recorded on whole hours.
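For what it's worth, a quick illustration of that unit issue (nothing pyrealm-specific):

```python
import numpy as np

# timedelta64 stores an integer count of a fixed unit, so half an hour cannot be
# represented in hour units, but is exact in minutes or seconds.
half_hour = np.timedelta64(30, "m")
half_hour.astype("timedelta64[h]")                  # truncates to 0 hours
np.timedelta64(1800, "s") / np.timedelta64(1, "h")  # 0.5, but only as a float ratio
```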
Just to document it here, too: the datetime arrays need to be padded with valid datetimes, as anything else causes issues at various points in the code. The values array can/must be padded with NaNs.
@davidorme , are the values in the value array normally integers or floats? In `test_fast_slow_scaler.py` the test arrays are filled with integers. However, integer arrays cannot be padded with NaNs (NaN is by definition a float).
They should always be float for that exact reason. I thought that was trapped somewhere but it might not be.
I am now looking at the check at https://github.com/ImperialCollegeLondon/pyrealm/blob/90b7548a3bb31d34b67e7cc80ced399b513f5c2b/pyrealm/pmodel/subdaily.py#L200 - does this mean that `env` also has to be padded?
Rather than padding another array, it could potentially be better to check `fast_slow_scaler.num_missing_values_start` and `fast_slow_scaler.num_missing_values_end`.
> I am now looking at the check at ... Does this mean that `env` also has to be padded?
The `env` is the `PModelEnvironment`, which contains up to six arrays that might be involved in the subdaily scaling (see here: ...). That is in a new branch aimed at closing #177 that hopefully gives a much clearer flow of how the scaler fits into the actual subdaily model itself. Basically, all of the `env` variables need to be padded to allow a daily value to be calculated across the whole time series. That moves the 'fast' data onto the daily scale. Then the daily values are used to model behaviour back on the fast time scale - and those should be truncated back to the original `env` shape.
We don't really want to directly pad the contents of `env` by reference - but equally, copying all of that data doesn't make a lot of sense.
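For what it's worth, a rough sketch of the pad / take daily means / fill back / truncate flow being described. Here `pad_start` and `pad_end` stand in for `num_missing_values_start` / `num_missing_values_end`, and the exact call signatures of the scaler methods are assumptions rather than the current API:

```python
import numpy as np

def scale_variable(values, scaler, pad_start, pad_end):
    """Hypothetical helper: pad one env variable, scale to daily and back, truncate."""
    # Pad the datetime axis (dimension 0) with NaN to make up complete days.
    pad_width = [(pad_start, pad_end)] + [(0, 0)] * (values.ndim - 1)
    padded = np.pad(values, pad_width, constant_values=np.nan)

    # Fast data onto the daily scale, then daily values back onto the fast scale.
    daily = scaler.get_daily_means(padded)
    filled = scaler.fill_daily_to_subdaily(daily)

    # Truncate back to the original shape so env itself is never modified.
    return filled[pad_start: filled.shape[0] - pad_end]
```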
Describe the bug
The `FastSlowScaler` class expects a whole number of days' worth of subdaily observations and throws an error if there are incomplete days at either end. That is more or less sensible, but is really limiting when dealing with global applications, as time series often have to be shifted from UTC to local time. That kind of shift typically leaves incomplete days at either end, which doesn't really matter for applying the model, and having to subset the data down to complete days is irritating.