ImperialCollegeLondon / pyrealm

Development of the pyrealm package, providing an integrated toolbox for modelling plant productivity, growth and demography using Python.
https://pyrealm.readthedocs.io/

Handling incomplete days at the start and end of time series in FastSlowScaler #72

Closed · davidorme closed this issue 7 months ago

davidorme commented 1 year ago

Describe the bug

The FastSlowScaler class expects a whole number of days of subdaily observations and throws an error if there are incomplete days at either end. That is more or less sensible, but it is really limiting for global applications, where time series often have to be shifted from UTC to local time. That kind of shift typically leaves incomplete days at either end, which doesn't really matter for applying the model, and having to subset the data down to complete days is irritating.

davidorme commented 1 year ago

At the moment, the internal structure is built around being able to wrap the provided observation datetimes into a dates x time of day view, which is useful for then taking daily averages along the time of day axis. Padding out the observations to restore complete days would preserve that, but would mean that the values would have to be similarly padded - at first glance that seems like a confusing thing to implement. Having a one dimensional array of sampled indices and then a daily grouping variable seems cleaner?

The approach here, using np.ufunc.reduceat, looks like it could work well: https://stackoverflow.com/questions/50950231/group-by-with-numpy-mean
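
A minimal sketch of that approach, assuming the observations are sorted so that each day forms a contiguous run (the day labels and values here are made up):

import numpy as np

# Daily means via np.add.reduceat: sum each contiguous run of day labels,
# then divide by the run lengths.
day = np.array([0, 0, 1, 1, 1])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

starts = np.flatnonzero(np.r_[True, day[1:] != day[:-1]])  # first index of each day
sums = np.add.reduceat(values, starts)
counts = np.diff(np.r_[starts, day.size])
daily_means = sums / counts  # array([1.5, 4. ])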

davidorme commented 11 months ago

In more detail:

import numpy as np

# A 1D sequence of datetimes covering two complete days (illustrative
# strings - "24:00" marks the end-of-day observation)
one_d = np.array(
    ["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00",
     "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]
)

# The same datetimes wrapped into a (dates x time of day) view
two_d = np.array(
    [["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00"],
     ["2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]]
)
time_axis = np.array(["06:00", "12:00", "18:00", "24:00"])
date_axis = np.array(["2000-01-01", "2000-01-02"])

# The problem case: a sequence with incomplete days at both ends
one_d_incomplete = np.array(
    ["2000-01-01 18:00", "2000-01-01 24:00",
     "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00"]
)

Note

The values being sampled to daily values or filled can be more than 1D, but dimension 0 is assumed to be the datetime axis.
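
As a sketch with made-up values - two sites along a second dimension - the wrapped view and daily averaging then look like:

import numpy as np

# Two complete days of 4 observations at 2 sites: dimension 0 is datetime.
values = np.arange(16, dtype="float32").reshape(8, 2)
# Wrap into (n_days, n_obs_per_day, n_sites) and average along time of day.
wrapped = values.reshape(2, 4, 2)
daily_means = wrapped.mean(axis=1)  # shape (2, 2)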

Solutions (not an exhaustive list!)

  1. Drop the use of 2D indexing - this probably makes the fill step easier but means the indices for sampling daily means would be the length of the datetime sequence rather than a nicely wrapped slice through an array axis.

  2. Pad the datetimes and then the values being sampled and filled to allow the current indexing to be used with incomplete days.

davidorme commented 11 months ago

The workflow for the function is:

  1. Provide the time series
  2. Set the daily sampling window
  3. Apply the get_daily_means method - this takes the values to be sampled and is where any padding would need to take place. At the moment, this uses logical indices along the daily time axis, so the input data would need padding at the start and end to complete the first and last days (see the sketch below).
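
As a sketch of that current behaviour (illustrative arrays, not the pyrealm API):

import numpy as np

# One value per datetime across two complete days, as a (dates, time of day) view.
values = np.arange(8, dtype="float32").reshape(2, 4)
# Logical index along the daily time axis, e.g. a 12:00-18:00 sampling window.
window = np.array([False, True, True, False])
daily_means = values[:, window].mean(axis=1)  # array([1.5, 5.5])
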
MarionBWeinzierl commented 11 months ago

Doing the cropping for the users might be the easiest option. However, what I was trying to get my head around is: do we really have to physically alter the arrays, or could they stay as they are and the handling be altered? E.g., by providing something like day_start_at, day_end_at and datetime_delta as input arguments (these could potentially even be computed), and then checking the offset on the first and last day and applying special-case handling there. I might be overthinking this, though.
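
As a sketch, those offsets could be computed directly from the datetimes, assuming a regular spacing and a sampling grid aligned to midnight (the n_missing_* names are illustrative, not existing API):

import numpy as np

datetimes = np.array(
    ["2000-01-01 18:00", "2000-01-02 00:00", "2000-01-02 06:00",
     "2000-01-02 12:00", "2000-01-02 18:00"], dtype="datetime64[m]"
)
spacing = datetimes[1] - datetimes[0]  # 6 hours
first_day = datetimes[0].astype("datetime64[D]")
last_day_end = datetimes[-1].astype("datetime64[D]") + np.timedelta64(1, "D")

# Observations missing from the incomplete first and last days.
n_missing_start = int((datetimes[0] - first_day) / spacing)        # 3
n_missing_end = int((last_day_end - datetimes[-1]) / spacing) - 1  # 0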

davidorme commented 11 months ago

We don't really want to crop to full days (as is compulsory at the moment) because there is that issue of unnecessary data loss. I think the only way to avoid altering the input arrays is if the indexing is along the whole sequence of datetimes - you could then identify which observed times are in the window for each day and then group by a day value (as in the link in the comment above). So something like:

one_d = np.array(
    [ "2000-01-01 18:00", "2000-01-01 24:00",  "2000-01-02 06:00",  "2000-01-02 12:00", "2000-01-02 18:00"]
)
in_window = np.array([False, False, False, True, False])
day = np.array([1,1,2,2,2])
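
Continuing from those arrays with some made-up values, the daily means can then come from np.bincount without reshaping anything - days with no in-window observations fall out as 0/0, i.e. nan:

import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

sums = np.bincount(day[in_window], weights=values[in_window], minlength=3)[1:]
counts = np.bincount(day[in_window], minlength=3)[1:]
with np.errstate(invalid="ignore"):
    daily_means = sums / counts  # array([nan, 4.]): day 1 is incomplete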

With padding, you would pad the values by two at the start (missing 06:00 and 12:00) and by one at the end (missing 24:00), and then you have 8 values that can be wrapped into a daily time dimension of length 4 and two days:

import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype="float32")
np.pad(x, [[2, 1], [0, 0]], constant_values=np.nan)
# array([[nan, nan],
#        [nan, nan],
#        [ 1.,  2.],
#        [ 3.,  4.],
#        [ 5.,  6.],
#        [ 7.,  8.],
#        [ 9., 10.],
#        [nan, nan]], dtype=float32)
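
Completing that sketch, continuing from x above: the padded array wraps straight into the daily view, and np.nanmean ignores the fill values:

wrapped = np.pad(x, [[2, 1], [0, 0]], constant_values=np.nan).reshape(2, 4, 2)
daily_means = np.nanmean(wrapped, axis=1)
# array([[2., 3.],
#        [7., 8.]], dtype=float32)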

I think that second solution is easier to get to from the current code, but I don't know what the performance hit is like compared to calculating the indices and grouping by day.

MarionBWeinzierl commented 10 months ago

@davidorme , is it likely that the data will be collected at intervals of less than an hour, and/or at intervals that are not whole multiples of an hour? The code handles the spacing/timedelta as seconds, but it would be a bit easier in some places to work with hours.

davidorme commented 10 months ago

There is definitely half hourly data and I vaguely remember people talking about data with 15 minute intervals. IIRC, the critical issue is that np.datetime64 and np.timedelta64 are both stored as integers under the hood, so you can't store fractional hours. That is a deal breaker because it means working in hours could only accurately handle data that happens to be recorded on the hour.
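
A quick illustration of why hours are unsafe - converting the integer representation to a coarser unit truncates:

import numpy as np

np.timedelta64(30, "m").astype("timedelta64[h]")  # -> 0 hours
np.timedelta64(90, "m").astype("timedelta64[h]")  # -> 1 hour, not 1.5
np.timedelta64(30, "m").astype("timedelta64[s]")  # -> 1800 seconds, exact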

MarionBWeinzierl commented 9 months ago

Just to document it here, too: The datetime arrays need to be padded with the valid datetimes, as it otherwise causes issues at various points in the code. The values array can/must be padded with NaNs.

MarionBWeinzierl commented 9 months ago

@davidorme , are the values in the value array normally integers or floats? In test_fast_slow_scaler.py the test arrays are filled with integers. However, integer arrays cannot be padded with NaNs (as NaN is by definition a float).

davidorme commented 9 months ago

They should always be float for that exact reason. I thought that was trapped somewhere but it might not be.
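
If it isn't trapped, a minimal guard might look like this (a hypothetical helper, not current pyrealm code):

import numpy as np

def _require_float(values: np.ndarray) -> np.ndarray:
    """Reject integer inputs: padding with np.nan needs a floating dtype."""
    values = np.asarray(values)
    if not np.issubdtype(values.dtype, np.floating):
        raise TypeError(f"values must be a float array, got {values.dtype}")
    return values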

MarionBWeinzierl commented 9 months ago

I am now looking at the check at https://github.com/ImperialCollegeLondon/pyrealm/blob/90b7548a3bb31d34b67e7cc80ced399b513f5c2b/pyrealm/pmodel/subdaily.py#L200

Does this mean that env also has to be padded?

MarionBWeinzierl commented 9 months ago

Rather than padding another array, it could potentially be better to check fast_slow_scaler.num_missing_values_start and fast_slow_scaler.num_missing_values_end.

davidorme commented 9 months ago

> I am now looking at the check at
>
> https://github.com/ImperialCollegeLondon/pyrealm/blob/90b7548a3bb31d34b67e7cc80ced399b513f5c2b/pyrealm/pmodel/subdaily.py#L200
>
> Does this mean that env also has to be padded?

The env is the PModelEnvironment, which contains up to six arrays that might be involved in the subdaily scaling (see here):

https://github.com/ImperialCollegeLondon/pyrealm/blob/1f4624225b31769bfab8ad02f6c3bc8bd9f50ed6/pyrealm/pmodel/new_subdaily.py#L112

That is in a new branch, aimed at closing #177, that hopefully gives a much clearer flow of how the scaler fits into the actual subdaily model itself. Basically, all of the env variables need to be padded to allow a daily value to be calculated across the whole time series. That moves the 'fast' data onto the daily scale. Then the daily values are used to model behaviour back on the fast time scale - and those should be truncated back to the original env shape.

We don't really want to directly pad the contents of env by reference - but equally copying all of that data doesn't make a lot of sense.
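
As a sketch of that round trip for a single env variable - np.pad necessarily returns a padded copy, which is exactly the trade-off noted above. get_daily_means is from the workflow earlier in the thread, while fill_daily_to_subdaily and the argument names are assumptions:

import numpy as np

def scaled_round_trip(scaler, values, n_start, n_end):
    # Pad along the datetime axis only, leaving the env arrays untouched.
    pad_width = [[n_start, n_end]] + [[0, 0]] * (values.ndim - 1)
    padded = np.pad(values, pad_width, constant_values=np.nan)
    daily = scaler.get_daily_means(padded)        # fast -> daily
    fast = scaler.fill_daily_to_subdaily(daily)   # daily -> fast (assumed API)
    return fast[n_start: fast.shape[0] - n_end]   # truncate to the original shape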