At the moment, the internal structure is built around being able to wrap the provided observation datetimes into a dates x time of day
view, which is useful for then taking daily averages along the time of day axis. Padding out the observations to restore complete days would preserve that, but would mean that the values would have to be similarly padded - at first glance that seems like a confusing thing to implement. Having a one-dimensional array of sampled indices and then a daily grouping variable seems cleaner?
The approach here, using `np.ufunc.reduceat`, looks like it could work well: https://stackoverflow.com/questions/50950231/group-by-with-numpy-mean
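For reference, a minimal sketch of that group-by-mean pattern using `np.add.reduceat`, assuming the observations are already sorted by day (the arrays here are invented for illustration):

```python
import numpy as np

# Toy data: five subdaily observations with a sorted day label for each one.
values = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
day = np.array([1, 1, 2, 2, 2])

# Index of the first observation of each day (change points in the day labels).
starts = np.r_[0, np.flatnonzero(np.diff(day)) + 1]

# Sum each day's slice with reduceat, then divide by the observation counts.
sums = np.add.reduceat(values, starts)
counts = np.diff(np.r_[starts, len(values)])
daily_means = sums / counts  # array([3.5, 6.0])
```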
In more detail:
This class provides a way to map observations at subdaily scales (e.g. hourly) onto a representative daily value (`get_daily_means`) and then interpolate those representative values back onto the subdaily scale (`fill_daily_to_subdaily`).
The user creates the class instance by providing a 1D array of datetimes that must be strictly increasing and evenly spaced, with a spacing in seconds that is a factor of 86400 (24 * 60 * 60) - which is just another way of saying that the observation times should be consistent from day to day.
At present, those datetimes are expected to start on the first observation time of a day and end on the last, so that they can be neatly wrapped onto a 2D array with date and time axes:
one_d = np.array(
["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00",
"2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]
)
two_d = np.array(
[["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00"],
["2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]]
)
time_axis = np.array(["06:00", "12:00", "18:00", "24:00"])
date_axis = np.array(["2000-01-01", "2000-01-02"])
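For complete days like this, the wrapping is effectively just a reshape (a sketch, not the actual implementation):

```python
import numpy as np

# The complete-days sequence from above wraps cleanly onto (date, time of day).
one_d = np.array(
    ["2000-01-01 06:00", "2000-01-01 12:00", "2000-01-01 18:00", "2000-01-01 24:00",
     "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00", "2000-01-02 24:00"]
)
two_d = one_d.reshape(2, 4)  # 2 dates x 4 observation times per day
```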
The problem arises when the datetime sequence does not cover complete days, for example:
one_d = np.array(
["2000-01-01 18:00", "2000-01-01 24:00", "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00"]
)
Note
The values being sampled to daily values or filled can have more than one dimension, but dimension 0 is assumed to be the datetime axis.
Solutions (not an exhaustive list!)
1. Drop the use of 2D indexing - this probably makes the fill step easier, but means the indices for sampling daily means would be the length of the datetime sequence rather than a nicely wrapped slice through an array axis.
2. Pad the datetimes and then the values being sampled and filled, to allow the current indexing to be used with incomplete days.
About option 1, is the main issue to be solved taking the average, or are there other places where it is really handy to have the two axes?
About option 2, you mentioned above that cropping to full days (instead of padding out) would be "irritating" - is that because of the loss of data, or are there other reasons why this is a bad option? Also, where in the code are the values sampled? Do we need to actually pad them out, or could we add an offset parameter to the function call, which would tell us where the real values start, so that we can implement alternative handling for the missing values?
The main convenience of the current method is that you can have a short index along the daily time axis and conveniently take an average of sampled indices on that axis to collapse to a representative value per day. But it isn't really useful outside this method/class, I think.
The irritation at the moment is that users are forced to crop inputs to whole days. If you're comparing two sites at different longitudes, you often need to correct UTC time to local time, and then having to crop each site to full days is an extra step. There is separately the irritation that there can be data loss: if you are trying to sample an hour around noon and your time series starts at 9 am, then you can perfectly reasonably sample that first day, but cropping to whole days would discard it.
The workflow for the function is: the `get_daily_means` method takes the values to be sampled, and is where any padding would need to take place. At the moment this uses logical indices along the daily time axis, so the input data would need padding along that axis to the start and end times. Doing the cropping for the users might be the easiest option. However, what I was trying to get my head around is: do we really have to physically alter the arrays, or could they stay as they are and the handling be altered? E.g. by providing something like `day_start_at`, `day_end_at` and `datetime_delta` as input arguments (which could potentially even be computed), and then checking the offset on the first and last day and handling those as special cases. I might be overthinking this, though.
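Something along these lines might work for computing those offsets from the datetimes themselves rather than padding the arrays. This is an illustrative sketch, not existing pyrealm code, and it assumes the daily observation slots are aligned to midnight (unlike the 06:00-24:00 example above):

```python
import numpy as np

# Toy sequence: 6-hourly observations starting part way through the first day.
datetimes = np.arange(
    np.datetime64("2000-01-01 18:00"),
    np.datetime64("2000-01-03 06:00"),
    np.timedelta64(6, "h"),
)
spacing = datetimes[1] - datetimes[0]
n_per_day = int(np.timedelta64(1, "D") / spacing)

# Number of observation slots missing before the first and after the last value.
first_midnight = datetimes[0].astype("datetime64[D]").astype(datetimes.dtype)
n_missing_start = int((datetimes[0] - first_midnight) / spacing)
n_missing_end = -(len(datetimes) + n_missing_start) % n_per_day
# n_missing_start = 3 and n_missing_end = 3, giving 12 slots over 3 days.
```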
We don't really want to crop to full days (as is compulsory at the moment) because there is that issue of unnecessary data loss. I think the only way to avoid altering the input arrays is if the indexing is along the whole sequence of datetimes - you could then identify which observed times are in the window for each day and then group by a day value (as in that link in the comment above). So something like:
one_d = np.array(
[ "2000-01-01 18:00", "2000-01-01 24:00", "2000-01-02 06:00", "2000-01-02 12:00", "2000-01-02 18:00"]
)
in_window = np.array([False, False, False, True, False])
day = np.array([1,1,2,2,2])
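A hedged sketch of using those two arrays to get one representative value per day, here via `np.unique` and `np.bincount` rather than `reduceat` (the values are invented, and the window only picks out noon on the second day):

```python
import numpy as np

# One value per observation in the 1D datetime sequence above.
values = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
in_window = np.array([False, False, False, True, False])
day = np.array([1, 1, 2, 2, 2])

# Keep only observations inside the daily window, then average them per day.
day_in_window = day[in_window]
unique_days, inverse = np.unique(day_in_window, return_inverse=True)
daily_means = (
    np.bincount(inverse, weights=values[in_window]) / np.bincount(inverse)
)  # one mean per day that has any observations in the window
```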
With padding, you would pad the values by two at the start (missing 06:00 and 12:00) and one at the end (missing 24:00), and then you have 8 values that can be wrapped into a daily time dimension of length 4 and two days:
In [18]: x = np.array([[1,2],[3,4],[5,6], [7,8], [9, 10]], dtype='float32')
In [19]: np.pad(x, [[2, 1], [0,0]], constant_values=np.nan)
Out[19]:
array([[nan, nan],
[nan, nan],
[ 1., 2.],
[ 3., 4.],
[ 5., 6.],
[ 7., 8.],
[ 9., 10.],
[nan, nan]], dtype=float32)
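Continuing that example, the padded array could then be wrapped and averaged with something like the following (a sketch assuming two days of four observation times, not the actual pyrealm code):

```python
import numpy as np

# Five subdaily observations of two variables, padded to 8 rows as shown above.
x = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype="float32")
padded = np.pad(x, [[2, 1], [0, 0]], constant_values=np.nan)

# Wrap onto (date, time of day, other dims) and average over the time axis,
# ignoring the NaN padding on the incomplete first and last days.
wrapped = padded.reshape(2, 4, *padded.shape[1:])
daily_means = np.nanmean(wrapped, axis=1)  # array([[2., 3.], [7., 8.]])
```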
I think that second solution is easier to get to from the current code, but don't know what the performance hit is like compared to calculating the indices and grouping by day.
@davidorme , is it likely that the data will be collected at smaller than hourly intervals, and/or at intervals which are not multiples of hours? The code handles the spacing/timedelta as seconds, but it would be a bit easier in some places to work with hours.
There is definitely half-hourly data and I vaguely remember people talking about data with 15 minute intervals. IIRC, the critical issue is that `np.datetime64` and `np.timedelta64` are both stored as integers under the hood, so you can't store fractional hours. That is a deal breaker because it means you could only accurately handle data that happens to be recorded on whole hours.
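For what it's worth, a quick illustration of that unit issue (nothing pyrealm-specific):

```python
import numpy as np

# timedelta64 stores an integer count of a fixed unit, so half an hour cannot be
# represented in hour units, but is exact in minutes or seconds.
half_hour = np.timedelta64(30, "m")
half_hour.astype("timedelta64[h]")                  # truncates to 0 hours
np.timedelta64(1800, "s") / np.timedelta64(1, "h")  # 0.5, but only as a float ratio
```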
Just to document it here, too: the datetime arrays need to be padded with valid datetimes, as anything else causes issues at various points in the code. The values array can/must be padded with NaNs.
@davidorme , are the values in the value array normally integers or floats? In `test_fast_slow_scaler.py` the test arrays are filled with integers. However, integer arrays cannot be padded with NaNs (NaN is by definition a float).
They should always be float for that exact reason. I thought that was trapped somewhere but it might not be.
I am now looking at the check at https://github.com/ImperialCollegeLondon/pyrealm/blob/90b7548a3bb31d34b67e7cc80ced399b513f5c2b/pyrealm/pmodel/subdaily.py#L200 - does this mean that `env` also has to be padded?
Rather than padding another array, it could potentially be better to check `fast_slow_scaler.num_missing_values_start` and `fast_slow_scaler.num_missing_values_end`.
> I am now looking at the check at ... Does this mean that `env` also has to be padded?
The `env` is the `PModelEnvironment`, which contains up to six arrays that might be involved in the subdaily scaling (see here: ...). That is in a new branch aimed at closing #177 that hopefully gives a much clearer flow of how the scaler fits into the actual subdaily model itself. Basically, all of the `env` variables need to be padded to allow a daily value to be calculated across the whole time series. That moves the 'fast' data onto the daily scale. Then the daily values are used to model behaviour back on the fast time scale - and those should be truncated back to the original `env` shape.
We don't really want to directly pad the contents of `env` by reference - but equally, copying all of that data doesn't make a lot of sense.
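For what it's worth, a rough sketch of the pad / take daily means / fill back / truncate flow being described. Here `pad_start` and `pad_end` stand in for `num_missing_values_start` / `num_missing_values_end`, and the exact call signatures of the scaler methods are assumptions rather than the current API:

```python
import numpy as np

def scale_variable(values, scaler, pad_start, pad_end):
    """Hypothetical helper: pad one env variable, scale to daily and back, truncate."""
    # Pad the datetime axis (dimension 0) with NaN to make up complete days.
    pad_width = [(pad_start, pad_end)] + [(0, 0)] * (values.ndim - 1)
    padded = np.pad(values, pad_width, constant_values=np.nan)

    # Fast data onto the daily scale, then daily values back onto the fast scale.
    daily = scaler.get_daily_means(padded)
    filled = scaler.fill_daily_to_subdaily(daily)

    # Truncate back to the original shape so env itself is never modified.
    return filled[pad_start: filled.shape[0] - pad_end]
```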
Describe the bug
The `FastSlowScaler` class expects a whole number of days' worth of subdaily observations and throws an error if there are incomplete days at either end. That is more or less sensible, but is really limiting when dealing with global applications, as time series often have to be shifted from UTC to local time. That kind of shift typically leaves incomplete days at either end, which doesn't really matter for applying the model, and having to subset the data down to complete days is irritating.