Hi Pim,
Thanks for opening this issue. I am a bit confused that you state:
the problem comes down to the fact that lilio uses intervals which are closed on the right
Because they are not supposed to be closed on the right. They are only supposed to be closed on the left:
https://github.com/AI4S2S/lilio/blob/c0ef954027d9b076714e6cf7b195d5650a5e704b/lilio/calendar.py#L380
https://github.com/AI4S2S/lilio/blob/c0ef954027d9b076714e6cf7b195d5650a5e704b/lilio/calendar.py#L390
To avoid the exact issue that you described.
This means that the interval [2000-12-01 , 2001-01-01) should be equivalent to date >= 2000-12-01 and date < 2001-01-01.
Perhaps something is going wrong somewhere; looking at the unit tests, I am not sure if this is working 100% correctly :thinking:
By the way, square brackets denote a closed interval while round brackets denote an open interval.
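For concreteness, here is a minimal sketch (plain pandas, not lilio's internal code) of what a left-closed, right-open interval means in practice:

import pandas as pd

# Left-closed, right-open interval: [2000-12-01, 2001-01-01)
interval = pd.Interval(
    pd.Timestamp("2000-12-01"), pd.Timestamp("2001-01-01"), closed="left"
)

print(pd.Timestamp("2000-12-01") in interval)  # True: the left bound is included
print(pd.Timestamp("2000-12-31") in interval)  # True
print(pd.Timestamp("2001-01-01") in interval)  # False: the right bound is excluded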
Hi Bart,
Thanks for the fast response! Sorry for the confusion, you're right. It should be:
the problem comes down to the fact that lilio uses intervals which are not closed on the right.
The issues described in examples 1 and 2 happen when the interval is not closed on the right. Hence, we propose to close the intervals on both sides.
I did not mean for the PR to be opened right away, but rather to first await your response. Since it is already open, I have rewritten the problem statement. The tests of #68 are failing because of linting / docstring tests. We can fix that later, but first let's get the problem statement clear.
OK. It seems that something is going wrong in the checks then.
The function that tries to find out if a date is within a certain interval should be computing this correctly: https://github.com/AI4S2S/lilio/blob/c0ef954027d9b076714e6cf7b195d5650a5e704b/lilio/resampling.py#L112-L133
(note the "less" vs. "less_equal").
Perhaps this is going wrong in some other check, causing a year to be dropped when it should not be (e.g. in the calendar's map_years or map_to_data methods).
[2001-12-01 , 2002-01-01) should be equivalent to [2001-12-01 , 2001-12-31].
Hi Pim, Sem,
We had a look and did find a small bug in the calendar mapping. That's fixed in #70
However, the behavior you described is actually the intended behavior. In lilio we do not try to infer the bounds for the time coordinates, and thus cannot know for sure what these are.
Let's say you have a dataset with time coordinates separated by 48 hours, and the last data point is at 2015-12-30. How can Lilio properly infer that it can map an interval with an (open) right bound of 2016-01-01 onto this data point? This is why we check whether the final date of the input data exceeds the right bound of the last calendar interval.
To summarize: the problem lies in the fact that timestamps of data usually (but not always...) represent the start of the interval. An interesting change would be to support CF-convention time bounds, where each time coordinate has a left and a right bound. In that case we would be able to determine without ambiguity how to map a calendar to the input data.
For your case I'd advise padding the input data with an extra timestamp if you really need that last year of data.
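A minimal illustration of that ambiguity (assumed example data, not lilio code):

import pandas as pd

# 48-hourly data whose last sample falls on 2015-12-30
time_index = pd.date_range(start="2015-01-02", end="2015-12-31", freq="48h")
print(time_index[-1])  # 2015-12-30 00:00:00

# Right bound of the last (right-open) calendar interval
interval_right = pd.Timestamp("2016-01-01")

# Without knowing the data frequency, it is ambiguous whether the 2015-12-30
# sample represents just that day or the full 48-hour period up to 2016-01-01.
# The safe check looks for a data point beyond the right bound:
print(time_index[-1] > interval_right)  # False, so the last interval is dropped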
I'm not sure if I follow you.
I agree that these intervals, [2001-12-01 , 2002-01-01) and [2001-12-01 , 2001-12-31], should be identical, but currently they are not if your data ends in 2001: while you actually have the data to aggregate up to 2001-12-31, it will not be aggregated because 2002-01-01 is not found.
self._map_year(max_year).iloc[0].right is currently 2002-01-01 (because of the open interval), which is greater than self._last_timestamp, which is 2001-12-31. Hence, self._map_year(max_year).iloc[0].right > self._last_timestamp is True, and so max_year is reduced by 1, thereby completely dropping the year 2001 (even though the data to actually include 2001 is there).
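In pseudo-code, the check I am referring to effectively does this (simplified sketch with hard-coded values, not the actual lilio code):

import pandas as pd

last_timestamp = pd.Timestamp("2001-12-31")       # last timestamp in the data
last_interval_right = pd.Timestamp("2002-01-01")  # right bound of the right-open interval
max_year = 2001

# Because the right bound exceeds the last timestamp, the whole year is dropped,
# even though the data covers the interval up to 2001-12-31:
if last_interval_right > last_timestamp:
    max_year -= 1

print(max_year)  # 2000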
Btw, this test does not cover the issue I am raising here, since the data's self._last_timestamp is not precisely at 2009-12-31.
Hope this helps to clarify. Maybe we should call sometime soon to discuss 'in-person'.
Thanks for the assistance and the recent PR @ClaireDons @BSchilperoort!
I agree that these intervals should be identical, but currently they are not if your data ends in 2001
The problem is not in the interval code, but actually that we need to ensure that:
self._last_timestamp > self._map_year(max_year).iloc[0].right
Let's say you have these two sets of timesteps as the index of the input data:
timeseries_a = [..., 2001-12-29, 2001-12-30, 2001-12-31].
timeseries_b = [..., 2001-12-14, 2001-12-21, 2001-12-28].
For both you would technically have enough data to fill the interval [2001-12-01 , 2002-01-01), but you have to assume or infer somehow that series A is a daily timeseries and series B is a weekly one. This can of course only work if your data is regularly spaced (or more specifically, regularly spaced in the Gregorian calendar).
Correctly inferring the frequency of a timeseries can be challenging and very finicky. For example, if you have a noleap calendar (as many GCMs do), there would be seemingly random gaps when viewed in a Gregorian calendar; or your timeseries might not be regularly spaced due to missing data points, etc.
If you had a magical function that could do this, say infer_freq, then you could do:
input_data_frequency: pd.Timedelta = infer_freq(input_data)
self._last_timestamp + input_data_frequency > self._map_year(max_year).iloc[0].right
That would solve the issue. But to err on the safe side, we currently just make sure that there is a data point beyond the right side of the last calendar interval.
The code currently does contain infer_input_data_freq, but that function takes the smallest step in the calendar to be cautious. Here we would want to actually take the largest step in the calendar if pandas.infer_freq or xarray.infer_freq don't work.
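A hedged sketch of that idea with pandas' existing pd.infer_freq (illustrative only; the use of >= instead of > is my own choice here, reflecting that the interval is open on the right):

import pandas as pd

# Daily data that ends exactly on 2001-12-31
time_index = pd.date_range(start="2001-01-01", end="2001-12-31", freq="1d")

freq_str = pd.infer_freq(time_index)               # "D" for daily data
step = pd.tseries.frequencies.to_offset(freq_str)  # offset of one day

last_timestamp = time_index[-1]
interval_right = pd.Timestamp("2002-01-01")        # right bound of the last interval

# With the inferred step, the interval [2001-12-01, 2002-01-01) can be kept:
print(last_timestamp + step >= interval_right)  # True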
Dear Bart,
Thanks for diving into this issue.
I don't mean to be annoying here, but I still believe that our solution is both on the safe side and solves our issue.
Minimal example testing data resolution of "1d" and "5d":
import numpy as np
import pandas as pd
import xarray as xr
import lilio

# Hide the full data when displaying a dataset in the notebook
xr.set_options(display_expand_data=False)

for FREQUENCY in [1, 5]:
    n = int(365.25 * 2 / FREQUENCY)
    time_index = pd.date_range(start="2015-01-01", end="2016-12-31", freq=f"{FREQUENCY}d")
    time_coord = {"time": time_index}
    y = xr.DataArray(np.random.randn(time_index.size), coords=time_coord, name="target")

    calendar = lilio.Calendar(anchor="12-01")
    calendar.add_intervals("target", length="31d")
    calendar.map_to_data(y)
    y_resampled = lilio.resample(calendar, y)

    print(f"Available years {np.unique(y.time.dt.year)}")
    print(y_resampled)
Current implementation output:
Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 1, i_interval: 1)> Size: 8B
0.0008229
Coordinates:
* anchor_year (anchor_year) int64 8B 2015
* i_interval (i_interval) int64 8B 1
left_bound (anchor_year, i_interval) datetime64[ns] 8B 2015-12-01
right_bound (anchor_year, i_interval) datetime64[ns] 8B 2016-01-01
is_target (i_interval) bool 1B True
Attributes:
lilio_version: 0.4.2
lilio_calendar_anchor_date: 12-01
lilio_calendar_code: Calendar(\n anchor='12-01',\n allow_ov...
history: 2024-03-21 13:13:31 UTC - Resampled with a L...
Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 1, i_interval: 1)> Size: 8B
0.1837
Coordinates:
* anchor_year (anchor_year) int64 8B 2015
* i_interval (i_interval) int64 8B 1
left_bound (anchor_year, i_interval) datetime64[ns] 8B 2015-12-01
right_bound (anchor_year, i_interval) datetime64[ns] 8B 2016-01-01
is_target (i_interval) bool 1B True
Attributes:
lilio_version: 0.4.2
lilio_calendar_anchor_date: 12-01
lilio_calendar_code: Calendar(\n anchor='12-01',\n allow_ov...
history: 2024-03-21 13:13:31 UTC - Resampled with a L...
Our suggested implementation's output:
Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 2, i_interval: 1)> Size: 16B
-0.1375 0.1385
Coordinates:
* anchor_year (anchor_year) int64 16B 2015 2016
* i_interval (i_interval) int64 8B 1
left_bound (anchor_year, i_interval) datetime64[ns] 16B 2015-12-01 2016...
right_bound (anchor_year, i_interval) datetime64[ns] 16B 2015-12-31 2016...
is_target (i_interval) bool 1B True
Attributes:
lilio_version: 0.4.1
lilio_calendar_anchor_date: 12-01
lilio_calendar_code: Calendar(\n anchor='12-01',\n allow_ov...
history: 2024-03-21 13:12:25 UTC - Resampled with a L...
Available years [2015 2016]
<xarray.DataArray 'target' (anchor_year: 2, i_interval: 1)> Size: 16B
0.01442 -0.6337
Coordinates:
* anchor_year (anchor_year) int64 16B 2015 2016
* i_interval (i_interval) int64 8B 1
left_bound (anchor_year, i_interval) datetime64[ns] 16B 2015-12-01 2016...
right_bound (anchor_year, i_interval) datetime64[ns] 16B 2015-12-31 2016...
is_target (i_interval) bool 1B True
Attributes:
lilio_version: 0.4.1
lilio_calendar_anchor_date: 12-01
lilio_calendar_code: Calendar(\n anchor='12-01',\n allow_ov...
history: 2024-03-21 13:12:25 UTC - Resampled with a L...
The core adaptations are:
I know you're a smart guy @BSchilperoort, so maybe I'm missing something, but I still don't see it :p
Hi Sem, because of Felix's issue (#74) I now actually have a good idea on how to solve this.
Default mode: use pd.infer_freq or xr.infer_freq to estimate the time step between data points (which should work if the data is monotonic).
"Greedy mode":
The default mode should be OK for your use case, and the greedy mode should allow users to cover as many years as possible, at the cost of technically not correctly sampling the last interval.
To solve #74, the minimum year check should receive the same treatment. (As this issue is just about the max_year).
As an extra addition to Lilio, we could make it compatible with CF-convention-like bounds. Then we would know the left/right bounds of the data and could map the years and resample 100% correctly. However, that might take more time than it's worth.
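For illustration, a minimal sketch of what CF-convention-style time bounds could look like in xarray (assumed example data; this is not an existing Lilio feature):

import numpy as np
import pandas as pd
import xarray as xr

# Daily timestamps, each with an explicit left and right bound
time = pd.date_range("2015-12-29", periods=3, freq="1d")
time_bnds = np.stack([time.values, (time + pd.Timedelta("1d")).values], axis=1)

ds = xr.Dataset(
    {"time_bnds": (("time", "bnds"), time_bnds)},
    coords={"time": time},
)
ds["time"].attrs["bounds"] = "time_bnds"

# Each data point now unambiguously covers [left, right), so a calendar
# interval ending at 2016-01-01 could be mapped without guessing the frequency.
print(ds["time_bnds"].values[-1])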
@ClaireDons should be able to tackle this issue soon.
We have addressed your problems in 2 ways:
calendar.map_to_data(..., safe=False), where even "half full" intervals will still be generated.

I think that second option, calendar.map_to_data(..., safe=False), would be an alternative solution to the "closing intervals on both sides" solution that I implemented in our forked repo. I will try that out in the future! Would be nice to sync back with the AI4S2S OG repo.
This ticket can be closed. Thanks!
Hey guys,
We have recently discovered a feature of the lilio code that leads to some undesired behaviour. We have implemented a solution and were wondering if it is something you might be interested in incorporating into the package.
Basically, the problem comes down to the fact that lilio uses intervals which are closed on the right. The problem arises in 2 scenarios:
1. You have a target interval that should include dates up to the 31st of December, say with a start point on the 1st of December. For example, intervals are defined as [2000-12-01 , 2001-01-01), [2001-12-01 , 2002-01-01), etc. The last interval will be [2015-12-01 , 2016-01-01). When indexing this on our input array we will encounter missing data, as we try to index 2016-01-01 but our data only runs until 2015-12-31. Hence, lilio discards the year 2015 as it cannot fit the data to the interval. The obvious solution, then, is to close the intervals on both sides: thus our intervals become [2000-12-01, 2000-12-31].
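As a minimal illustration of the proposed behaviour (plain pandas, not our fork's actual code):

import pandas as pd

# Closed on both sides: [2015-12-01, 2015-12-31]
closed_interval = pd.Interval(
    pd.Timestamp("2015-12-01"), pd.Timestamp("2015-12-31"), closed="both"
)

# The last timestamp of data ending on 2015-12-31 now falls inside the
# interval, so the year 2015 would no longer be dropped:
print(pd.Timestamp("2015-12-31") in closed_interval)  # True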
This is what we have implemented in our fork of lilio. What do you guys think?
Note that this fork also adds the option to request a 1-year frequency via length="1Y".
Best,
Pim