[v2.11.1] Fix rounding of Pandas datetimes in ICON CMORizer to allow installing latest Pandas version

valeriupredoi commented 2 months ago

Description

We have been pinning Pandas since at least three minor versions ago, see https://github.com/ESMValGroup/ESMValCore/pull/2394 due to a bug in Pandas that they seem to be taking forever to fix issue https://github.com/pandas-dev/pandas/issues/57002 and now we are yet again faced with the same issue with the new version of Pandas 2.2.3 - see failed nightly GA. The problem is simple: Pandas are simply unable to round to the next minute eg 12H:12M:39S should be 12H:13M:00S (for daily data), so I got fed up and wrote our own rounding function (that took me surprisingly long to get it right - about an hour, rounded to the next minute :rofl: I think this should not affect the result when Pandas finally fix their bug - but I'll let @schlunma decide :beer:

Closes #2530

Before you get started

[x] ☝ Create an issue to discuss what you are going to do

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

[x] 🧪 The new functionality is relevant and scientifically sound
[x] 🛠 This pull request has a descriptive title and labels
[x] 🛠 Code is written according to the code quality guidelines
[x] 🛠 Changes are backward compatible
[x] 🛠 Any changed dependencies have been added or removed correctly
[x] 🛠 All checks below this pull request were successful - YES, apart from silly pre-commit

To help with the number pull requests:

🙏 We kindly ask you to review two other open pull requests in this repository

codecov[bot] commented 2 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 94.77%. Comparing base (ffa2347) to head (3b6ce2b). Report is 1 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #2529 +/- ## ======================================= Coverage 94.77% 94.77% ======================================= Files 249 249 Lines 14094 14095 +1 ======================================= + Hits 13358 13359 +1 Misses 736 736 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

valeriupredoi commented 2 months ago

ayay - check out the devel test with bleeding edge deps:

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0:00:00.997120
 x: array([cftime.DatetimeProlepticGregorian(2004, 11, 4, 13, 30, 0, 997120, has_year_zero=True)],
      dtype=object)
 y: array([datetime.datetime(2004, 11, 4, 13, 30)], dtype=object)

cftime plugging in some 0.997s out of thin air, they'd get absolutely destroyed if they worked in motorsports :grin:

bouweandela commented 2 months ago

Have you tried this suggestion to work around the issue? https://github.com/pandas-dev/pandas/issues/57002#issuecomment-1906866998

valeriupredoi commented 2 months ago

Have you tried this suggestion to work around the issue? pandas-dev/pandas#57002 (comment)

Yes, it don't work, no Pd rounding works, including the one in the cmorizer 🍺

schlunma commented 2 months ago

This also doesn't sound like a workaround to me...

However, what DOES work is the hint in the description of that issue:

# Round just date works
dates[0].round("30min")
# >Timestamp('2022-01-11 12:00:00')

See here:

>>> datetimes = year_month_day + day_float
>>> print(datetimes)
0   2004-11-04 13:59:59.997120
dtype: datetime64[ns]
>>> rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)
>>> print(rounded_datetimes)
0   2004-11-04 14:00:00
dtype: datetime64[ns]

schlunma commented 2 months ago

Ahh wait, I just realized that I have an old version of pandas, sorry, will update you again once that is updated...

valeriupredoi commented 2 months ago

You need to update ninja bear to 2.2.3 Manu 😁

schlunma commented 2 months ago

Yes!! It does work with pandas v2.2.3:

>>> import pandas as pd
>>> pd.__version__
'2.2.3'
>>> time_points = [20041104.5833333]
>>> time_str = pd.Series(time_points, dtype=str)
>>> year_month_day_str = time_str.str.extract(r'(\d*)\.?\d*', expand=False)
>>> year_month_day = pd.to_datetime(year_month_day_str, format='%Y%m%d')
>>> day_float_str = time_str.str.extract(
  r'\d*(\.\d*)',...             r'\d*(\.\d*)', expand=False
...         ).fillna('0.0')
>>> day_float = pd.to_timedelta(day_float_str.astype(float), unit='D')
>>> datetimes = year_month_day + day_float
>>> print(datetimes)
0   2004-11-04 13:59:59.997120
dtype: datetime64[ns]

>>> # Old version:
>>> rounded_datetimes = datetimes.round('s')
>>> print(rounded_datetimes)
0   2004-11-04 13:59:59.997120
dtype: datetime64[ns]

>>> # New version:
>>> rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)
>>> print(rounded_datetimes)
0   2004-11-04 14:00:00
dtype: datetime64[ns]

schlunma commented 2 months ago

You want to plug this in @valeriupredoi? You can simply replace the existing l.496 with

rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)

and should work 😁

I can also do it if you want, already got a local copy with this change.

valeriupredoi commented 2 months ago

So why is the rounding done in the ICON cmorizer don't work then, and all the attempts I tried two days ago? You mind plopping the correct syntax in this PR, and get rid of the peasant rounding then plss, bud 🍺

valeriupredoi commented 2 months ago

I can also do it if you want, already got a local copy with this change.

That'd be fab, cheers, bud! :beer:

schlunma commented 2 months ago

In the pandas issue they say that only rounding the pd.Series objects fail, i.e.,

rounded_datetimes = datetimes.round('s')

This is what we currently have; hence, it fails.

However, rounding individual elements, i.e.,

rounded_datetimes = pd.Series(dt.round('s') for dt in datetimes)

works. Please don't ask me why...

valeriupredoi commented 2 months ago

that'd be why then - I completely missed the conversion to pd.Series and was trying to round each element of the array, doing exactly what was broken in Pandas :man_facepalming:

bouweandela commented 2 months ago

Did you time some representative test case @schlunma? Looping over every element of an array in Python and then doing something with it can be really slow.

schlunma commented 2 months ago

Did you time some representative test case @schlunma? Looping over every element of an array in Python and then doing something with it can be really slow.

No, but I don't think it's critical here. All ICON output I've seen so far is scattered across many files with only very few time points (sometimes only one time step is stored). Even the high-res output (20 mins) we have only has 72 time points per file.

However, I am open to any better solution. All solutions I could think of involved Python loops in one way or another.

valeriupredoi commented 2 months ago

we can always revert to the original implementation when Pandas finally fix their silly bug. Thanks a lot, Manu! @bouweandela would you be a trooper and merge this, pls :beer:

valeriupredoi commented 2 months ago

However, I am open to any better solution. All solutions I could think of involved Python loops in one way or another.

Well, my peasant approach didn't - it involved a list not an array :rofl:

schlunma commented 2 months ago

Well, my peasant approach didn't - it involved a list not an array 🤣

But you still had a for loop in there 😝

valeriupredoi commented 2 months ago

Well, my peasant approach didn't - it involved a list not an array 🤣

But you still had a for loop in there 😝

An even worse one too 🤣

bouweandela commented 2 months ago

I did some timings (with pandas 2.1.4) and for 100 points the time it takes is quite marginal, even if the new approach is more than 10 times slower:

In [0]: import pandas as pd
   ...: 
   ...: dates = pd.Series([pd.to_datetime("2022-01-11 12:10:01.123")] * 100)

In [1]: %timeit pd.Series(dt.round('s') for dt in dates)
906 μs ± 16 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [2]: %timeit dates.round("s")
67.3 μs ± 1.15 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

valeriupredoi commented 2 months ago

Many thanks, gents 🍺

ESMValGroup / ESMValCore