Improve execution speed of `rdtools.degradation_classical_decomposition`

kandersolar commented 1 year ago

[x] Code changes are covered by tests
~[ ] Code changes have been evaluated for compatibility/integration with TrendAnalysis~
~[ ] New functions added to __init__.py~
~[ ] API.rst is up to date, along with other sphinx docs pages~
~[ ] Example notebooks are rerun and differences in results scrutinized~
[x] Updated changelog

`rdtools.degradation_classical_decomposition` is rather slow for large inputs (~10 seconds for a 6-year daily dataset). The runtime is dominated by two computational bottlenecks: the moving average calculation and the Mann-Kendall (M-K) trend test. The current implementations of both use Python loops and are straightforward to replace with vectorized pandas/numpy operations. Doing this speeds up the overall `rdtools.degradation_classical_decomposition` runtime by roughly two orders of magnitude.

The following table compares runtimes (values in seconds), along with their ratio, for various input lengths (number of years of daily values).

 years  v2.1.5    PR   ratio
     2   0.717 0.013  53.4
     3   2.560 0.015 169.8
     4   5.992 0.026 226.2
     5   7.445 0.041 182.2
     6  10.614 0.056 190.1
     7  14.840 0.080 184.6

Here is some code to verify that the new implementations produce output equivalent to the current implementations:

MK-test

```python
import numpy as np

for n in [10, 100, 1000]:
    # setup
    x = np.random.rand(n)

    # current method: explicit loop over all pairs (k, j) with j > k
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            s += np.sign(x[j] - x[k])

    # new method: outer difference, keeping only the upper triangle (j > k)
    s2 = np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))

    assert s == s2
    print(s, s2)
```
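To get a rough feel for the M-K part of the speedup in isolation, the two approaches can be wrapped in functions and timed. This is a minimal sketch, not the code in the PR: the function names `mk_s_loop` and `mk_s_vectorized`, the input length, and the seed are all illustrative choices, and timings will vary by machine.

```python
import time

import numpy as np


def mk_s_loop(x):
    """M-K S statistic via explicit pairwise loops (hypothetical helper name)."""
    n = len(x)
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            s += np.sign(x[j] - x[k])
    return s


def mk_s_vectorized(x):
    """Same statistic via an outer difference (hypothetical helper name).

    Builds intermediate n-by-n arrays, so memory use grows as n**2.
    """
    return np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))


rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
x = rng.random(500)             # arbitrary illustrative length

t0 = time.perf_counter()
s_loop = mk_s_loop(x)
t1 = time.perf_counter()
s_vec = mk_s_vectorized(x)
t2 = time.perf_counter()

assert s_loop == s_vec
print(f"loop: {t1 - t0:.4f} s, vectorized: {t2 - t1:.4f} s")
```

One trade-off worth keeping in mind: the vectorized form allocates n-by-n intermediates. For 7 years of daily values (n = 2555) each float64 intermediate is about 52 MB, which is likely fine for daily data but could matter for much longer inputs.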

Moving average

```python
import numpy as np
import pandas as pd

for nyears in [2, 3, 4]:
    # setup
    times = pd.date_range('2000-01-01', freq='d', periods=nyears*365)
    noise = np.random.normal(0, 0.1, len(times))
    df = pd.DataFrame({
        'energy_normalized': 1 + noise,
    }, index=times)
    day_diffs = (df.index - df.index[0])
    df['days'] = day_diffs / pd.Timedelta('1d')
    df['years'] = df.days / 365.0

    # current method: row-by-row loop over a +/- half-year window
    it = df.iterrows()
    energy_ma = []
    for i, row in it:
        if row.years - 0.5 >= min(df.years) and \
                row.years + 0.5 <= max(df.years):
            roll = df[(df.years <= row.years + 0.5) &
                      (df.years >= row.years - 0.5)]
            energy_ma.append(roll.energy_normalized.mean())
        else:
            energy_ma.append(np.nan)
    df['energy_ma_loop'] = energy_ma

    # new method: centered time-based rolling mean, then mask the edges
    energy_ma = df['energy_normalized'].rolling('365d', center=True).mean()
    # .iloc avoids deprecated positional lookup with [0] / [-1]
    has_full_year = (
        (df['years'] > df['years'].iloc[0] + 0.5)
        & (df['years'] < df['years'].iloc[-1] - 0.5)
    )
    energy_ma[~has_full_year] = np.nan
    df['energy_ma_pandas'] = energy_ma

    pd.testing.assert_series_equal(df['energy_ma_loop'],
                                   df['energy_ma_pandas'],
                                   check_names=False)
```

kandersolar commented 1 year ago

requirements-min is failing. It looks like the necessary pandas functionality was only added in pandas v1.3, released July 2, 2021. Is it okay to bump the minimum version to 1.3? If that's too recent (not quite two years), I could revert the moving average calculation improvement and just keep the M-K test, which would still be a nice runtime improvement.
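For illustration, here is one way the dependency could be made explicit at call time. This is a sketch under assumptions, not something this PR does: `pandas_at_least` and `centered_annual_mean` are hypothetical helpers, and the version threshold is taken from the observation above that the centered time-based window needs pandas 1.3.

```python
import numpy as np
import pandas as pd


def pandas_at_least(major, minor):
    """Return True if the installed pandas is >= major.minor (hypothetical helper)."""
    parts = pd.__version__.split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)


def centered_annual_mean(series):
    """Centered 365-day rolling mean of a datetime-indexed series (hypothetical helper).

    Assumes, per the discussion above, that the centered time-based
    window requires pandas >= 1.3.
    """
    if not pandas_at_least(1, 3):
        raise RuntimeError(
            "centered time-based rolling windows need pandas >= 1.3; "
            f"found {pd.__version__}"
        )
    return series.rolling("365d", center=True).mean()


times = pd.date_range("2000-01-01", freq="d", periods=3 * 365)
flat = pd.Series(np.ones(len(times)), index=times)
smoothed = centered_annual_mean(flat)
print(smoothed.head())
```

Bumping the minimum version in the requirements files, as proposed here, avoids needing any such runtime check.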

Also, I took the liberty of making a 2.1.6 whatsnew file for this. Happy to change to whatever the release plan is, or feel free to just push changes yourself :)

mdeceglie commented 1 year ago

I think it makes sense to update the minimum pandas version to 1.3. Looks like #373 needs a more recent minimum version as well.

kandersolar commented 1 year ago

> I think it makes sense to update the minimum pandas version to 1.3

Done. As is often the case with increasing minimum versions, it required increasing some others as well.

NREL / rdtools

Improve execution speed of `rdtools.degradation_classical_decomposition` #371