Closed kandersolar closed 1 year ago
`requirements-min` is failing. It looks like the necessary pandas functionality was only added in pandas v1.3, released July 2, 2021. Is it okay to bump the minimum version to 1.3? If that's too recent (not quite two years), I could revert the moving average calculation improvement and keep only the M-K test change, which would still be a nice runtime improvement.
Also, I took the liberty of making a 2.1.6 whatsnew file for this. Happy to change to whatever the release plan is, or feel free to just push changes yourself :)
I think it makes sense to update the minimum pandas version to 1.3. Looks like #373 needs a more recent minimum version as well.
> I think it makes sense to update the minimum pandas version to 1.3
Done. As is often the case with increasing minimum versions, it required increasing some others as well.
__init__.py
`rdtools.degradation_classical_decomposition` is rather slow for large inputs (~10 seconds for a 6-year daily dataset). The large runtime is caused by two computational bottlenecks: the moving average calculation and the M-K trend test. The current implementations of these calculations use Python loops and are straightforward to replace with vectorized pandas/numpy operations. Doing this speeds up the overall `rdtools.degradation_classical_decomposition` runtime by a couple of orders of magnitude.

The following table compares runtimes (values in seconds), along with their ratio, for various input lengths (number of years of daily values).
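Timings of this kind could be measured with a small harness like the sketch below, shown here for the M-K statistic only. The function names `mk_s_loop`, `mk_s_vectorized`, and `time_call` are hypothetical helpers introduced for illustration and are not part of rdtools; the input sizes are arbitrary, and the actual numbers will depend on the machine.

```python
import time

import numpy as np


def mk_s_loop(x):
    """M-K S statistic via nested Python loops (hypothetical helper)."""
    n = len(x)
    s = 0.0
    for k in range(n - 1):
        for j in range(k + 1, n):
            s += np.sign(x[j] - x[k])
    return s


def mk_s_vectorized(x):
    """Same statistic from the strict upper triangle of pairwise differences."""
    # np.subtract.outer(x, x)[k, j] is x[k] - x[j]; negate to get x[j] - x[k]
    diffs = -np.subtract.outer(x, x)
    return np.sum(np.triu(np.sign(diffs), 1))


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


rng = np.random.default_rng(0)
for n in [100, 300, 600]:
    x = rng.random(n)
    s_loop, t_loop = time_call(mk_s_loop, x)
    s_vec, t_vec = time_call(mk_s_vectorized, x)
    assert s_loop == s_vec  # both approaches must agree exactly
    print(f"n={n}: loop {t_loop:.4f} s, vectorized {t_vec:.4f} s")
```

Note that the vectorized form builds an n-by-n array, so it trades memory for speed; for multi-year daily data that array stays small.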
Here is some code to verify that the new implementations produce output equivalent to the current implementations:
MK-test
```python
import numpy as np

for n in [10, 100, 1000]:
    # setup
    x = np.random.rand(n)

    # current method: nested loops over all pairs (k, j) with k < j
    s = 0
    for k in range(n - 1):
        for j in range(k + 1, n):
            s += np.sign(x[j] - x[k])

    # new method: signs of all pairwise differences, strict upper triangle
    s2 = np.sum(np.triu(np.sign(-np.subtract.outer(x, x)), 1))
    assert s == s2
    print(s, s2)
```

Moving average

```python
import numpy as np
import pandas as pd

for nyears in [2, 3, 4]:
    # setup
    times = pd.date_range('2000-01-01', freq='D', periods=nyears*365)
    noise = np.random.normal(0, 0.1, len(times))
    df = pd.DataFrame({
        'energy_normalized': 1 + noise,
    }, index=times)
    day_diffs = (df.index - df.index[0])
    df['days'] = day_diffs / pd.Timedelta('1d')
    df['years'] = df.days / 365.0

    # current method: row-by-row loop over a +/- half-year window
    it = df.iterrows()
    energy_ma = []
    for i, row in it:
        if row.years - 0.5 >= min(df.years) and \
                row.years + 0.5 <= max(df.years):
            roll = df[(df.years <= row.years + 0.5) &
                      (df.years >= row.years - 0.5)]
            energy_ma.append(roll.energy_normalized.mean())
        else:
            energy_ma.append(np.nan)

    df['energy_ma_loop'] = energy_ma

    # new method: centered time-based rolling window (needs pandas >= 1.3)
    energy_ma = df['energy_normalized'].rolling('365D', center=True).mean()
    # use .iloc for positional access; plain [0] / [-1] on a
    # datetime-indexed Series is deprecated in recent pandas
    has_full_year = (df['years'] > df['years'].iloc[0] + 0.5) & \
        (df['years'] < df['years'].iloc[-1] - 0.5)
    energy_ma[~has_full_year] = np.nan
    df['energy_ma_pandas'] = energy_ma

    pd.testing.assert_series_equal(df['energy_ma_loop'],
                                   df['energy_ma_pandas'],
                                   check_names=False)
```