MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0
22 stars 17 forks source link

replace statsmodels' `AutoReg` with OLS? #483

Open mathause opened 1 month ago

mathause commented 1 month ago

Estimating auto regression params is just an ordinary least regression in disguise (AFAIK). At least as long we don't do any fancy stuff and restrict ourselves to the standard AR(p) processes (no seasonal terms, estimating all the terms, ...). Switching from AutoReg to a linear regression solver can speed up the estimation (or more likely the data preparation) considerably (more than 10x). Of course we have to double and triple check the results. (Also I am a bit afraid that, AutoReg checks for some edge chases we don't know about and that we will need to bells and whistles as soon as we rewrite it, but the speed gain could be worth it).

See similarly #290 and https://github.com/MESMER-group/mesmer/issues/472#issuecomment-2211454603

import numpy as np
import scipy as sp
from sklearn.linear_model import LinearRegression
from statsmodels.regression.linear_model import OLS
from statsmodels.tsa.ar_model import AutoReg

arr = np.random.randn(100)

for i in range(1, arr.size):
    arr[i] = arr[i] + 0.7 * arr[i -1]

def lin_func(x, a, b):
    return a * x + b

%timeit AutoReg(arr, lags=1).fit()
%timeit LinearRegression(fit_intercept=True).fit(X=arr[:-1].reshape(-1, 1), y=arr[1:])
%timeit sp.optimize.curve_fit(lin_func, arr[:-1], arr[1:])[0]
%timeit OLS(arr[1:], arr[:-1]).fit()
%timeit sp.stats.linregress(arr[:-1], arr[1:])
1.27 ms ± 66.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
391 μs ± 14.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
212 μs ± 4.86 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
172 μs ± 850 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
79 μs ± 3.6 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Also, for lag > 1:

X = np.lib.stride_tricks.sliding_window_view(arr, 2)[:, ::-1][:, 1:]

edit: added OLS

veni-vidi-vici-dormivi commented 1 month ago

Note also statsmodels OLS is in between curve_fitand linregress:

from statsmodels.regression.linear_model import OLS
%timeit OLS(arr[1:], arr[:-1]).fit()
51.3 µs ± 266 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
veni-vidi-vici-dormivi commented 1 month ago

note to self: can also be done for cycle-stationary AR process.

mathause commented 1 month ago

Thanks - I added OLS above (interesting how your laptop is about 3-4 times faster; but it's also newer).