Closed AdrienDart closed 7 months ago
Hi @AdrienDart, thanks for using the package and documenting this issue. I'm sorry, this is not intended behaviour!
Just checked why my test did not catch it: the issue lies in the transition from warm-up (within min_periods) to actual rolling. I think I'm off by one data point (one point is skipped at the transition from min_periods to min_periods + 1), which is why it does not show up for window sizes much larger than min_periods.
Thanks for catching this; it should be relatively quick to fix. I'll keep you posted.
df = _make_data()
with timer("rolling ols"):
    coef_rolling = (
        df.lazy()
        .select(
            pl.col("y")
            .least_squares.rolling_ols(
                pl.col("x1"),
                pl.col("x2"),
                mode="coefficients",
                window_size=252,
                min_periods=2,
            )
            .alias("coefficients")
        )
        .unnest("coefficients")
        .collect()
        .to_numpy()
    )
with timer("rolling ols statsmodels"):
    mdl = RollingOLS(
        df["y"].to_numpy(), df[["x1", "x2"]].to_numpy(), window=252, min_nobs=2, expanding=True
    ).fit()
assert np.allclose(coef_rolling[1:], mdl.params[1:].astype("float32"), rtol=1.0e-3, atol=1.0e-3)
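For reference, the intended semantics at the min_periods → window_size boundary can be written down naively. This is an editorial sketch, not the package's implementation: NaN before min_periods, an expanding fit until window_size observations are available, then a fixed window, with no row skipped or double counted at the boundary.

```python
# Naive reference rolling OLS (hypothetical helper, for checking
# semantics only): expanding until the window fills, then fixed-width.
import numpy as np

def naive_rolling_ols(y: np.ndarray, x: np.ndarray,
                      window_size: int, min_periods: int) -> np.ndarray:
    n, k = x.shape
    coefs = np.full((n, k), np.nan)
    for t in range(n):
        start = max(0, t + 1 - window_size)  # fixed window once full
        if t + 1 - start < min_periods:
            continue  # not enough observations yet -> stays NaN
        xw, yw = x[start : t + 1], y[start : t + 1]
        coefs[t] = np.linalg.lstsq(xw, yw, rcond=None)[0]
    return coefs
```

A correct implementation should agree with this reference for every (window_size, min_periods) pair, including the exact index where the expanding phase hands over to the rolling phase.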
And another question for you: are the same null_policies available for rolling_ols? I believe the default is not 'drop' for that one.
Hi! I'm sorry, rolling OLS does not yet support any clever handling of nulls (simply because I haven't got around to it yet). "ignore" currently does nothing at all (and should break if there are nulls), so the only working option is to zero data out, which I know is not great.
It was on my to-do list (now after I fix the little window-sizing bug from this issue) to implement the "drop" policies for RLS and rolling OLS. That will effectively skip over null rows and propagate the coefficients (forward-filling the previous state) in those situations; a window of, say, 50 will then span the last 50 valid samples, skipping over nulls [as if you dropped nulls, ran rolling OLS, then re-aligned to the original index with a forward fill]. I assume this is what you are after too?
What "ignore" will do after that change is: instead of breaking, it will produce nulls for past windows which contain nulls, whereas if the past window contains only valid observations it will produce a value.
That should make it behave similarly to statsmodels (their "skip" mapping to our "ignore"); their description is below:
Available options are "drop", "skip" and "raise". If "drop", any
observations with nans are dropped and the estimates are computed using
only the non-missing values in each window. If 'skip' blocks containing
missing values are skipped and the corresponding results contains NaN.
If 'raise', an error is raised. Default is 'drop'.
#[polars_expr(output_type=Float32)]
fn rolling_least_squares(inputs: &[Series], kwargs: RollingKwargs) -> PolarsResult<Series> {
    let null_policy = kwargs.get_null_policy();
    assert!(
        matches!(null_policy, NullPolicy::Ignore | NullPolicy::Zero),
        "null policies which drop rows are not yet supported for rolling least squares"
    );
    let (y, x) = convert_polars_to_ndarray(inputs, &null_policy, None);
    let coefficients = solve_rolling_ols(
        &y,
        &x,
        kwargs.window_size,
        kwargs.min_periods,
        kwargs.use_woodbury,
        kwargs.alpha,
    );
    let predictions = (&x * &coefficients).sum_axis(Axis(1));
    Ok(Series::from_vec(inputs[0].name(), predictions.to_vec()))
}
Hi @AdrienDart - this should now be resolved with https://github.com/azmyrajab/polars_ols/commit/567ab2d1176dbd3f965f41dec8366cf008290816 (specifically the changes to lines 532 and 573 of least_squares.rs; basically, the transition at window_size was unintentionally double counted before).
It should be fixed now and tested against statsmodels with all sorts of min_periods / window combinations; hopefully it matches perfectly now. Prior to min_periods it will produce NaNs (instead of zeros), so that behaviour matches there too.
I'll wait for the CI tests and release a new version shortly. These changes don't yet tackle the presence of nulls in rolling / recursive methods; I plan to tackle that next.
@pytest.mark.parametrize(
    "window_size,min_periods,use_woodbury",
    [
        (2, 2, False),
        (10, 2, False),
        (10, 2, True),
        (63, 5, False),
        (252, 5, False),
        (252, 5, True),
    ],
)
def test_rolling_least_squares(window_size: int, min_periods: int, use_woodbury: bool):
    df = _make_data(n_samples=10_000)
    with timer("\nrolling ols"):
        coef_rolling = (
            df.lazy()
            .select(
                pl.col("y")
                .least_squares.rolling_ols(
                    pl.col("x1"),
                    pl.col("x2"),
                    mode="coefficients",
                    window_size=window_size,
                    min_periods=min_periods,
                    use_woodbury=use_woodbury,
                )
                .alias("coefficients")
            )
            .unnest("coefficients")
            .collect()
            .to_numpy()
        )
    with timer("rolling ols statsmodels"):
        mdl = RollingOLS(
            df["y"].to_numpy(),
            df[["x1", "x2"]].to_numpy(),
            window=window_size,
            min_nobs=min_periods,
            expanding=True,
        ).fit()
    assert np.allclose(
        coef_rolling,
        mdl.params,
        rtol=1.0e-3,
        atol=1.0e-3,
        equal_nan=True,
    )
Should be resolved now.
Hi, thank you very much for looking into this! I get the expected values now! Regarding your comment on the null policies, that's exactly it! Please kindly let me know once you get the time to implement that :) and again, thank you for your good work!
Hi,
Thanks for your constant help!
I am currently trying the rolling regression and I get different results from polars_ds (and RollingOLS).
Example:
df = pl.DataFrame({'x': [0.1, 0.2, -0.1, 0.1], 'y': [-0.1, 0.2, 0.1, 0.1]})
mdl = RollingOLS(df['y'].to_numpy(), df[['x']].to_numpy(), window=2, min_nobs=2, expanding=True).fit()
mdl.params
df.select(col('y').least_squares.rolling_ols(col('x'), add_intercept=False, mode='coefficients', window_size=2, min_periods=2, null_policy='ignore'))
In my real use case, I have a window > 50, and still 1 feature.
Let me know if I'm missing something obvious.
Thanks!
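For this particular example the expected values can be checked by hand: with a single feature, no intercept, and a window of 2, the rolling coefficient is just the windowed sum of x*y divided by the windowed sum of x². An editorial numpy check on the data above:

```python
# Hand check for the 4-point example: single feature, no intercept,
# window_size=2 -> beta_t = sum(x*y over window) / sum(x^2 over window).
import numpy as np

x = np.array([0.1, 0.2, -0.1, 0.1])
y = np.array([-0.1, 0.2, 0.1, 0.1])

betas = [np.nan]  # fewer than min_periods=2 observations at t=0
for t in range(1, len(x)):
    xw, yw = x[t - 1 : t + 1], y[t - 1 : t + 1]
    betas.append((xw @ yw) / (xw @ xw))

print(betas)  # approximately [nan, 0.6, 0.6, 0.0]
```

Both libraries should agree with these values once the windowing is handled identically, so this gives a small ground truth for the discrepancy reported above.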