azmyrajab / polars_ols

Polars least squares extension - enables fast linear model polar expressions
MIT License

Get coefficients when using .over #2

Closed: braaannigan closed this issue 3 months ago

braaannigan commented 3 months ago

This is such a great tool!

I have one question; I'm not sure if it's a feature request or I've just failed to understand something.

I'd like to get the coefficients when doing regression by group (so I can apply them to new data).

Taking your example I'd like to do:

ols = pls.compute_least_squares(pl.col("y"),pl.col("x1"),mode="coefficients").over("group")
df.select(ols)

This works fine without over, and each row is a coefficient. With over, this would need to be a long dataframe with a column showing which group and which coefficient each value belongs to.

It looks like the output in this case would need to be a pl.Struct with group, variable name and value fields.

Any thoughts on this?

azmyrajab commented 3 months ago

Hi @braaannigan, thanks!

No, you did not misunderstand anything at all; what you describe is a perfectly valid feature request (and better default behaviour in general). Right now mode="predictions" will broadcast to any shape, but you are right that "coefficients" does not currently do that correctly.

I think the moving window models (RLS, Rolling) already broadcast their coefficients to any shape correctly, because of the way their coefficients are represented (as List[f32]).

If we change the default "coefficients" behaviour to return a list (or a struct, as you noted), the result should broadcast nicely too: .with_columns would repeat the coefficients across all rows to match the original data shape, and .select would produce a long frame per group, which is what your use case expects.

Let me think about the precise mechanics a little more, but I should definitely be able to do something that gets you what you want! Thanks for raising this.

azmyrajab commented 3 months ago

Hi @braaannigan

This should hopefully now be done! Released in v0.2 - major changes:

You will still need to join the coefficients struct column with the test data, but there is no need to "unnest" the coefficients struct, compute the products with the features, and sum them yourself (which is loosely equivalent to what it is doing in Rust).

Would it be possible for you to try upgrading and let me know if this works for you? The last section here might be a good template.

Thanks again for raising this

azmyrajab commented 3 months ago

Hi @braaannigan

The latest version does essentially the behaviour you describe (with both the struct creation and the subsequent out-of-sample prediction of a struct coefficients column onto test features done efficiently in Rust), so I am closing this as completed, pretty much in line with your comment (thanks!). However, please drop a line if anything doesn't work as expected and I'd be happy to take another look.

To recap, your example would become:

df = _make_data(n_groups=10)  # your train data
df_coef = df.select(
    "group",
    pls.compute_least_squares(pl.col("y"), pl.col("x1"), pl.col("x2"), mode="coefficients")
    .over("group")
    .alias("coefficients"),
)

>>> df_coef.head()
shape: (5, 2)
┌───────┬─────────────────────┐
│ group ┆ coefficients        │
│ ---   ┆ ---                 │
│ i64   ┆ struct[2]           │
╞═══════╪═════════════════════╡
│ 2     ┆ {0.997433,1.00141}  │
│ 7     ┆ {0.997332,0.99887}  │
│ 5     ┆ {1.000659,1.002979} │
│ 1     ┆ {0.998089,1.002442} │
│ 6     ┆ {0.999837,0.996519} │
└───────┴─────────────────────┘

There will be as many rows of coefficients as samples in your dataframe, as the expression now knows it has to broadcast because of the .over() context (and if you take .unique(), you'll get as many unique coefficient rows as there are groups in this particular example). It will also now broadcast correctly if called in a .with_columns() context, regardless of whether the expression has .over().

The field names of the struct are, as you expected, the names of the features the model was trained on - so you can "unnest" to get:

>>> df_coef.unnest("coefficients").head()
shape: (5, 3)
┌───────┬──────────┬──────────┐
│ group ┆ x1       ┆ x2       │
│ ---   ┆ ---      ┆ ---      │
│ i64   ┆ f32      ┆ f32      │
╞═══════╪══════════╪══════════╡
│ 2     ┆ 0.997433 ┆ 1.00141  │
│ 7     ┆ 0.997332 ┆ 0.99887  │
│ 5     ┆ 1.000659 ┆ 1.002979 │
│ 1     ┆ 0.998089 ┆ 1.002442 │
│ 6     ┆ 0.999837 ┆ 0.996519 │
└───────┴──────────┴──────────┘

To predict on some test data, you would simply need to align your train coefficients with the test data and run pls.predict(...) (or, equivalently, pl.col("coefficients").least_squares.predict(...)):

df_test = _make_data(n_groups=10)  # your test data
df_test = (
    df_test.join(df_coef.unique(), on="group")  # or join on whatever
    .with_columns(pls.predict(pl.col("coefficients"), pl.col("x1"), pl.col("x2"))
                  .alias("test_predictions"))
)
>>> df_test.head()
shape: (5, 6)
┌───────────┬───────────┬───────────┬───────┬─────────────────────┬──────────────────┐
│ y         ┆ x1        ┆ x2        ┆ group ┆ coefficients        ┆ test_predictions │
│ ---       ┆ ---       ┆ ---       ┆ ---   ┆ ---                 ┆ ---              │
│ f32       ┆ f32       ┆ f32       ┆ i64   ┆ struct[2]           ┆ f32              │
╞═══════════╪═══════════╪═══════════╪═══════╪═════════════════════╪══════════════════╡
│ 0.025985  ┆ 0.12573   ┆ -0.132105 ┆ 2     ┆ {0.997433,1.00141}  ┆ -0.006884        │
│ 0.857969  ┆ 0.640423  ┆ 0.1049    ┆ 7     ┆ {0.997332,0.99887}  ┆ 0.743496         │
│ -0.160373 ┆ -0.535669 ┆ 0.361595  ┆ 5     ┆ {1.000659,1.002979} ┆ -0.17335         │
│ 2.308727  ┆ 1.304     ┆ 0.947081  ┆ 1     ┆ {0.998089,1.002442} ┆ 2.250902         │
│ -1.944108 ┆ -0.703735 ┆ -1.265422 ┆ 6     ┆ {0.999837,0.996519} ┆ -1.964638        │
└───────────┴───────────┴───────────┴───────┴─────────────────────┴──────────────────┘

It should be pretty efficient, and several additional edge cases are now covered in the tests.

Finally, your train data may contain nulls, so in the latest version you can pass null_policy="zero", "drop", etc., and the library will handle them for you, according to your choice, prior to fitting.