Closed braaannigan closed 3 months ago
Hi @braaannigan, thanks!
No, you did not misunderstand anything at all, what you describe is a perfectly valid feature request (and better default behaviour in general). Right now mode=predictions will broadcast to any shape but you are right in “coefficients” not doing that correctly right now.
I think the moving window models (RLS, Rolling) will have their coefficients broadcast to any shape correctly right now because of the way their coefficients are represented (as List[f32]).
If we change the default “coefficients” behaviour to return a list (or a struct as you noted) the result should broadcast nicely too. (.with_columns would repeat the coefficients across all rows to match original data shape and .select will produce a long frame per group as you expected it should be for your usecase).
let me think about precise mechanics a little more but should definitely be able to do something that will get you what you want! Thanks for raising this.
Hi @braaannigan
This should hopefully be now done! Released to v0.2 - major changes:
mode="coefficients"
always returns polars struct type consistently, which made it easier to support broadcasting to different output shapes
coefficients now know how to resize if needed. For e.g. in a .group_by
, .over
, or .with_columns
context it broadcasts to the shape the data it applies to whereas if called in a .select
with no other columns it will return a single row dataframe. (the returns_scalar=True
argument is used to accomplish this in register_plugin_function)
Added a least_squares.predict
helper function which facilitates out-of-sample prediction. It does so in rust (using ndarray) and as an expression that can be lazy/chained:
df_test.select(pl.col("coefficients_train").least_squares.predict(pl.col("x1"), pl.col("x2")).alias("predictions_test"))
You will still need to join the coefficients struct column with the test data, but no need to "unnest" the coefficients struct and compute products w/ features and sum yourself (which is loosely equivalent to what it is doing in rust).
Would it be possible for you to try upgrading and letting me know if this works for you? The last section here might be a good template
Thanks again for raising this
Hi @braaannigan
The latest version basically does exactly the behaviour you describe (with both the struct creation and subsequent out-of-sample prediction of a struct coefficients column onto test features done efficiently in rust); so closing as completed pretty much in line with your comment (thanks!). However, please drop a line if anything doesn't work as expected and happy to take another look
To recap, your example would become:
df = _make_data(n_groups=10) # your train data
df_coef = df.select(
"group",
pls.compute_least_squares(pl.col("y"), pl.col("x1"), pl.col("x2"), mode="coefficients")
.over("group")
.alias("coefficients")
)
>>> df_coef.head()
shape: (5, 2)
┌───────┬─────────────────────┐
│ group ┆ coefficients │
│ --- ┆ --- │
│ i64 ┆ struct[2] │
╞═══════╪═════════════════════╡
│ 2 ┆ {0.997433,1.00141} │
│ 7 ┆ {0.997332,0.99887} │
│ 5 ┆ {1.000659,1.002979} │
│ 1 ┆ {0.998089,1.002442} │
│ 6 ┆ {0.999837,0.996519} │
└───────┴─────────────────────┘
There will be as many rows of coefficients as samples in your dataframe as expression now knows it has to broadcast because of the .over()
context (and if you take .unique(), you'll get as many unique coefficient rows as groups in this particular example). It will also now broadcast correctly as well if called in a .with_columns()
context regardless of if expression has .over()
.
The field names of the struct are, as you expect, the names of the features model was trained on - so you can "unnest" to get:
>>> df_coef.unnest("coefficients").head()
shape: (5, 3)
┌───────┬──────────┬──────────┐
│ group ┆ x1 ┆ x2 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f32 ┆ f32 │
╞═══════╪══════════╪══════════╡
│ 2 ┆ 0.997433 ┆ 1.00141 │
│ 7 ┆ 0.997332 ┆ 0.99887 │
│ 5 ┆ 1.000659 ┆ 1.002979 │
│ 1 ┆ 0.998089 ┆ 1.002442 │
│ 6 ┆ 0.999837 ┆ 0.996519 │
└───────┴──────────┴──────────┘
To predict on some test data, you would simply need to align your train coefficients with test data and run pls.predict(...)
(or equivalentlypl.col("coefficients").least_squares.predict(...)
):
df_test = _make_data(n_groups=10) # your test data
df_test = (
df_test.join(df_coef.unique(), on="group") # or join on whatever
.with_columns(pls.predict(pl.col("coefficients"), pl.col("x1"), pl.col("x2"))
.alias("test_predictions"))
)
>>> df_test.head()
shape: (5, 6)
┌───────────┬───────────┬───────────┬───────┬─────────────────────┬──────────────────┐
│ y ┆ x1 ┆ x2 ┆ group ┆ coefficients ┆ test_predictions │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f32 ┆ f32 ┆ f32 ┆ i64 ┆ struct[2] ┆ f32 │
╞═══════════╪═══════════╪═══════════╪═══════╪═════════════════════╪══════════════════╡
│ 0.025985 ┆ 0.12573 ┆ -0.132105 ┆ 2 ┆ {0.997433,1.00141} ┆ -0.006884 │
│ 0.857969 ┆ 0.640423 ┆ 0.1049 ┆ 7 ┆ {0.997332,0.99887} ┆ 0.743496 │
│ -0.160373 ┆ -0.535669 ┆ 0.361595 ┆ 5 ┆ {1.000659,1.002979} ┆ -0.17335 │
│ 2.308727 ┆ 1.304 ┆ 0.947081 ┆ 1 ┆ {0.998089,1.002442} ┆ 2.250902 │
│ -1.944108 ┆ -0.703735 ┆ -1.265422 ┆ 6 ┆ {0.999837,0.996519} ┆ -1.964638 │
└───────────┴───────────┴───────────┴───────┴─────────────────────┴──────────────────┘
It should be pretty efficient and several additional / edge cases are now in the tests
Finally, your train data may have nulls in it, so in the latest version, you can pass null_policy="zero", "drop", etc. and the library will handle them for you based on your choice prior to fitting.
This is such a great tool!
I have one question, not sure if its a feature request or I've just failed to understand something.
I'd like to get the coefficients when doing regression by group (so I can apply them to new data).
Taking your example I'd like to do:
This works find without
over
and each row is a coefficient. Withover
this would need to be a long dataframe with a column showing which group and which coefficient it belongs to.It looks like the output in this case would need to be a
pl.Struct
with group, variable name and value fields.Any thoughts on this?