Closed deanm0000 closed 5 months ago
This is a good idea, and I have thought about it before. Returning a struct would likely improve the UX a lot. However, there is one problem.
The main issue I see is this: for regression in different segments (regression in a group_by context, in Polars jargon), I can return one output value, the coefficients, just like any other aggregation function. But if I also need to return residuals, that makes "regression in group_by" impossible, or at least uglier for the end user.
I could add an option to choose between residuals and coefficients and adjust the output accordingly, but then the same expression would return very different things, and I am not sure whether that is good practice.
What's your take?
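To make the shape mismatch concrete, here is a minimal sketch in plain NumPy (not the extension's actual implementation). The helper name fit_group and the synthetic data are assumptions made for illustration. Each segment yields a fixed-length coefficient vector, while the residuals have one entry per row of that segment.

```python
import numpy as np

def fit_group(X, y):
    """Ordinary least squares for one segment; returns (coefficients, residuals)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid

rng = np.random.default_rng(0)
group_sizes = {"a": 5, "b": 8}  # hypothetical segments with different row counts

results = {}
for name, n in group_sizes.items():
    # Design matrix with an intercept column and one random feature
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=n)
    results[name] = fit_group(X, y)

for name, (coef, resid) in results.items():
    # Coefficients: always one value per feature. Residuals: one value per row.
    print(name, coef.shape, resid.shape)
```

The coefficient output behaves like an ordinary aggregation (one fixed-size value per group), whereas the residual output is as long as the group itself, which is why the two do not fit the same return type comfortably.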
In terms of implementation I'm at a disadvantage because I don't know Rust. From a purely UX perspective, I'd think it would be a parameter, like pl.col('a').lstsq(..., return_resid=True).
If you set that to True inside a group_by.agg, then I'd think it would return a list (of lists or structs), similar to how group_by('z').agg(pl.col('a')) would return a as a list.
That said, while that's an ugly output type, I wouldn't expect people to do group_by('z').agg(pl.col('a').lstsq(..., return_resid=True)), as it doesn't make a ton of sense, but they could still do with_columns(pl.col('a').lstsq(..., return_resid=True).over('z')).
Alternatively, maybe it's better as its own function, like pl.col('a').lstsq_w_resids(...).
https://github.com/functime-org/functime/issues/176
Adding this here. It seems there is also interest in retrieving the p-value for each regression coefficient. I already have all the pieces for that (lstsq plus the t-distribution), so this might be doable, and it will be the next task I work on.
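For reference, the "lstsq + t-distribution" recipe can be sketched with NumPy and SciPy. This is the textbook OLS calculation under the usual assumptions, not necessarily how the extension implements it; the function name ols_with_pvalues and the synthetic data are assumptions.

```python
import numpy as np
from scipy import stats

def ols_with_pvalues(X, y):
    """Return OLS coefficients and two-sided p-values from the t-distribution."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = n - k                                # residual degrees of freedom
    sigma2 = resid @ resid / dof               # unbiased error variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)      # coefficient covariance matrix
    se = np.sqrt(np.diag(cov))                 # standard errors
    t_stat = coef / se
    pvals = 2.0 * stats.t.sf(np.abs(t_stat), dof)  # two-sided test of coef == 0
    return coef, pvals

# Synthetic data: intercept 1.0, slope 2.0, Gaussian noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

coef, pvals = ols_with_pvalues(X, y)
print(coef, pvals)
```

Explicitly inverting X.T @ X is fine for a small sketch like this; a production implementation would more likely reuse a matrix factorization from the least-squares solve.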
Hi @deanm0000 ,
I merged the update a little while ago. Please refer to the example on the main branch here: https://github.com/abstractqqq/polars_ds_extension/blob/main/examples/basics.ipynb
A new release will be coming soon.
Here's a wrapper for what I have in mind
This allows you to do
Perhaps add options to include residuals or fitted values in the return value.
Lastly, I don't know if this is practical, but including p-values would also be good, although I know it's getting cumbersome given the data format. As a struct output I'd probably make it something like struct_prefix+x.meta.output_name()+pval+struct_suffix. If it's a list output, then maybe the returns are 2-item arrays where item[0] is the coefficient and item[1] is the p-value. I'm just spitballing here, though.
I just looked at faer and it doesn't look like the p-values are easily attainable there, but I stumbled on them having polars_to_faerfxx functions, so just FYI in case that's a new feature of theirs.