abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License
266 stars, 18 forks

Add extra options for returning a struct and accepting named_vars for lstsq #47

Closed: deanm0000 closed this issue 5 months ago

deanm0000 commented 6 months ago

Here's a wrapper that shows what I have in mind:

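import polars as pl
import polars_ds  # noqa: F401  (assumption: this import registers the `num` namespace used below)

def mylstsq(self, *vars, add_bias=False, struct_prefix=None, struct_suffix=None, **named_vars):
    # Positional predictors first, then keyword predictors aliased to their keyword name.
    vars = list(vars)
    for key, expr in named_vars.items():
        vars.append(expr.alias(key))
    out_names = None
    # Only build struct field names when a prefix or suffix was requested.
    if struct_prefix is not None or struct_suffix is not None:
        struct_prefix = struct_prefix or ""
        struct_suffix = struct_suffix or ""
        out_names = [struct_prefix + x.meta.output_name() + struct_suffix for x in vars]
        if add_bias:
            # The bias coefficient comes last, after the predictors.
            out_names.append(struct_prefix + "constant" + struct_suffix)
    func_ret = self.num.lstsq(*vars, add_bias=add_bias)
    if out_names is None:
        return func_ret
    return func_ret.list.to_struct(fields=out_names)

# Attach the wrapper to pl.Expr so it can be called as pl.col('y').mylstsq(...).
pl.Expr.mylstsq = mylstsq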

This allows you to do:

df.select(reg=pl.col('y').mylstsq(pl.col('x1'), pl.col('x2'), add_bias=True, struct_prefix="reg_")).unnest('reg')

Perhaps also add options to include the residuals or fitted values in the return value.
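
To make "residuals" and "fitted values" concrete, here is a minimal sketch that uses NumPy rather than the extension itself. The data and variable names are made up for illustration. The coefficients form one short vector, while the fitted values and residuals have one entry per input row.

```python
import numpy as np

# Made-up data: y = 2*x1 - 1*x2 + 3 exactly, so the residuals should be ~0.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2.0 * x1 - 1.0 * x2 + 3.0

# Design matrix with a trailing column of ones, mirroring add_bias=True
# (the wrapper above also puts the "constant" term last).
X = np.column_stack([x1, x2, np.ones_like(x1)])

coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coeffs       # one value per row
residuals = y - fitted    # one value per row
```

The difference in length between `coeffs` (one per term) and `residuals` (one per row) is what makes returning both from a single expression awkward.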

Lastly, I don't know if this is practical, but including p-values would also be good, although I know it gets cumbersome given the data format. For a struct output, I'd probably name the fields something like struct_prefix+x.meta.output_name()+pval+struct_suffix. For a list output, maybe the return is a list of 2-item arrays where item[0] is the coefficient and item[1] is the p-value. I'm just spitballing here, though.
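
As a rough sketch of that struct naming scheme in plain Python: the helper name `pval_field_names` and the `"_pval"` tag are hypothetical, chosen only for illustration, and nothing here is part of the extension.

```python
def pval_field_names(var_names, add_bias=False, prefix="", suffix="", pval_tag="_pval"):
    """Hypothetical helper: struct field names holding a coefficient and a p-value per term."""
    terms = list(var_names) + (["constant"] if add_bias else [])
    names = []
    for term in terms:
        names.append(prefix + term + suffix)             # coefficient field
        names.append(prefix + term + pval_tag + suffix)  # matching p-value field
    return names


names = pval_field_names(["x1", "x2"], add_bias=True, prefix="reg_")
# ['reg_x1', 'reg_x1_pval', 'reg_x2', 'reg_x2_pval', 'reg_constant', 'reg_constant_pval']
```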

I just looked at faer, and it doesn't look like p-values are easily attainable there. I did stumble on it having polars_to_faerfxx functions, though, so just FYI in case that's a new feature of theirs.

abstractqqq commented 6 months ago

This is good, and I have thought about it before. Returning a struct would improve the UX a lot. However, I see one main issue.

For regression over different segments (regression in a group_by context, in Polars jargon), I can return one output value per group, the coefficients, just like any other aggregation function. But if I also need to return residuals, which have one value per row, "regression in group_by" becomes impossible, or at least uglier for the end user.

I could add an option to choose between residuals and coefficients and adjust the output accordingly, but then the same expression would return very different things, and I am not sure whether that is good practice.

What's your take?

deanm0000 commented 6 months ago

In terms of implementation I'm at a disadvantage because I don't know Rust. From a purely UX perspective, I'd make it a parameter, like pl.col('a').lstsq(..., return_resid=True). If you set that to True inside a group_by.agg, I'd expect it to return a list (of lists or structs), similar to how group_by('z').agg(pl.col('a')) returns a as a list.

That said, while that's an ugly output type, I wouldn't expect people to call group_by('z').agg(pl.col('a').lstsq(..., return_resid=True)), since it doesn't make a ton of sense. They could still do with_columns(pl.col('a').lstsq(..., return_resid=True).over('z')).

Alternatively, maybe it's better as its own function, like pl.col('a').lstsq_w_resids(...).

abstractqqq commented 5 months ago

https://github.com/functime-org/functime/issues/176

Adding this here. It seems there is also interest in retrieving the p-value for each regression coefficient. I already have all the pieces for that: lstsq plus the t-distribution. This looks doable and will be the next task I work on.
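
For reference, a minimal sketch of how least squares and the t-distribution combine into per-coefficient p-values. This is the textbook ordinary least squares computation written with NumPy and SciPy on simulated data, and it is not necessarily how the extension implements it.

```python
import numpy as np
from scipy import stats

# Simulated data: y depends on x1 and an intercept; x2 is unrelated by construction.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + 1.0 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2, np.ones(n)])  # bias column last
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ coeffs
dof = n - X.shape[1]                    # residual degrees of freedom
sigma2 = resid @ resid / dof            # unbiased estimate of the noise variance
cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance of the coefficient estimates
std_err = np.sqrt(np.diag(cov))

t_stat = coeffs / std_err
p_values = 2.0 * stats.t.sf(np.abs(t_stat), dof)  # two-sided p-values
```

Each entry of `p_values` lines up with the coefficient at the same position, so the pair can be packed into either a struct or a list of 2-item arrays.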

abstractqqq commented 5 months ago

Hi @deanm0000,

I just merged the update a little while ago. Please refer to the example on the main branch here: https://github.com/abstractqqq/polars_ds_extension/blob/main/examples/basics.ipynb

A new release will be coming soon.