abstractqqq / polars_ds_extension

Polars extension for general data science use cases
MIT License

polars.exceptions.ComputeError: the name: 'resid' passed to `LazyFrame.with_columns` is duplicate #93

Closed wukan1986 closed 6 months ago

wukan1986 commented 7 months ago

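`num.lstsq` does not support regular-expression column selectors such as `pl.col('^B_.*$')`. Listing the columns explicitly works, but the regex form raises a duplicate-name error: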

import polars as pl
import polars_ds as pld  # noqa
from pandas._testing import makeTimeDataFrame

df = makeTimeDataFrame()
df = df.rename(columns={'B': 'B_1', 'C': 'B_2', 'D': 'B_3', })
df = pl.from_pandas(df, include_index=True)

x = df.with_columns(pl.col('A').num.lstsq(pl.col('B_1'), pl.col('B_2'), pl.col('B_3'), return_pred=True).struct.field('resid'))
print(x)
"""
shape: (30, 6)
┌─────────────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ None                ┆ A         ┆ B_1       ┆ B_2       ┆ B_3       ┆ resid     │
│ ---                 ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ datetime[ns]        ┆ f64       ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
╞═════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 2000-01-03 00:00:00 ┆ -0.109302 ┆ 0.8222    ┆ 0.464218  ┆ -0.083627 ┆ -0.046888 │
│ 2000-01-04 00:00:00 ┆ 0.560893  ┆ 0.173103  ┆ -0.471642 ┆ -0.467092 ┆ 0.564656  │
│ 2000-01-05 00:00:00 ┆ 0.450257  ┆ 1.07729   ┆ 0.180542  ┆ -1.416637 ┆ 0.723666  │

......
"""
x = df.with_columns(pl.col('A').num.lstsq(pl.col('^B_.*$'), return_pred=True).struct.field('resid'))
print(x)
r"""
    x = df.with_columns(pl.col('A').num.lstsq(pl.col('^B_.*$'), return_pred=True).struct.field('resid'))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\Kan\miniconda3\envs\py311_1\Lib\site-packages\polars\dataframe\frame.py", line 8290, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Users\Kan\miniconda3\envs\py311_1\Lib\site-packages\polars\lazyframe\frame.py", line 1937, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
polars.exceptions.ComputeError: the name: 'resid' passed to `LazyFrame.with_columns` is duplicate

It's possible that multiple expressions are returning the same default column name. If this is the case, try renaming the columns with `.alias("new_name")` to avoid duplicate column names.

Error originated just after this operation:
DF ["None", "A", "B_1", "B_2"]; PROJECT */5 COLUMNS; SELECTION: "None"
"""
abstractqqq commented 7 months ago

Thank you for the issue. I am looking into this, but I'm not sure yet why it happens.

A quick fix is:

df.with_columns( pl.col('A').num.lstsq(*[pl.col(c) for c in df.columns if c.startswith("B_")], return_pred=True).struct.field('resid') )

You can of course wrap the list comprehension in a function to make this part shorter, and use a regex there. I know it would be better to work with Polars alone, though.
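A minimal sketch of such a wrapper, using only the standard library. The helper name `columns_matching` is hypothetical and not part of polars_ds; it only picks the column names, which can then be turned into `pl.col(...)` expressions as in the quick fix above.

```python
import re


def columns_matching(columns, pattern):
    """Return the column names that match a regular expression.

    Hypothetical helper for illustration; not part of polars or polars_ds.
    """
    rx = re.compile(pattern)
    return [c for c in columns if rx.search(c)]


# Column names taken from the frame in the report.
cols = columns_matching(["None", "A", "B_1", "B_2", "B_3"], r"^B_.*$")
print(cols)
```

With the frame from the report, the result could be passed on as `pl.col('A').num.lstsq(*[pl.col(c) for c in columns_matching(df.columns, r'^B_.*$')], return_pred=True)`.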

deanm0000 commented 7 months ago

It seems like

df.with_columns(pl.col('A').num.lstsq(pl.col('^B_.*$'), return_pred=True).struct.field('resid'))

is internally being dispatched as

df.with_columns(
    pl.col('A').num.lstsq(pl.col(x), return_pred=True).struct.field('resid')
    for x in df.columns if x.startswith("B_")
)

In other words, it's trying to regress A on B_1, then A on B_2, and so on. Each of those expressions is named 'resid', which would explain the duplicate-name error.
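The suspected expansion can be mimicked in plain Python. This is only a simulation of the behaviour described here, not Polars internals; `expand_regex_names` is a hypothetical name. It shows why one regex selector producing one expression per matching column leads to repeated output names.

```python
import re
from collections import Counter


def expand_regex_names(columns, pattern, output_name):
    """Simulate a one-expression-per-matching-column expansion.

    Returns (input_column, output_name) pairs. Illustration only;
    this does not call into Polars.
    """
    rx = re.compile(pattern)
    return [(c, output_name) for c in columns if rx.search(c)]


columns = ["A", "B_1", "B_2", "B_3"]
expanded = expand_regex_names(columns, r"^B_.*$", "resid")

# Every expanded expression ends in .struct.field('resid'),
# so they all carry the same output name.
counts = Counter(name for _, name in expanded)
duplicates = [name for name, n in counts.items() if n > 1]

print(expanded)
print(duplicates)
```

Under this reading, three separate regressions are planned and all three try to write a column called 'resid'.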

There are two ways you might fix it.

  1. Make your function look more like concat_list, which takes a Vec instead of a list of &Series. I'm not sure how hard this would be in practice.
  2. On the Python side, pass args=[pl.struct(variables), pl.lit(add_bias, dtype=pl.Boolean)]. On the Rust side, since you turn the X variables into a new DataFrame anyway, I think you can build it directly from the struct:
let struct_col = &inputs[1].rename("Xvars");
let df_x = DataFrame::new(vec![struct_col.into()]).unwrap().unnest(vec!["Xvars"]).unwrap();

I think element 0 of inputs is the y variable, element 1 would be the struct, and element 2 would be the bool.

wukan1986 commented 7 months ago

FYI.

https://github.com/pola-rs/polars/issues/14858#issuecomment-1979061693