casact / chainladder-python

Actuarial reserving in Python
https://chainladder-python.readthedocs.io/en/latest/
Mozilla Public License 2.0
186 stars 71 forks source link

[ENH] Support polars via Python dataframe interchange protocol #474

Closed lorentzenchr closed 9 months ago

lorentzenchr commented 10 months ago

Is your feature request related to a problem? Please describe. I would like to preprocess my data in other data containers than pandas, e.g. pyarrow and polars, and then apply Triangle to it.

Is your feature request at odds with the scope of the package?

import chainladder as cl
import pandas as pd
import polars as pl

raa_df = pd.read_csv(
    "https://raw.githubusercontent.com/casact/chainladder-python/master/chainladder/utils/data/raa.csv"
)

# This works just fine
raa = cl.Triangle(
    raa_df,
    origin="origin",
    development="development",
    columns="values",
    cumulative=True,
)

# This gives an error
raa_pl = pl.read_csv(
    "https://raw.githubusercontent.com/casact/chainladder-python/master/chainladder/utils/data/raa.csv"
)

cl.Triangle(
    raa_pl,
    origin="origin",
    development="development",
    columns="values",
    cumulative=True,
)

results in

AttributeError: 'DataFrame' object has no attribute 'iloc'

Describe the solution you'd like Supporting data via the Python dataframe interchange protocol might be an optimal approach.

Describe alternatives you've considered One could also consider polars specific code path. But this might result in more maintenance burden: Where to stop? Add pyarrow, too?

Additional context A similar feature request is https://github.com/scikit-learn/scikit-learn/issues/25896 which is currently worked on.

johalnes commented 9 months ago

Do you think something like this in Plotly, just to get a zero-copy to pandas @lorentzenchr?

I see you have an Polars rewrite as a branch @jbogaardt - is converting to polars what you consider the future of the package? πŸ˜„

jbogaardt commented 9 months ago

For some reason i didnt see this issue when @lorentzenchr created it. It is an worthwhile enhacement since thats the direction other tools seem to be going.

@johalnes, I did start the pl_tri branch to experiment with polars over a numpy/sparse backend. It is much simpler code, fewer dependencies is generally faster than the main branch, and can possibly be extended to other languages where the polars API has been implemented (R, node.js). There are some computations that are slower though. I think where arrays make more sense, polars might be slower that the current implementation, but for data manipulation it is much faster. Overall, I've been impressed with the speed + simplification polars brings. If I can get it to a point where there is minimal impact to the end user API and performance, then it probably is the right move.

johalnes commented 9 months ago

Totally agree with you @jbogaardt! I do think the code get easier and more elegant with Polars. But quite a lot of work it would seem, even if it looks like you already have done a lot! Added an attempt at the interchange protocol, and luckily the pandas dev team have done most of the work for us 😊

Is there any more fundament changes you have considered? Since the package still isn't at 1.0, you could do some breaking changes and no one can arrest you for doing itπŸ˜‰ Given more performant, future proof and readable code I would think most users would be quite happy!

@lorentzenchr - give it a try if you have the time!

lorentzenchr commented 9 months ago

To be clear, my proposal is just to support dataframes that support the dataframe interchange protocol in cl.Triangle. Replacing pandas as the internal computation backend is a totally different story that is better discussed separately.

johalnes commented 9 months ago

I know @lorentzenchr, sorry for going off topic. Just got exited!

I think the pull request merged closes this issue? Or have I missed something?

lorentzenchr commented 9 months ago

Yes, we can close. Thanks for the fast PR.