[FEAT] Add polars support

elephaint commented 3 weeks ago

Adds support for Polars via Narwhals.

The extension to all frameworks supported under Narwhals should hereafter be (nearly) trivial.

Breaking

PR removes support for NIXTLA_ID_AS_COL, i.e., hierarchicalforecast from now on will return everything as a flat table, without an index in case of a Pandas DataFrame.

Open items

[x] Convert core.ipynb
[x] Convert evaluation.ipynb
[x] Convert methods.ipynb
[x] Convert probabilistic_methods.ipynb
[x] Convert utils.ipynb
[x] Add tutorial on using with Polars
[x] Re-run existing examples in the docs to check correct workings & adapt for deprecation of unique_id as index
[x] PR shouldn't be breaking

Fixes / better solutions required

[x] Fix performance regression in aggregate function
[x] Fix .agg function not allowing multiple aggregations of the same column (needed for exogenous)
[x] Fix how to deal with unique_id being index everywhere but not for Polars frames
[x] Fix plotting in utils - it now just converts Polars to Pandas when plotting, that isn't nice
[x] Fix evaluation not aggregating correctly

Tests

[x] Add equivalence tests to unit tests for Polars
[x] Inspect performance regression from adding Polars support

Performance evaluation

Reconcilers seem to run somewhat faster after this PR. Reconciler0 = MinTShrink, Reconciler1 = BottomUp

This PR:

Screenshot 2024-11-12 232320

Current main:

Screenshot 2024-11-12 232328

Test script for PR in tests\test_benchmark.py, run with pytest tests\test_benchmark.py::test_reconciler -v -s --benchmark-min-rounds=20 --disable-warnings

review-notebook-app[bot] commented 3 weeks ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

MarcoGorelli commented 2 weeks ago

Hey @elephaint , thanks for trying out Narwhals! We can prioritise pivot as that would be helpful to you

We'd previously run into issues with Polars pre1.0 inconsistencies, but from this PR it looks you only need the 1.0+ behaviour, right?

Is there anything else we could do in Narwhals to help you out here?

elephaint commented 2 weeks ago

Hey @elephaint , thanks for trying out Narwhals! We can prioritise pivot as that would be helpful to you

We'd previously run into issues with Polars pre1.0 inconsistencies, but from this PR it looks you only need the 1.0+ behaviour, right?

Is there anything else we could do in Narwhals to help you out here?

Thanks! For this PR / repo I think indeed at the moment the only thing I miss is pivot - which I kind of get as it's an expensive op (as most fast frameworks are column based).

Another minor thing that would be great is ability to ascertain datatypes at the op level without casting afterwards. For example, doing .to_dummies() with Polars will result in an Uint8 but with Pandas in a Float64. I know it can be done by subsequently casting over all the generated columns, but not sure this is the most efficient way when 1000+ columns are generated at once through the .to_dummies() command. But then again, this is minor, in this PR I settled on not using dummies.

A thing that I found a bit unintuitive is that Narwhals seems to keep track of the order of a Pandas Integer Index even after narwhallifying it - I didn't expect that. Again minor thing, because I just need to remember to do .maybe_reset_index() after every op that potentially changes the order.

But awesome library, keep up the good work! For this PR I wanted to play with it for Polars support, maybe later I'll try other frameworks - should almost work out of the box now (save for the pivot op).

Edit: maybe one other thing - I see these warnings:

but it doesn't show which codeline it affects, so difficult to debug? I know where I'm adding columns and stuff, but that's in a lot of places.

MarcoGorelli commented 1 week ago

Thanks so much for your feedback! 🙏

pivot: we're revisiting the PR and will try to make a release shortly
to_dummies: we've addressed that as part of this PR so that the output dtype is consistently Int8, we'd overlooked the dtype difference the first time round. Unsigned integers can be a bit unexpected when working with pandas, so Int8 seemed like a good compromise. Would this work for you?
fragmentation warning: technically the warning is coming from pandas, but it's really useful that you've reported this to us - I'll take a look to see if there's anything we're doing that's causing it (and whether it's an issue)

elephaint commented 1 week ago

Thanks so much for your feedback! 🙏

pivot: we're revisiting the PR and will try to make a release shortly

Wow, awesome. I'll have a go at it once released.

to_dummies: we've addressed that as part of this PR so that the output dtype is consistently Int8, we'd overlooked the dtype difference the first time round. Unsigned integers can be a bit unexpected when working with pandas, so Int8 seemed like a good compromise. Would this work for you?

Yes, perfect! I think in general what I'd expect (in an ideal world 🙈 - I know this is insanely hard to get perfect) from Narwhals is output equivalence up to DataFrame type (i.e. everything is the same, except the 'container' of the data - e.g. dtypes, row order, column order, column names, precision, nulls, nans, infs, categories are all similar).

fragmentation warning: technically the warning is coming from pandas, but it's really useful that you've reported this to us - I'll take a look to see if there's anything we're doing that's causing it (and whether it's an issue)

👍

MarcoGorelli commented 1 week ago

thanks! it's released, and the fragmentation warning is hopefully addressed too

elephaint commented 1 week ago

thanks! it's released, and the fragmentation warning is hopefully addressed too

Awesome, thanks, I'll have a look at it

elephaint commented 1 week ago

thanks! it's released, and the fragmentation warning is hopefully addressed too

Great work, I don't see the fragmentation messages anymore, and pivot works as expected. Thanks! 👍

Nixtla / hierarchicalforecast