Nixtla / hierarchicalforecast

Probabilistic Hierarchical forecasting πŸ‘‘ with statistical and econometric methods.
https://nixtlaverse.nixtla.io/hierarchicalforecast
Apache License 2.0
590 stars 76 forks source link

[FEAT] Add polars support #305

Open elephaint opened 3 weeks ago

elephaint commented 3 weeks ago

Adds support for Polars via Narwhals.

The extension to all frameworks supported under Narwhals should hereafter be (nearly) trivial.

Breaking

PR removes support for NIXTLA_ID_AS_COL, i.e., hierarchicalforecast from now on will return everything as a flat table, without an index in case of a Pandas DataFrame.

Open items

Fixes / better solutions required

Tests

Performance evaluation

Reconcilers seem to run somewhat faster after this PR. Reconciler0 = MinTShrink, Reconciler1 = BottomUp

This PR:

Screenshot 2024-11-12 232320

Current main:

Screenshot 2024-11-12 232328

Test script for PR in tests\test_benchmark.py, run with pytest tests\test_benchmark.py::test_reconciler -v -s --benchmark-min-rounds=20 --disable-warnings

review-notebook-app[bot] commented 3 weeks ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

MarcoGorelli commented 2 weeks ago

Hey @elephaint , thanks for trying out Narwhals! We can prioritise pivot as that would be helpful to you

We'd previously run into issues with Polars pre1.0 inconsistencies, but from this PR it looks you only need the 1.0+ behaviour, right?

Is there anything else we could do in Narwhals to help you out here?

elephaint commented 2 weeks ago

Hey @elephaint , thanks for trying out Narwhals! We can prioritise pivot as that would be helpful to you

We'd previously run into issues with Polars pre1.0 inconsistencies, but from this PR it looks you only need the 1.0+ behaviour, right?

Is there anything else we could do in Narwhals to help you out here?

Thanks! For this PR / repo I think indeed at the moment the only thing I miss is pivot - which I kind of get as it's an expensive op (as most fast frameworks are column based).

Another minor thing that would be great is ability to ascertain datatypes at the op level without casting afterwards. For example, doing .to_dummies() with Polars will result in an Uint8 but with Pandas in a Float64. I know it can be done by subsequently casting over all the generated columns, but not sure this is the most efficient way when 1000+ columns are generated at once through the .to_dummies() command. But then again, this is minor, in this PR I settled on not using dummies.

A thing that I found a bit unintuitive is that Narwhals seems to keep track of the order of a Pandas Integer Index even after narwhallifying it - I didn't expect that. Again minor thing, because I just need to remember to do .maybe_reset_index() after every op that potentially changes the order.

But awesome library, keep up the good work! For this PR I wanted to play with it for Polars support, maybe later I'll try other frameworks - should almost work out of the box now (save for the pivot op).

Edit: maybe one other thing - I see these warnings:

image

but it doesn't show which codeline it affects, so difficult to debug? I know where I'm adding columns and stuff, but that's in a lot of places.

MarcoGorelli commented 1 week ago

Thanks so much for your feedback! πŸ™

elephaint commented 1 week ago

Thanks so much for your feedback! πŸ™

  • pivot: we're revisiting the PR and will try to make a release shortly

Wow, awesome. I'll have a go at it once released.

  • to_dummies: we've addressed that as part of this PR so that the output dtype is consistently Int8, we'd overlooked the dtype difference the first time round. Unsigned integers can be a bit unexpected when working with pandas, so Int8 seemed like a good compromise. Would this work for you?

Yes, perfect! I think in general what I'd expect (in an ideal world πŸ™ˆ - I know this is insanely hard to get perfect) from Narwhals is output equivalence up to DataFrame type (i.e. everything is the same, except the 'container' of the data - e.g. dtypes, row order, column order, column names, precision, nulls, nans, infs, categories are all similar).

  • fragmentation warning: technically the warning is coming from pandas, but it's really useful that you've reported this to us - I'll take a look to see if there's anything we're doing that's causing it (and whether it's an issue)

πŸ‘

MarcoGorelli commented 1 week ago

thanks! it's released, and the fragmentation warning is hopefully addressed too

elephaint commented 1 week ago

thanks! it's released, and the fragmentation warning is hopefully addressed too

Awesome, thanks, I'll have a look at it

elephaint commented 1 week ago

thanks! it's released, and the fragmentation warning is hopefully addressed too

Great work, I don't see the fragmentation messages anymore, and pivot works as expected. Thanks! πŸ‘