azmyrajab / polars_ols

Polars least squares extension - enables fast linear model polar expressions
MIT License
102 stars 9 forks source link

wls error: chunked array is not contiguous #28

Closed tmct closed 1 month ago

tmct commented 1 month ago

Hi,

I have witnessed the following error while performing a WLS with this library:

panicked at src/expressions.rs:54:22:
Failed to convert f64 series to ndarray: ComputeError(ErrString("chunked array is not contiguous"))

I fixed this by first transforming my df with pl.DataFrame(numpy_array, schema=df.columns) (and my colleague points out that DataFrame.rechunk() is even better) but please could polars-ols handle non-contiguous DataFrames gracefully? (Or if you don't want to support that, please could the error message let users know how to fix, e.g. using df = df.rechunk()?)

Apologies, I cannot share my repro for this... I tried making some shareable simple examples but it would take more effort to get right. It was just from a normal wls call a bit like this:

pl.col("a").least_squares.wls(
    *[f"foo{i}" for i in range(10)],
    sample_weights="weight",
).alias("a_preds")

polars-ols==0.3.3 polars==1.4.1 Python 3.12.5

Many thanks

azmyrajab commented 1 month ago

Hi Tom, thanks for reporting this. yes, I’d want to handle this gracefully (if doable) on the rust / library side.

A problem on my side is polars ols is purely expression based (lazy) and so not easy for me to add a dataframe level rechunk on the users behalf. But what we can do is add expression level rechunk if the data isn’t contiguous.

Would you mind please, in your own example (I don’t need to see it), try adding something like:

pl.col("a").rechunk().least_squares.wls(
    *[pl.col(f"foo{i}").rechunk() for i in range(10)],
    sample_weights=pl.col("weight").rechunk(),
).alias("a_preds")

https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.rechunk.html

And letting me know if this fixes the issue for the original data that was problematic ? If it does, it should be fairly easy for me to add something that does something like that for the user - could aim to release something this weekend if so.

Azmy

jacksonriley commented 1 month ago

I had taken a quick look at this but couldn't figure out how it could be happening, given that the array is explicitly rechunked just before this assertion gets hit: https://github.com/azmyrajab/polars_ols/blob/49ae93c1e6b6d6168c9781c411d31def460b1f08/src/expressions.rs#L47-L54

I suppose it's conceivable that s.rechunk().f64() could result in a ChunkedArray with multiple chunks, but that would feel surprising to me!

azmyrajab commented 1 month ago

Thanks @jacksonriley, I remember adding a rechunk somewhere which i thought should have prevented this sort of issue from happening.

It’s a bit strange/surprising for me too, but will trace through if casting as f64 somehow applies multiple chunks by default (if so will try casting before rechunk) and otherwise chase through the to_ndarray code path that breaks and see if there is something unexpected there. We’ll figure something otherwise I’ll ask someone from polars team for 👀

azmyrajab commented 1 month ago

hi @jacksonriley @tmct

Tried to make sure this issue doesn't bite again - couldn't get a clean reproduce but added a test and made sure that no nulls and a re-chunk are guarenteed before calling to_ndarray(). to_ndarray() which calls cont_slice() which checks two things: a) no nulls and b) there is one chunk, so these should be impossible in the new code

In any case, if you observe differently please try to reproduce and I can take another look. But don't think this is likely in this version - I hope!


https://github.com/azmyrajab/polars_ols/blob/0422b62ddcaa6c83e32b7a6fcb29351b4bfc3ec1/src/expressions.rs#L32-L44

 /// Contiguous slice
    pub fn cont_slice(&self) -> PolarsResult<&[T::Native]> {
        polars_ensure!(
            self.chunks.len() == 1 && self.chunks[0].null_count() == 0,
            ComputeError: "chunked array is not contiguous"
        );
        Ok(self.downcast_iter().next().map(|arr| arr.values()).unwrap())
    }

https://github.com/azmyrajab/polars_ols/blob/0422b62ddcaa6c83e32b7a6fcb29351b4bfc3ec1/tests/test_ols.py#L969C5-L995

jacksonriley commented 1 month ago

Thanks for taking a look at this! It looks pretty bullet-proof now 😁 🤞

tmct commented 1 month ago

Many thanks!