elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.12k stars 123 forks source link

v0.8.0 summarise/count bug? #842

Closed mlineen closed 9 months ago

mlineen commented 9 months ago

Not sure if this is a bug or a feature, but when a nil is present in the series, summarise with count skips the nil (this behavior changed from 0.7.2 to 0.8.0):

iex(3)> Explorer.DataFrame.new(testing: [1, nil, 3]) |> Explorer.DataFrame.summarise(count: count(testing))
#Explorer.DataFrame<
  Polars[1 x 1]
  count u32 [2]
>
iex(2)> Explorer.DataFrame.new(testing: [1, 2, 3]) |> Explorer.DataFrame.summarise(count: count(testing))
#Explorer.DataFrame<
  Polars[1 x 1]
  count u32 [3]
>
billylanchantin commented 9 months ago

Hi @mlineen,

Thanks for the report!

Not sure if this is a bug or a feature

This looks like a bug to me. I didn't see any notes in the Polars release about changing the behavior of count. And AFAICT, the behavior should match that of Explorer.Series.count/1. It does not since:

import Explorer.Series
[1, nil, 3] |> from_list() |> count() #=> 3

I'll try to dig more into it tonight to make sure.

billylanchantin commented 9 months ago

Interesting! So this isn't a "bug" per se. As of Polars 0.36.2, this is the default behavior. From the release:

💥 Breaking changes

We now need to use a different function under the hood if we want the old behavior.

WDYT @philss, @josevalim, @cigrainger?

josevalim commented 9 months ago

@billylanchantin this is tricky. count in SQL does not include nulls indeed. And there is also a chance the behaviour of Series.count outside of a lazy query does not handle nils. So probably what we need to do is:

WDYT?

billylanchantin commented 9 months ago

I think that's a good plan. We'll want to call out that count (now) works like it does in SQL in the docs.

Will we also need to tackle Series.size not being available in lazy series? We may be able to make that work with len:

https://docs.rs/polars/latest/polars/?search=len

benkimpel commented 9 months ago

@billylanchantin this is tricky. count in SQL does not include nulls indeed. And there is also a chance the behaviour of Series.count outside of a lazy query does not handle nils. So probably what we need to do is:

* Make `Series.count` always discard nils, both inside and outside lazy queries

* Make `Series.size` return the whole size. Inside groups, it should return the size of each group

WDYT?

@josevalim is this an accurate summary of your proposal?

Explorer Series.count = Polars Series count Explorer Series.size ≈ Polars Series len (≈ because I'm not sure what Polars behavior is regarding groups)

(I work with @mlineen)

josevalim commented 9 months ago

Yes!!

cigrainger commented 9 months ago

Yep I agree with that proposal!

billylanchantin commented 9 months ago

Closed by: