Open dvic opened 6 days ago
Thanks for the report!
No, this is not expected. Both should work.
Curiously, we just fixed a bug where empty lists ended up with the :null dtype instead of the intended :decimal dtype. That's what I expected to be going on here. But on main, both of your examples now fail like this:
** (RuntimeError) DataFrame mismatch.
expected:
names: ["summy"]
dtypes: %{"summy" => {:f, 64}}
got:
names: ["summy"]
dtypes: %{"summy" => {:list, :null}}
(explorer 0.11.0-dev) lib/explorer/polars_backend/shared.ex:61: Explorer.PolarsBackend.Shared.apply_dataframe/4
(explorer 0.11.0-dev) lib/explorer/polars_backend/data_frame.ex:895: Explorer.PolarsBackend.DataFrame.summarise_with/3
iex:4: (file)
Not sure what's going on there yet. Will have to dig deeper.
Not sure if it's related, but it looks like the mean function doesn't work with decimals in dataframes, even though the docs mention that it's supported: https://hexdocs.pm/explorer/Explorer.Series.html#mean/1-supported-dtypes
alias Explorer.Series, as: S
require Explorer.DataFrame, as: DF
DF.new([
  {"field_a", S.from_list([1], dtype: :s64)},
  {"field_b", S.from_list([1], dtype: {:decimal, 10, 2})},
  {"field_c", S.from_list([1], dtype: {:decimal, 10, 2})}
])
|> DF.summarise(a_avg: mean(field_a), b_avg: mean(field_b), c_avg: mean(cast(field_c, :f64)))
gives
#Explorer.DataFrame<
Polars[1 x 3]
a_avg f64 [1.0]
b_avg decimal[10, 2] [nil]
c_avg f64 [0.01]
>
While
s = Explorer.Series.from_list([1], dtype: {:decimal, 10, 2})
gives
#Explorer.Series<
Polars[1]
decimal[10, 2] [0.01]
>
I'm looking at some similar symptoms, which might be related:
df =
  [
    %{color: "red", material: ["plastic", "glas"], series: ["A", "B", "C"]},
    %{color: "blue", material: ["plastic"], series: ["A", "B", "C"]},
    %{color: "green", material: ["plastic", "glas"], series: ["A", "B"]},
    %{color: "stardust", material: ["plastic"], series: ["A"]}
  ]
  |> Explorer.DataFrame.new()
  |> Explorer.DataFrame.explode(:material)
  |> Explorer.DataFrame.explode(:series)
colors =
  df
  |> Explorer.DataFrame.mutate(material: material == "glas", series: series == "C")
  |> Explorer.DataFrame.group_by(:color)
  |> Explorer.DataFrame.summarise_with(fn frame ->
    for col <- [:material, :series] do
      {col, Explorer.Series.sum(frame[col])}
    end
  end)
  # #Explorer.DataFrame<
  #   Polars[4 x 3]
  #   color string ["red", "blue", "green", "stardust"]
  #   material u32 [3, 0, 2, 0]
  #   series u32 [2, 1, 0, 0]
  # >
  |> Explorer.DataFrame.mutate_with(fn frame ->
    # #Explorer.DataFrame<
    #   QueryFrame[??? x 3]
    #   color string
    #   material boolean # back to boolean here for some reason
    #   series boolean # back to boolean here for some reason
    # >
    for col <- [:material, :series] do
      {col, Explorer.Series.greater(frame[col], 0)}
    end
  end)
@LostKobrakai Hm yeah, that's also odd. Even though Series.sum accepts :boolean, it looks like the dtype of the result isn't being propagated by DataFrame.summarise_with. I'm not sure if it's related yet, but it's certainly a problem.
For your specific instance, you can get around whatever this is with an explicit cast:
colors =
  df
  |> Explorer.DataFrame.mutate(material: material == "glas", series: series == "C")
  |> Explorer.DataFrame.group_by(:color)
  |> Explorer.DataFrame.summarise_with(fn frame ->
    for col <- [:material, :series] do
      {col, frame[col] |> Explorer.Series.cast(:u32) |> Explorer.Series.sum()}
    end
  end)
  |> Explorer.DataFrame.mutate_with(fn frame ->
    for col <- [:material, :series] do
      {col, Explorer.Series.greater(frame[col], 0)}
    end
  end)
Or if you don't mind a little code golf:
require Explorer.DataFrame, as: DF
colors =
  df
  |> DF.mutate(material: material == "glas", series: series == "C")
  |> DF.group_by(:color)
  |> DF.summarise(for col <- across(["material", "series"]) do
    {col.name, sum(cast(col, :u32)) > 0}
  end)
This
gives
But this works
Is this expected? Are decimals different from floats in this regard?