elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

mismatched dtypes with decimal and summarise #1018

Open dvic opened 6 days ago

dvic commented 6 days ago

This

require Explorer.DataFrame

Explorer.DataFrame.new([
  {"field_a", Explorer.Series.from_list([], dtype: :s64)},
  {"field_b", Explorer.Series.from_list([], dtype: :s64)},
  {"field_c", Explorer.Series.from_list([], dtype: {:decimal, 18, 2})},
])
|> Explorer.DataFrame.summarise(summy: sum(field_a * field_b) * field_c)

gives

(ArgumentError) cannot invoke Explorer.Series.multiply/2 with mismatched dtypes: {:s, 64} and :null
    (explorer 0.10.0) lib/explorer/series.ex:6814: Explorer.Series.dtype_mismatch_error/3

But this works

require Explorer.DataFrame

Explorer.DataFrame.new([
  {"field_a", Explorer.Series.from_list([], dtype: :s64)},
  {"field_b", Explorer.Series.from_list([], dtype: :s64)},
  {"field_c", Explorer.Series.from_list([], dtype: {:decimal, 18, 2})},
])
|> Explorer.DataFrame.summarise(summy: sum(field_a * field_b) * cast(field_c, :f64))

Is this expected? Are decimals different from floats in this regard?

billylanchantin commented 5 days ago

Thanks for the report!

No, this is not expected. Both should work.

Curiously, we just fixed a bug where empty lists ended up with the :null dtype instead of the intended :decimal dtype. That's what I expected to be going on here. But on main, now both of your examples fail like this:

** (RuntimeError) DataFrame mismatch.

expected:

    names: ["summy"]
    dtypes: %{"summy" => {:f, 64}}

got:

    names: ["summy"]
    dtypes: %{"summy" => {:list, :null}}

    (explorer 0.11.0-dev) lib/explorer/polars_backend/shared.ex:61: Explorer.PolarsBackend.Shared.apply_dataframe/4
    (explorer 0.11.0-dev) lib/explorer/polars_backend/data_frame.ex:895: Explorer.PolarsBackend.DataFrame.summarise_with/3
    iex:4: (file)

Not sure what's going on there yet. Will have to dig deeper.

dvic commented 3 days ago

Not sure if it's related, but it looks like the mean function doesn't work with decimals in dataframes? (while the docs mention that it's supported: https://hexdocs.pm/explorer/Explorer.Series.html#mean/1-supported-dtypes)

alias Explorer.Series, as: S
require Explorer.DataFrame, as: DF

DF.new([
  {"field_a", S.from_list([1], dtype: :s64)},
  {"field_b", S.from_list([1], dtype: {:decimal, 10, 2})},
  {"field_c", S.from_list([1], dtype: {:decimal, 10, 2})}
])
|> DF.summarise(a_avg: mean(field_a), b_avg: mean(field_b), c_avg: mean(cast(field_c, :f64)))

gives

#Explorer.DataFrame<
  Polars[1 x 3]
  a_avg f64 [1.0]
  b_avg decimal[10, 2] [nil]
  c_avg f64 [0.01]
>

While

s = Explorer.Series.from_list([1], dtype: {:decimal, 10, 2})

gives

#Explorer.Series<
  Polars[1]
  decimal[10, 2] [0.01]
>

LostKobrakai commented 16 hours ago

I'm looking at some similar symptoms, which might be related:

df =
  [
    %{color: "red", material: ["plastic", "glas"], series: ["A", "B", "C"]},
    %{color: "blue", material: ["plastic"], series: ["A", "B", "C"]},
    %{color: "green", material: ["plastic", "glas"], series: ["A", "B"]},
    %{color: "stardust", material: ["plastic"], series: ["A"]}
  ]
  |> Explorer.DataFrame.new()
  |> Explorer.DataFrame.explode(:material)
  |> Explorer.DataFrame.explode(:series)

colors =
  df
  |> Explorer.DataFrame.mutate(material: material == "glas", series: series == "C")
  |> Explorer.DataFrame.group_by(:color)
  |> Explorer.DataFrame.summarise_with(fn frame ->
    for col <- [:material, :series] do
      {col, Explorer.Series.sum(frame[col])}
    end
  end)
  # #Explorer.DataFrame<
  #   Polars[4 x 3]
  #   color string ["red", "blue", "green", "stardust"]
  #   material u32 [3, 0, 2, 0]
  #   series u32 [2, 1, 0, 0]
  # >
  |> Explorer.DataFrame.mutate_with(fn frame ->
    # #Explorer.DataFrame<
    #   QueryFrame[??? x 3]
    #   color string
    #   material boolean # back to boolean here for some reason
    #   series boolean # back to boolean here for some reason
    # >

    for col <- [:material, :series] do
      {col, Explorer.Series.greater(frame[col], 0)}
    end
  end)

billylanchantin commented 12 hours ago

@LostKobrakai Hm yeah that's also odd. Even though Series.sum accepts :boolean, it looks like the dtype of the result isn't being propagated by DataFrame.summarise_with. I'm not sure if it's related yet, but it's certainly a problem.

For your specific instance, you can get around whatever this is with an explicit cast:

colors =
  df
  |> Explorer.DataFrame.mutate(material: material == "glas", series: series == "C")
  |> Explorer.DataFrame.group_by(:color)
  |> Explorer.DataFrame.summarise_with(fn frame ->
    for col <- [:material, :series] do
      {col, frame[col] |> Explorer.Series.cast(:u32) |> Explorer.Series.sum()}
    end
  end)
  |> Explorer.DataFrame.mutate_with(fn frame ->
    for col <- [:material, :series] do
      {col, Explorer.Series.greater(frame[col], 0)}
    end
  end)

Or if you don't mind a little code golf:

require Explorer.DataFrame, as: DF

colors =
  df
  |> DF.mutate(material: material == "glas", series: series == "C")
  |> DF.group_by(:color)
  |> DF.summarise(for col <- across(["material", "series"]) do
    {col.name, sum(cast(col, :u32)) > 0}
  end)