`Series.filter` should work inside `DataFrame.summarise`

billylanchantin commented 5 months ago

Originally noted here:

https://elixirforum.com/t/using-series-filter-inside-dataframe-summarise/64166

Example:

    require Explorer.DataFrame, as: DF

    DF.new(a: [1, 2, 2], b: ["x", "y", "z"])
    |> DF.group_by(:a)
    |> DF.summarise(c: filter(b, _ != "z"))

yields:

** (ArgumentError) expected a variable to be given to var!, got: Explorer.DataFrame.pull(var!(df, Explorer.Query), :df)
    (elixir 1.16.0) expanding macro: Kernel.var!/2
    iex:5: (file)
    (explorer 0.8.3-dev) expanding macro: Explorer.Query.query/1
    iex:5: (file)
    (elixir 1.16.0) expanding macro: Kernel.|>/2
    iex:5: (file)
    (elixir 1.16.0) expanding macro: Kernel.|>/2
    iex:5: (file)

josevalim commented 5 months ago

I am not sure I agree. Wouldn’t that be the same as a DF.filter beforehand? In any case, we should at least improve the error message. :)

billylanchantin commented 5 months ago

In this case it'd be the same, but mine is just a minimal example. The original example from elixirforum isn't equivalent.

I don't see why we shouldn't support it. But if we can't for some reason, then definitely an improved error message is the way to go.

mhanberg commented 5 months ago

> I am not sure I agree. Wouldn’t that be the same as a DF.filter beforehand? In any case, we should at least improve the error message. :)

The group_by makes DF.filter not entirely viable without backfilling some column values after the fact.

For example, our current approach looks like this. In the future, we will also have 3 more of these aggregations.

I have to get the distinct values of sim_idx to use in a join later, so that we can backfill any group that the drop_nil removes entirely.

I believe that filtering a series inside summarise would make that

Really, what I want to do for each column of interest inside the group is "give me the first non-nil value, or if the series only has nil, then 'none'."

    sim_idx = data_frame |> DataFrame.distinct([:sim_idx])

    any_data_frame =
      data_frame
      |> DataFrame.mutate(
        any_id:
          if result in ["one", "two", "three", "four"] do
            person_id
          else
            nil
          end
      )
      |> DataFrame.drop_nil([:any_id])
      |> DataFrame.group_by(["sim_idx"])
      |> DataFrame.summarise(any: first(any_id))
      |> DataFrame.join(sim_idx, on: [:sim_idx], how: :right)

    two_data_frame =
      data_frame
      |> DataFrame.mutate(
        two_id:
          if result == "two" do
            person_id
          else
            nil
          end
      )
      |> DataFrame.drop_nil([:two_id])
      |> DataFrame.group_by(["sim_idx"])
      |> DataFrame.summarise(two: first(two_id))
      |> DataFrame.join(sim_idx, on: [:sim_idx], how: :right)

    DataFrame.join(any_data_frame, two_data_frame, on: [:sim_idx])
    |> DataFrame.mutate(
      any: fill_missing(any, "none"),
      two: fill_missing(two, "none")
    )
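
To pin down the per-group rule described above ("first non-nil value, or 'none'"), here is a minimal sketch in plain Elixir, without Explorer. The rows, the `FirstMatch` module, and `first_or_none/2` are hypothetical names invented for illustration; only the column names (`sim_idx`, `result`, `person_id`) come from the snippet above. It is meant to show the intended semantics, not how Explorer would implement them.

```elixir
# Hypothetical rows; the column names mirror the ones in the snippet above.
rows = [
  %{sim_idx: 1, result: "one", person_id: "p1"},
  %{sim_idx: 1, result: "two", person_id: "p2"},
  %{sim_idx: 2, result: "five", person_id: "p3"},
  %{sim_idx: 3, result: "five", person_id: "p4"},
  %{sim_idx: 3, result: "two", person_id: "p5"}
]

defmodule FirstMatch do
  # Returns the person_id of the first row whose result is in `wanted`,
  # or "none" when no row in the group matches.
  def first_or_none(rows, wanted) do
    Enum.find_value(rows, "none", fn row ->
      if row.result in wanted, do: row.person_id
    end)
  end
end

summary =
  rows
  |> Enum.group_by(& &1.sim_idx)
  |> Enum.map(fn {sim_idx, group} ->
    %{
      sim_idx: sim_idx,
      any: FirstMatch.first_or_none(group, ["one", "two", "three", "four"]),
      two: FirstMatch.first_or_none(group, ["two"])
    }
  end)
  |> Enum.sort_by(& &1.sim_idx)

IO.inspect(summary)
```

Every `sim_idx` keeps a row here, including group 2, where nothing matches and both columns fall back to "none". That is the behaviour the drop_nil, summarise, right join, and fill_missing sequence above reconstructs by hand.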

I might be misunderstanding, but the dplyr docs seem to imply that their API can do grouped filtering: https://dplyr.tidyverse.org/articles/grouping.html?q=summ#filter


mhanberg commented 5 months ago

But, as I send that, I see that DF.filter works with groups... which is what I think José was saying.

Let me try that out 🤦

mhanberg commented 5 months ago

Yeah, so that method can work, but it seems like just my previous workaround, rearranged.

I think the key thing is that the call to DF.summarise after the call to DF.filter will not summarise any group whose rows were all filtered out.
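
A minimal sketch of that behaviour, using the data from the original report plus one extra, hypothetical group (`a: 3`) whose only row is removed by the filter. It assumes Explorer 0.8.x installed through `Mix.install` and that `DF.filter` on a grouped frame behaves as described in this thread. The variable names `keys`, `summary`, and `restored` are illustrative; the join and `fill_missing` step mirrors the backfilling workaround shown earlier.

```elixir
Mix.install([{:explorer, "~> 0.8"}])

require Explorer.DataFrame, as: DF

# Hypothetical data: group 3 contains only a row that the filter removes.
df = DF.new(a: [1, 2, 2, 3], b: ["x", "y", "z", "z"])

# Keep the distinct group keys aside so dropped groups can be restored later.
keys = DF.distinct(df, [:a])

summary =
  df
  |> DF.group_by(:a)
  |> DF.filter(b != "z")
  |> DF.summarise(c: first(b))

# Group 3 has no rows left after the filter, so `summary` should have no row
# for it. A right join against the keys brings the group back with a nil,
# which is then replaced by "none".
restored =
  summary
  |> DF.join(keys, on: [:a], how: :right)
  |> DF.mutate(c: fill_missing(c, "none"))

IO.inspect(summary)
IO.inspect(restored)
```

If `Series.filter` were supported inside `DF.summarise`, the filtering would happen per column after grouping, so every group would presumably still yield a row and the join step would not be needed.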

josevalim commented 2 months ago

I think we can close this one for now and revisit later. :)

elixir-explorer / explorer

`Series.filter` should work inside `DataFrame.summarise` #927