Closed billylanchantin closed 2 months ago
I am not sure I agree. Wouldn’t that be the same as a DF.filter beforehand? In any case, we should at least improve the error message. :)
In this case it'd be the same, but mine is just a minimal example. The original example from elixirforum isn't equivalent.
I don't see why we shouldn't support it. But if we can't for some reason, then definitely an improved error message is the way to go.
The group_by makes DF.filter not entirely viable without backfilling some column values after the fact.
For example, our current approach looks like the code below. In the future, we will also have 3 more of these aggregations.
What I really want to do for each column of interest inside the group is "give me the first non-nil value, or if the series only has nils, then 'none'."

    alias Explorer.DataFrame
    require DataFrame

    # I have to get the distinct values of sim_idx to use in a join later,
    # so that we can backfill any group that drop_nil removes entirely.
    # I believe that filtering a series inside summarise would make that
    # unnecessary.
    sim_idx = data_frame |> DataFrame.distinct([:sim_idx])

    # First matching person_id per sim_idx for any of the listed results.
    any_data_frame =
      data_frame
      |> DataFrame.mutate(
        any_id:
          if result in ["one", "two", "three", "four"] do
            person_id
          else
            nil
          end
      )
      |> DataFrame.drop_nil([:any_id])
      |> DataFrame.group_by(["sim_idx"])
      |> DataFrame.summarise(any: first(any_id))
      |> DataFrame.join(sim_idx, on: [:sim_idx], how: :right)

    # Same aggregation, restricted to result == "two".
    two_data_frame =
      data_frame
      |> DataFrame.mutate(
        two_id:
          if result == "two" do
            person_id
          else
            nil
          end
      )
      |> DataFrame.drop_nil([:two_id])
      |> DataFrame.group_by(["sim_idx"])
      |> DataFrame.summarise(two: first(two_id))
      |> DataFrame.join(sim_idx, on: [:sim_idx], how: :right)

    # Combine the aggregations and backfill the groups that had no match.
    DataFrame.join(any_data_frame, two_data_frame, on: [:sim_idx])
    |> DataFrame.mutate(
      any: fill_missing(any, "none"),
      two: fill_missing(two, "none")
    )
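To pin down the semantics being asked for ("first non-nil value per group, or 'none'"), here is a minimal sketch in plain Elixir using only `Enum` and `Map`. It is not Explorer code and says nothing about how Explorer would implement this; the module name, function name, and sample rows are hypothetical and exist only for illustration.

```elixir
defmodule FirstNonNilSketch do
  # Hypothetical helper: for each group, return the first truthy value of
  # `field`, or `default` when every value in the group is nil.
  # Note: Enum.find_value/3 also skips `false`, which is fine for string ids.
  def first_non_nil_by(rows, group_key, field, default \\ "none") do
    rows
    |> Enum.group_by(& &1[group_key])
    |> Map.new(fn {group, group_rows} ->
      {group, Enum.find_value(group_rows, default, & &1[field])}
    end)
  end
end

# Made-up sample data: group 1 has a non-nil id, group 2 has only nil.
rows = [
  %{sim_idx: 1, any_id: nil},
  %{sim_idx: 1, any_id: "p2"},
  %{sim_idx: 2, any_id: nil}
]

result = FirstNonNilSketch.first_non_nil_by(rows, :sim_idx, :any_id)
# result is %{1 => "p2", 2 => "none"}
```

The point of the sketch is that every group appears in the output, including the one with no non-nil value, without a separate distinct-and-join step.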
I might be misunderstanding, but the dplyr docs seem to imply that their API can do grouped filtering: https://dplyr.tidyverse.org/articles/grouping.html?q=summ#filter
But as I send that, I see that DF.filter works with groups... which is what I think Jose was saying.
let me try that out 🤦
Yeah, so that method can work, but it seems like it's just my previous workaround rearranged.
I think the key thing is that the call to DF.summarise after the call to DF.filter will not summarise any group whose rows were all filtered out.
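The effect described here can be illustrated without Explorer. The sketch below uses plain `Enum` on made-up rows as an analogy: filtering before grouping drops a group entirely, so the group has to be added back afterwards, much like the right join plus `fill_missing` in the workaround. The data and variable names are assumptions for illustration only.

```elixir
# Made-up sample data: group 2 has no row with result == "two".
rows = [
  %{sim_idx: 1, result: "one", person_id: "p1"},
  %{sim_idx: 1, result: "two", person_id: "p2"},
  %{sim_idx: 2, result: "five", person_id: "p3"}
]

# Filter first, then group and take the first id per group.
# Group 2 never reaches the grouping step, so it is absent from the result.
filtered_first =
  rows
  |> Enum.filter(&(&1.result == "two"))
  |> Enum.group_by(& &1.sim_idx, & &1.person_id)
  |> Map.new(fn {group, ids} -> {group, hd(ids)} end)

# Backfill the dropped groups from the full list of group keys,
# analogous to joining against the distinct sim_idx values.
all_groups = rows |> Enum.map(& &1.sim_idx) |> Enum.uniq()

backfilled =
  Map.new(all_groups, fn group ->
    {group, Map.get(filtered_first, group, "none")}
  end)
```

Here `filtered_first` only contains group 1, while `backfilled` contains both groups, with "none" for group 2.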
I think we can close this one for now and revisit later. :)
Originally noted here:
Example:
yields: