elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.12k stars 123 forks source link

Use case: Conditionally mutate data #978

Closed ryancurtin closed 2 months ago

ryancurtin commented 2 months ago

My use case involves mutating data in a particular column, but only if a certain condition matches. Mutate supports this functionality now, but since my rules can come from an external data source, I'd need to do this programmatically. Polars has the when function for this operation.

I'd like to avoid creating an intermediate column to evaluate my conditions first or using Series.transform so I was wondering if there's a recommended approach. I'm familiar with the library and I have combed through the docs quite a bit, but it wouldn't be the first time I've overlooked something obvious.

billylanchantin commented 2 months ago

Does cond work?

If not, can you provide an example?

ryancurtin commented 2 months ago

Thanks for your response. cond may not really be what I want, as it has lots of the same drawbacks as mutate. I've created a simple example to further elaborate on my use case:

# Stored in DB, can have complicated logic (i.e. multiple conditons, and/or evaluation, etc).
# For simplicity's sake, here are two very basic rules
rules = [
  %Rule{column: "b", conditions: [%{column: "a", operator: ">", value: 100}], match_value: "test", result_value: "good"},
  %Rule{column: "b", conditions: [%{column: "a", operator: ">", value: 500}], match_value: "good", result_value: "great"},
]

df = DF.new(%{
  a: [10, 100, 1000],
  b: ["test", "test", "test"]
})

Enum.reduce(rules, df, fn %Rule{column: column, conditions: conditions, match_value: match, result_value: result}, acc_df ->
  DF.mutate_with(acc_df, fn ldf ->
    # Pseudo-code for what I'd want to happen below - I currently have code to parse and evaluate 
    # the rules into a boolean Series, but with no way of using it in `mutate_with`.
    conditions_passed = Evaluator.evaluate_rules(conditions, ldf)

    if conditions_passed do
      %{column => Series.replace(ldf[column], match, result)}
    else
      %{}
    end
  end)
end)

The conditions for rules don't have any bespoke operations. It'd be just equality checks and comparison operators such as >, <, >=, <=, etc. With that being the case, I was optimistic there is a solution buried somewhere in the existing Explorer API. Please let me know if you'd like any further clarification.

josevalim commented 2 months ago

Thanks for the example. The cond/if/else inside mutate compiles down to Polars when/then/else, so it is not clear to me how they are different or why cond is not enough. Can you expand your executable code to something we can actually execute, perhaps a Mix.install([:explorer]) script, so we can play with it and try to find a working solution?

ryancurtin commented 2 months ago

Sure thing, I appreciate you looking into this further. I've created a gist with my example: https://gist.github.com/ryancurtin/f6c6c00450528dd8dd9f094f9472e3a3

As I noted in the comments of the code, the biggest issue I have is being able to properly represent conditions in an Explorer.Query. I tried to demonstrate a reasonably complex rule (an OR comparison between the two of them), but maybe the heart of the problem is if it's possible (or even recommended) to generate a complex query with a dynamic operation (i.e. >, <, >=, <= can be specified with a variable) like that programmatically. Essentially I'd want that Evaluator module in my example to produce a Query instead of a Series.

billylanchantin commented 2 months ago

For your final evaluation -- the TODO in your gist -- I believe you're looking for Series.select/3.

Doing if boolean_series do ... won't work because boolean_series contains many boolean values. So unless it makes sense in your use case to use Series.all?/1 or Series.any?/1 to collapse the boolean_series down to a single boolean (and I don't think it does), you'll need to let select decide which value to take on a row-by-row basis.

As for the confusion around how to iteratively build up a Explorer.Query based on multiple RuleObjects: I'd say there isn't a "elegant" way to do what you're trying to accomplish right now (if I've understood it). Today, our Explorer.Query construct works based off callbacks. So your reducer would need to iteratively wrap the accumulated callback in another callback. Something like this:

lazy_fun =
  Enum.reduce(rules, fn x -> x end, fn rule, acc_fun ->
    rule_fun = turn_rule_into_fun(rule)
    fn x -> x |> acc_fun.() |> rule_fun.() end
  end)

DF.mutate_with(df, lazy_fun)

Final thought: your gist was a fairly long and I admit I only skimmed it. In future, try simplifying your examples as much as possible. Minimal working examples (MWEs) are extremely valuable. ATM, I'm still not sure if I fully understood what you goal is and have made my best guess.

ryancurtin commented 2 months ago

@billylanchantin - I can confirm that Series.select/3 is exactly what I want! I was trying to perform string replacement on an individual element in a Series only if a bunch of boolean logic (based on either a scalar value or an element in another Series) evaluated to true. Series.select/3 seems to give me that in spades. I'm not sure how I overlooked that, though I guess the name of the function is fairly generic at first glance.

Thanks for confirming my interpretation of query, and that what I was looking to do is also possible building up functions.

RE: Examples - definitely noted for the future. I'd say my first code snippet had most of the relevant code and I hadn't quite conveyed enough about what I was trying to do. The gist just had context around the evaluation of rules and thus wasn't entirely necessary to understand the problem (skeleton functions with typespecs probably would have been enough).

josevalim commented 2 months ago

Glad to hear it has been addressed!

Btw, regarding the examples/gist, I think if the first example was executable, it would have been perfect in size. :) But I am glad we could all build on top of each other's feedback.

I was also able to make this work:

require Explorer.DataFrame, as: DF

rules = [
  %{
    column: "b",
    conditions: [%{column: "a", operator: "greater", value: 100}],
    match_value: "test",
    result_value: "good"
  },
  %{
    column: "b",
    conditions: [%{column: "a", operator: "greater", value: 500}],
    match_value: "good",
    result_value: "great"
  }
]

df =
  DF.new(%{
    a: [10, 100, 1000],
    b: ["test", "test", "test"]
  })

defmodule Evaluator do
  def evaluate(df, conditions) do
    conditions
    |> Enum.map(fn %{column: column, operator: op, value: value} ->
      apply(Explorer.Series, String.to_existing_atom(op), [df[column], value])
    end)
    |> Enum.reduce(&Explorer.Series.and/2)
  end
end

Enum.reduce(rules, df, fn %{
                            column: column,
                            conditions: conditions,
                            match_value: match,
                            result_value: result
                          },
                          acc_df ->
  DF.mutate(
    acc_df,
    [
      {^column,
       if Evaluator.evaluate(df(), ^conditions) do
         select(col(^column) == ^match, ^result, col(^column))
       else
         col(^column)
       end}
    ]
  )
end)
|> IO.inspect()

But it requires pushing a df() macro to Explorer.Query, which I will do soon. But @billylanchantin nailed it, you don't need the conditional above, you could instead write it as select(Series.and(condition, df[column] == match)) and that will likely be clearer anyway.

ryancurtin commented 2 months ago

@josevalim - Appreciate the detailed clarification! The df() macro was the construct I was missing, though I'm not sure I would have been able to properly articulate that. Knowing this is all possible with mutate (and your response about compiling down to when/then/else) helps a lot!