TidierOrg / Tidier.jl

Meta-package for data analysis in Julia, modeled after the R tidyverse.
MIT License
524 stars 14 forks

Add `case_when` function #31

Closed zhezhaozz closed 1 year ago

zhezhaozz commented 1 year ago

According to this blog post, Julia can reproduce case_when using the ⋅ ? ⋅ : ⋅ ternary operator. For example, in R:

> library(dplyr)
> x <- 1:10
> case_when(
+   x %% 35 == 0 ~ "fizz buzz",
+   x %% 5 == 0 ~ "fizz",
+   x %% 7 == 0 ~ "buzz",
+   TRUE ~ as.character(x)
+ )
 [1] "1"    "2"    "3"    "4"    "fizz" "6"    "buzz" "8"    "9"    "fizz"

In Julia (note that this example uses the divisors 6, 2, and 3 rather than 35, 5, and 7):

julia> x = 1:10
1:10

julia> (x -> x % 6 == 0 ? "fizz buzz" :
             x % 2 == 0 ? "fizz" :
             x % 3 == 0 ? "buzz" :
             string(x)).(x)
10-element Array{String,1}:
 "1"
 "fizz"
 "buzz"
 "fizz"
 "5"
 "fizz buzz"
 "7"
 "fizz"
 "buzz"
 "fizz"

Possible solution

  1. Add a parse_case_when function to transform R's formula syntax into Julia's ternary operator.
  2. Add a case_when that takes an expression and internally calls the parse_case_when function.
  3. case_when should support auto-vectorization so that the parse_autovec parsing function can handle it when it is called within @mutate.
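
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration of the idea, not the implementation that ended up in Tidier.jl: the names nest_ternary and @case_when_sketch are hypothetical, the sketch uses `cond => value` pairs instead of R's `~` formulas, and it falls back to missing when no clause matches.

```julia
# Hypothetical sketch: fold `cond => value` clauses into nested if/else
# expressions, which is what a chain of ternary operators lowers to.
function nest_ternary(clauses)
    # Fold right to left so the first clause ends up outermost.
    foldr(clauses; init = :missing) do clause, rest
        # `a => b` parses as Expr(:call, :(=>), a, b)
        cond, value = clause.args[2], clause.args[3]
        Expr(:if, cond, value, rest)
    end
end

# Hypothetical macro wrapper around nest_ternary.
macro case_when_sketch(clauses...)
    esc(nest_ternary(collect(clauses)))
end

# Scalar definition; vectorize it with the broadcasting dot.
fizz(x) = @case_when_sketch(
    x % 6 == 0 => "fizz buzz",
    x % 2 == 0 => "fizz",
    x % 3 == 0 => "buzz",
    true => string(x)
)

labels = fizz.(1:10)
```

Because the macro expands to ordinary if/else branches, only the branch whose condition is true gets evaluated for each element.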
kdpsingh commented 1 year ago

I came across that same blog post a few weeks ago!

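A couple of things we need to consider:

  1. Should case_when() work outside of Tidier.jl or should it be implemented as a pseudo-function?
  2. While that blog post is good for a single case, we need to make it easy to vectorize case_when (because I believe the ternary operators require map, list comprehensions, or loops for vectorization).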

zhezhaozz commented 1 year ago

> Should case_when() work outside of Tidier.jl or should it be implemented as a pseudo-function?

I think case_when should work outside Tidier.jl, because then we can take advantage of the fact that @mutate can already handle user-defined functions and auto-vectorization for a single column.

> While that blog post is good for a single case, we need to make it easy to vectorize case_when (because I believe the ternary operators require map, list comprehensions, or loops for vectorization).

Yes, I believe we need to vectorize ternary operators.
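
For reference, here is a small plain-Julia sketch (no Tidier.jl involved) of the usual ways to apply a ternary expression to a whole vector. The function name label is just an illustrative choice.

```julia
xs = 1:10

# A ternary chain is a scalar expression, so wrap it in a function first.
label(x) = x % 6 == 0 ? "fizz buzz" :
           x % 2 == 0 ? "fizz" :
           x % 3 == 0 ? "buzz" :
           string(x)

via_broadcast = label.(xs)                   # broadcasting dot
via_map = map(label, xs)                     # map
via_comprehension = [label(x) for x in xs]   # array comprehension

# The ternary can also sit directly inside a comprehension.
parity = [x % 2 == 0 ? "even" : "odd" for x in xs]
```

All three forms evaluate the ternary once per element, so each element still short-circuits on its own condition.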

kdpsingh commented 1 year ago

Resolved by PR #41.

Nosferican commented 1 year ago

It seems like ternary functions are still unsupported. For example,

tbl = DataFrame(x = ["alpha", "beta", "charlie", "delta", "echo"])
helper(x) = contains(x, r"a.*a") ? getproperty(match(r"a.*a", x), :match) : "_"
@chain tbl begin
  @mutate(
    count_a = helper(x),
    count_b = case_when(
      contains(x, r"a.*a") => getproperty(match(r"a.*a", x), :match),
      true => "_"
    )
  )
end

The same thing happens with ifelse or if_else: neither short-circuits, so both branches are evaluated and the errors are thrown.
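
The difference can be reproduced in plain Julia, without Tidier.jl. In this minimal sketch, lazy_helper and eager_helper are illustrative names only.

```julia
# The ternary operator only evaluates the branch that is selected.
lazy_helper(x) = contains(x, r"a.*a") ? match(r"a.*a", x).match : "_"

# ifelse is an ordinary function, so all of its arguments are evaluated
# before the call, including the branch that will not be returned.
eager_helper(x) = ifelse(contains(x, r"a.*a"), match(r"a.*a", x).match, "_")

lazy_result = lazy_helper("beta")   # no match, so the fallback branch runs

# "beta" has no match, so match(...) returns nothing and `.match` throws.
eager_threw = try
    eager_helper("beta")
    false
catch
    true
end
```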

kdpsingh commented 1 year ago

I think I know what's happening here and will take a look in the near future.

Nosferican commented 1 year ago

Should I open a new issue to track the progress?

kdpsingh commented 1 year ago

Sorry, let me re-open this issue. I’m on vacation this week but will look at this next week.

Nosferican commented 1 year ago

Great! Thanks.

kdpsingh commented 1 year ago

The short answer is that if you want to use case_when() or if_else(), the condition and every return value have to be valid when vectorized. This isn't necessarily a problem with case_when() or if_else(), since R similarly produces an error if one of the underlying conditions or values produces an error.
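
One way to satisfy that requirement is to make the value expression return a fallback instead of throwing, so it is valid for every element. A minimal plain-Julia sketch, where match_or_default is a hypothetical helper and not part of Tidier.jl:

```julia
# Hypothetical helper: return the matched text, or a default when there
# is no match, so the expression never throws.
function match_or_default(pattern, x, default)
    m = match(pattern, x)
    return m === nothing ? default : String(m.match)
end

xs = ["alpha", "beta", "charlie", "delta", "echo"]
pattern = r"a.*a"

has_match = [contains(x, pattern) for x in xs]
values = [match_or_default(pattern, x, "_") for x in xs]

# Every argument is now valid for every element, so an eager,
# vectorized ifelse works without errors.
result = ifelse.(has_match, values, "_")
```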

In this case, getproperty(match(r"a.*a", x), :match) isn't valid when match(r"a.*a", x) returns nothing, which happens whenever x contains no match. That results in an error inside of if_else() or case_when(). There may be some value in creating a keyword argument or helper function that replaces errors with missing values (essentially a vectorized try/catch), but it's not something we are ready to work on yet.
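
To illustrate what such a helper might look like, here is a rough plain-Julia sketch. The name try_or_missing is hypothetical; nothing like it is claimed to exist in Tidier.jl.

```julia
# Hypothetical helper: call f(x) and return missing if it throws.
function try_or_missing(f, x)
    try
        f(x)
    catch
        missing
    end
end

xs = ["alpha", "beta", "charlie", "delta", "echo"]

# Throws for any string without a match, because match(...) returns nothing.
extract(x) = match(r"a.*a", x).match

# Broadcasting treats the function argument as a scalar, so the
# try/catch is applied once per element.
extracted = try_or_missing.(extract, xs)
```

Catching every exception this way can hide genuine bugs, so a real implementation would probably want to restrict which errors are converted to missing.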

Ternary operators work fine in Tidier, but in order to vectorize them you have to place them inside an array comprehension (or wrap them in a function). In the future, we may consider wrapping ternary operators inside an array comprehension automatically, but this is very tricky to implement correctly and safely.

Both of these examples work:

tbl = DataFrame(x = ["alpha", "beta", "charlie", "delta", "echo"])
helper(x) = contains(x, r"a.*a") ? getproperty(match(r"a.*a", x), :match) : "_"

@chain tbl begin
  @mutate(count_a = helper(x))
end

@chain tbl begin
  @mutate(count_a = [contains(x, r"a.*a") ? getproperty(match(r"a.*a", x), :match) : "_" for x in x])
end

I'm going to re-close the issue but feel free to reply if you have thoughts. Appreciate your using the package.