Closed sindribaldur closed 3 months ago
Just a few SQL like alternatives:
## the uneven argument is the ELSE somewhat similar to Oracle DECODE()
fcase(
iris$Sepal.Length > 5, ">5",
iris$Sepal.Length < 4, "<4",
as.character(iris$Sepal.Length)
)
## new special symbol similar to at the end of an actual case when statement
fcase(
iris$Sepal.Length > 5, ">5",
iris$Sepal.Length < 4, "<4",
.ELSE, as.character(iris$Sepal.Length)
)
When I first started using dplyr from a SQL background, I kept finding it surprising the else was implemented as TRUE.
I can understand the need for a default vector, however in the example above i would have written the code as follows:
x = iris$Sepal.Length
fcase(
x > 5, ">5",
x < 4, "<4",
x <= 5, as.character(x)
)
That avoids the overhead that is currently being implemented in the PR and mentionned by @jangorecki
It also raises the question. Do we want the same behaviour in fifelse
?
@2005m my example was bad, the default or .ELSE
condition can't always be stated so easily.
A good example would be great.
I've already made the implementations that supports scalar condition and lazy-eval defaults in the PR above. Please take a look there.
Here is a different example. fcase()
by no means necessary, but makes for concise and clean-looking code.
DTmtcars[, rn := fcase(
rn %like% "^Merc", sub("^Merc", "Mercedes", rn),
rn == "Toyota Corona", paste0(rn, "s"),
...
rep(TRUE, length(rn)), rn # else just the original vector
)]
Here's a simple example that may be helpful.
I have a vector of country names...
countries <- c("USA", "Britain", "Russian Federation", "Trinidad-Tobago",
"Bahamas", "Congo", "UAE", "Sao Tome", "Timor-Leste",
"Canada", "Mexico")
... that I'd like to standardize for downstream tasks.
Let's say that I need to modify everything except for Canada and Mexico. Using dplyr::case_when()
, I would do so like this...
dplyr::case_when(
countries == "USA" ~ "United States of America",
countries == "Britain" ~ "United Kingdom",
countries == "Russian Federation" ~ "Russia",
countries == "Trinidad-Tobago" ~ "Trinidad and Tobago",
countries == "Bahamas" ~ "The Bahamas",
countries == "Timor-Leste" ~ "East Timor",
countries == "UAE" ~ "United Arab Emirates",
countries == "Congo" ~ "Democratic Republic of the Congo",
countries == "Sao Tome" ~ "Sao Tome and Principe",
TRUE ~ countries
)
#> [1] "United States of America" "United Kingdom"
#> [3] "Russia" "Trinidad and Tobago"
#> [5] "The Bahamas" "Democratic Republic of the Congo"
#> [7] "United Arab Emirates" "Sao Tome and Principe"
#> [9] "East Timor" "Canada"
#> [11] "Mexico"
but data.table::fcase()
seems to require finding the values you want to leave as-is separately.
conditions <- c("USA", "Britain", "Russian Federation", "Trinidad-Tobago",
"Bahamas", "Congo", "UAE", "Sao Tome", "Timor-Leste")
(dont_modify <- !countries %in% conditions)
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
data.table::fcase(
countries == "USA", "United States of America",
countries == "Britain", "United Kingdom",
countries == "Russian Federation", "Russia",
countries == "Trinidad-Tobago", "Trinidad and Tobago",
countries == "Bahamas", "The Bahamas",
countries == "Timor-Leste", "East Timor",
countries == "UAE", "United Arab Emirates",
countries == "Congo", "Democratic Republic of the Congo",
countries == "Sao Tome", "Sao Tome and Principe",
dont_modify, countries
)
#> [1] "United States of America" "United Kingdom"
#> [3] "Russia" "Trinidad and Tobago"
#> [5] "The Bahamas" "Democratic Republic of the Congo"
#> [7] "United Arab Emirates" "Sao Tome and Principe"
#> [9] "East Timor" "Canada"
#> [11] "Mexico"
installed.packages()["data.table", "Version"]
#> [1] "1.12.9"
@knapply example you presented is perfect example where lookup table should be prefered, it is much easier to maintain and cleaner to use. Then you do countries[countries_new_name, name := i.new_name, on="name"]
.
I think I must've oversimplified, so here's an expanded example that sticks with the country theme...
library(data.table)
countries <- data.table(
name = c("Czech Republic", "Czecho-Slovakia", "Mexico", "Czech Republic",
"Canada", "Czechoslovakia", "USA", "Britain"),
year = c(1918, 1990:1996)
); countries
#> name year
#> 1: Czech Republic 1918
#> 2: Czecho-Slovakia 1990
#> 3: Mexico 1991
#> 4: Czech Republic 1992
#> 5: Canada 1993
#> 6: Czechoslovakia 1994
#> 7: USA 1995
#> 8: Britain 1996
is_czech_name <- function(x) {
x %chin% c("Czechoslovak Republic", "Czechoslovakia",
"Czech Republic", "Czecho-Slovakia")
}
@jangorecki this kind of pattern-matching flexibility is more what I'm getting at.
countries[, name := dplyr::case_when(
is_czech_name(name) & year <= 1938 ~ "Czechoslovak Republic",
is_czech_name(name) & year %between% c(1939, 1992) ~ "Czechoslovakia",
is_czech_name(name) & year >= 1993 ~ "Czech Republic",
name == "USA" ~ "United States of America",
name == "Britain" ~ "United Kingdom",
TRUE ~ name
)]
#> name year
#> 1: Czechoslovak Republic 1918
#> 2: Czechoslovakia 1990
#> 3: Mexico 1991
#> 4: Czechoslovakia 1992
#> 5: Canada 1993
#> 6: Czech Republic 1994
#> 7: United States of America 1995
#> 8: United Kingdom 1996
@jangorecki , countries[countries_new_name, name := i.new_name, on="name"]
is nice, but :
-do you agree that fcase
is much faster ? (there is actually a faster solution than fcase
)
-how do you deal with missing values? (i.e. the default argument or TRUE
) ?
-does your solution also work when the output values are vectors?
Thanks.
[.data.table
call. Update on join will use binary merge to match new values to rows, so it will scale well. fcase is lazy so can potentially avoid a lot of computation. There are many factors, but I think scaling up will be in favor of update-on-join.name
is not being updated at all, so stays as it was.counties_new_name
needs to be a list/data.tableYes you are right for large vector [.data.table
is around 30% faster than fcase
.
The fastest solution is a vectorised switch, which cut the time by half on a single thread.
Odd for missing values I get NA
, so it does not stay as it was.
In case you are interested: https://github.com/2005m/kit . The function is called vswitch
. Still working on it.
vswitch
now on CRAN along with my original implementation of fcase
called nif
which allows a vectorised default
argument.
@jangorecki , what about adding vswitch
to your "join" h2o benchmark?
@2005m could you provide syntax for joins of current q1-q5 questions wth vswhitch
?
@jangorecki , indeed I can't answer all these questions with vswitch
.
Is there any update on this? Also, the solution proposed in first comment
fcase(
iris$Sepal.Length > 5, ">5",
iris$Sepal.Length < 4, "<4",
rep(TRUE, nrow(iris)), as.character(iris$Sepal.Length)
)
does not work to me since I am using fcase
inside [.data.table
and by = .()
group with different group sizes, so I obtain
Argument #9 has a different length than argument #1. Please make sure all logical conditions have the same length.
Instead, it works replacing rep(TRUE, nrow(iris))
with rep(TRUE, .N)
Has there been any progress made on this issue? I can confirm that @2005m's nif
works exactly as expected.
This should really be looked into, it is a classical issue and for once dplyr way with case_when so much nicer.
Would be nice if either of these would work
My current solution (maybe I am missing some neater solution):