Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 985 forks source link

Shouldn't fcase() recycle? #4258

Closed sindribaldur closed 3 months ago

sindribaldur commented 4 years ago

Would be nice if either of these would work

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  TRUE, as.character(iris$Sepal.Length)
)
# Results in an error

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  default = as.character(iris$Sepal.Length) # Length of 'default' must be 1.
)

My current solution (maybe I am missing some neater solution):

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  rep(TRUE, nrow(iris)), as.character(iris$Sepal.Length)
)

dplyr::case_when(
  iris$Sepal.Length > 5 ~ ">5",
  iris$Sepal.Length < 4 ~ "<4",
  TRUE ~  as.character(iris$Sepal.Length)
)
ColeMiller1 commented 4 years ago

Just a few SQL like alternatives:

## the uneven argument is the ELSE somewhat similar to Oracle DECODE()
fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  as.character(iris$Sepal.Length)
)

## new special symbol similar to at the end of an actual case when statement
fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  .ELSE, as.character(iris$Sepal.Length)
)

When I first started using dplyr from a SQL background, I kept finding it surprising the else was implemented as TRUE.

2005m commented 4 years ago

I can understand the need for a default vector, however in the example above i would have written the code as follows:

x = iris$Sepal.Length
fcase(
  x > 5, ">5",
  x < 4, "<4",
  x <= 5, as.character(x)
)

That avoids the overhead that is currently being implemented in the PR and mentionned by @jangorecki

2005m commented 4 years ago

It also raises the question. Do we want the same behaviour in fifelse ?

sindribaldur commented 4 years ago

@2005m my example was bad, the default or .ELSE condition can't always be stated so easily.

2005m commented 4 years ago

A good example would be great.

shrektan commented 4 years ago

I've already made the implementations that supports scalar condition and lazy-eval defaults in the PR above. Please take a look there.

sindribaldur commented 4 years ago

Here is a different example. fcase() by no means necessary, but makes for concise and clean-looking code.

DTmtcars[, rn := fcase(
                   rn %like% "^Merc", sub("^Merc", "Mercedes", rn),
                   rn == "Toyota Corona", paste0(rn, "s"),
                   ...
                   rep(TRUE, length(rn)), rn # else just the original vector
                 )]
knapply commented 4 years ago

Here's a simple example that may be helpful.

I have a vector of country names...

countries <- c("USA", "Britain", "Russian Federation", "Trinidad-Tobago", 
               "Bahamas", "Congo", "UAE", "Sao Tome", "Timor-Leste",
               "Canada", "Mexico")

... that I'd like to standardize for downstream tasks.

Let's say that I need to modify everything except for Canada and Mexico. Using dplyr::case_when(), I would do so like this...

dplyr::case_when(
  countries == "USA" ~ "United States of America",
  countries == "Britain" ~ "United Kingdom",
  countries == "Russian Federation" ~ "Russia",
  countries == "Trinidad-Tobago" ~ "Trinidad and Tobago",
  countries == "Bahamas" ~ "The Bahamas",
  countries == "Timor-Leste" ~ "East Timor",
  countries == "UAE" ~ "United Arab Emirates",
  countries == "Congo" ~ "Democratic Republic of the Congo",
  countries == "Sao Tome" ~ "Sao Tome and Principe",
  TRUE ~ countries
)
#>  [1] "United States of America"         "United Kingdom"                  
#>  [3] "Russia"                           "Trinidad and Tobago"             
#>  [5] "The Bahamas"                      "Democratic Republic of the Congo"
#>  [7] "United Arab Emirates"             "Sao Tome and Principe"           
#>  [9] "East Timor"                       "Canada"                          
#> [11] "Mexico"

but data.table::fcase() seems to require finding the values you want to leave as-is separately.

conditions <- c("USA", "Britain", "Russian Federation", "Trinidad-Tobago", 
                "Bahamas", "Congo", "UAE", "Sao Tome", "Timor-Leste")

(dont_modify <- !countries %in% conditions)
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

data.table::fcase(
  countries == "USA", "United States of America",
  countries == "Britain", "United Kingdom",
  countries == "Russian Federation", "Russia",
  countries == "Trinidad-Tobago", "Trinidad and Tobago",
  countries == "Bahamas", "The Bahamas",
  countries == "Timor-Leste", "East Timor",
  countries == "UAE", "United Arab Emirates",
  countries == "Congo", "Democratic Republic of the Congo",
  countries == "Sao Tome", "Sao Tome and Principe",
  dont_modify, countries
)
#>  [1] "United States of America"         "United Kingdom"                  
#>  [3] "Russia"                           "Trinidad and Tobago"             
#>  [5] "The Bahamas"                      "Democratic Republic of the Congo"
#>  [7] "United Arab Emirates"             "Sao Tome and Principe"           
#>  [9] "East Timor"                       "Canada"                          
#> [11] "Mexico"
installed.packages()["data.table", "Version"]
#> [1] "1.12.9"
jangorecki commented 4 years ago

@knapply example you presented is perfect example where lookup table should be prefered, it is much easier to maintain and cleaner to use. Then you do countries[countries_new_name, name := i.new_name, on="name"].

knapply commented 4 years ago

I think I must've oversimplified, so here's an expanded example that sticks with the country theme...

library(data.table)

countries <- data.table(
  name = c("Czech Republic", "Czecho-Slovakia", "Mexico", "Czech Republic", 
           "Canada", "Czechoslovakia", "USA", "Britain"),
  year = c(1918, 1990:1996)
); countries
#>               name year
#> 1:  Czech Republic 1918
#> 2: Czecho-Slovakia 1990
#> 3:          Mexico 1991
#> 4:  Czech Republic 1992
#> 5:          Canada 1993
#> 6:  Czechoslovakia 1994
#> 7:             USA 1995
#> 8:         Britain 1996

is_czech_name <- function(x) {
  x %chin% c("Czechoslovak Republic", "Czechoslovakia",
             "Czech Republic", "Czecho-Slovakia")
}

@jangorecki this kind of pattern-matching flexibility is more what I'm getting at.

countries[, name := dplyr::case_when(
  is_czech_name(name) & year <= 1938 ~ "Czechoslovak Republic",
  is_czech_name(name) & year %between% c(1939, 1992) ~ "Czechoslovakia",
  is_czech_name(name) & year >= 1993 ~ "Czech Republic",
  name == "USA" ~ "United States of America",
  name == "Britain" ~ "United Kingdom",
  TRUE ~ name
)]
#>                        name year
#> 1:    Czechoslovak Republic 1918
#> 2:           Czechoslovakia 1990
#> 3:                   Mexico 1991
#> 4:           Czechoslovakia 1992
#> 5:                   Canada 1993
#> 6:           Czech Republic 1994
#> 7: United States of America 1995
#> 8:           United Kingdom 1996
2005m commented 4 years ago

@jangorecki , countries[countries_new_name, name := i.new_name, on="name"] is nice, but : -do you agree that fcase is much faster ? (there is actually a faster solution than fcase) -how do you deal with missing values? (i.e. the default argument or TRUE) ? -does your solution also work when the output values are vectors? Thanks.

jangorecki commented 4 years ago
2005m commented 4 years ago

Yes you are right for large vector [.data.table is around 30% faster than fcase. The fastest solution is a vectorised switch, which cut the time by half on a single thread. Odd for missing values I get NA, so it does not stay as it was.

2005m commented 4 years ago

In case you are interested: https://github.com/2005m/kit . The function is called vswitch. Still working on it.

2005m commented 4 years ago

vswitch now on CRAN along with my original implementation of fcase called nif which allows a vectorised default argument.

2005m commented 4 years ago

@jangorecki , what about adding vswitch to your "join" h2o benchmark?

jangorecki commented 4 years ago

@2005m could you provide syntax for joins of current q1-q5 questions wth vswhitch?

2005m commented 4 years ago

@jangorecki , indeed I can't answer all these questions with vswitch.

iago-pssjd commented 2 years ago

Is there any update on this? Also, the solution proposed in first comment

fcase(
  iris$Sepal.Length > 5, ">5",
  iris$Sepal.Length < 4, "<4",
  rep(TRUE, nrow(iris)), as.character(iris$Sepal.Length)
)

does not work to me since I am using fcase inside [.data.table and by = .() group with different group sizes, so I obtain

Argument #9 has a different length than argument #1. Please make sure all logical conditions have the same length.

Instead, it works replacing rep(TRUE, nrow(iris)) with rep(TRUE, .N)

davidbudzynski commented 1 year ago

Has there been any progress made on this issue? I can confirm that @2005m's nif works exactly as expected.

Fablepongiste commented 1 year ago

This should really be looked into, it is a classical issue and for once dplyr way with case_when so much nicer.