Pattern matching for character columns compared to Factor columns

statquant commented 4 years ago

Hello, I would like to know what is the appropriate way of matching patterns in Character columns. In the vignettes it is clearly explained that Character columns should be preferred over Factor columns. I know of the like function in the package but its performance seems fairly low in the bellow problem where I try to detect a pattern in a Character column and a Factor column. Is there a recommended way to do pattern matching on character columns in data.table

library(data.table)                                                                                            
library(stringr)                                                                                               
library(microbenchmark)

set.seed(1L)                                                                                                   
n_disctinct <- 100L                                                                                            
n_letters <- 5L                                                                                                
x <- replicate(n_disctinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))                  
n_length <- 1e6L                                                                                               
x <- sample(x = x, size = n_length, replace = TRUE)                                                            
f <- factor(x)                                                                                                 
DT <- data.table(f = f, x = x)                                                                                 

fct_detect <- function(x, ...) {                                                                               
    stringr::str_detect(levels(x), ...)[x]                                                                     
}                                                                                                              

microbenchmark(                                                                                                
    DT1 <- DT[x %like% "SSA"],                 # the like function on Character                                                                
    DT2 <- DT[fct_detect(f, "SSA")],            # naive detect function on Factor using str_detect on the levels                                                               
    DT3 <- DT[f %like% "SSA"],                  # the like function on Factor                                                                
    DT4 <- DT[x %chin% "SSATD"],                                                                               
    DT5 <- DT[x == "SSATD"],                                                                                   
    DT6 <- DT[f == "SSATD"],                                                                                   
    DT7 <- DT[(levels(f) == "SSATD")[f]],                                                                      
    times = 50L                                                                                                
    ) -> res                                                                                                   
res                                                                                                            
# Unit: milliseconds                                                                                           
#                                  expr        min         lq       mean     median         uq        max neval
#             DT1 <- DT[x %like% "SSA"] 137.224022 139.326541 153.066482 141.602557 148.522236 237.255871    50
#       DT2 <- DT[fct_detect(f, "SSA")]   4.593743   4.896224   5.717059   5.080999   5.346516   9.457377    50
#             DT3 <- DT[f %like% "SSA"]  19.556630  21.070560  24.242968  21.944684  29.048203  32.234688    50
#           DT4 <- DT[x %chin% "SSATD"]   1.477981   1.629233   2.060127   1.709921   2.850643   3.092975    50
#               DT5 <- DT[x == "SSATD"]   1.480074   1.654964   2.023678   1.710184   2.813284   3.068752    50
#               DT6 <- DT[f == "SSATD"]   4.523973   4.854614   5.994524   5.023527   6.237378   9.392769    50
#  DT7 <- DT[(levels(f) == "SSATD")[f]]   4.462589   4.716833   5.786381   5.031112   5.205085   9.351510    50

sapply(list(DT2, DT3), FUN = identical, DT1)       
# [1] TRUE TRUE                                    
sapply(list(DT5, DT6, DT7), FUN = identical, DT4)  
# [1] TRUE TRUE TRUE

shrektan commented 4 years ago

For problems with lots of duplicated strings, matching regular expression on the unique levels then mapping back is much faster, since the number of unique levels if fairly small. Under such circumstances, I suggest to use the factor column.

(In fact, converting characters to factors takes time. I mean DT2 <- DT[fct_detect(factor(x), "SSA")] may cost significantly more time to run)

In addition, the reason of DT2 being much faster than DT3 is because your implementation is faster by subsetting the result directly, saving the time from coercing the factor values into integer values + matching again.

https://github.com/Rdatatable/data.table/blob/7bf5748d725d3a7d74da7c6b5137bd0ea87a17bd/R/like.R#L5-L7

From ?[ :

The index object i can be numeric, logical, character or empty. Indexing by factors is allowed and is equivalent to indexing by the numeric codes (see factor) and not by the character values which are printed (for which use [as.character(i)]).

I think we should improve this case based on your findings.

statquant commented 4 years ago

Hello, it seems that you beat me to the PR :\

Thanks for your answer, can I digress a bit on this please ?

Let's run the same test as above but without the data.table subseting

set.seed(1L)                                                                                               
n_disctinct <- 100L                                                                                        
n_letters <- 5L                                                                                            
x <- replicate(n_disctinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))              
n_length <- 1e6L                                                                                           
x <- sample(x = x, size = n_length, replace = TRUE)                                                        
f <- factor(x)                                                                                             
microbenchmark(                                                                                            
    DT1 <- x %like% "SSA",                                                                                 
    DT2 <- fct_detect(f, "SSA"),                                                                           
    DT3 <- f %like% "SSA",                                                                                 
    DT4 <- x %chin% "SSATD",                                                                               
    DT5 <- x == "SSATD",                                                                                   
    DT6 <- f == "SSATD",                                                                                   
    DT7 <- (levels(f) == "SSATD")[f],                                                                      
    times = 50L                                                                                            
    ) -> res                                                                                               
res                                                                                                        
# Unit: milliseconds                                                                                       
#                              expr        min         lq       mean     median         uq        max neval
#             DT1 <- x %like% "SSA" 134.184019 134.799904 138.733458 135.376793 140.121385 164.333579    50
#       DT2 <- fct_detect(f, "SSA")   2.530200   2.609741   2.779877   2.679097   2.853188   4.179983    50
#             DT3 <- f %like% "SSA"  17.277338  17.505502  18.561721  17.905504  18.664402  25.701312    50
#           DT4 <- x %chin% "SSATD"   8.996969   9.081639   9.731109   9.418635   9.816365  15.183587    50
#               DT5 <- x == "SSATD"   4.909027   4.945915   5.174743   5.065182   5.291422   6.290251    50
#               DT6 <- f == "SSATD"   2.469650   2.542999   2.640223   2.569298   2.665493   3.129144    50
#  DT7 <- (levels(f) == "SSATD")[f]   2.443671   2.502199   2.569179   2.533719   2.600884   3.097417    50

You can see that

matching (equality) on Factor should be twice as fast than on Character however when you add the data.table "subsetting" Character becomes faster (and even better it is actually faster than the Character matching itself). Looking at htop I see that there seem to be multithreading at play.
Am I right ? If yes
Why isn't there the same thing implemented for Factors ?

In my work we typically have tables of a few 10s millions rows and characters are usually "symbols (ie strings)" with low cardinality (100s unique). Industry standard is to use some factor-like objects. I red in different places (typically this well know SO q) that I should prefer Character. I feel like for pattern matching, data.table is at a loss using strings vs factors, and somehow factors seems less optimized and I do not feel I can drop character for factors.

MichaelChirico commented 4 years ago

In the vignettes it is clearly explained that Character columns should be preferred over Factor columns.

Could you point to this? I would want to reword this as I don't think this is the case. Especially with all the internal parallelism in data.table that doesn't work on character columns. Don't think we can give any arbitrary advice but in general low-cardinality strings should be kept as factors IMO.

statquant commented 4 years ago

That's not on a vignette, that's on ?setkey

Despite its name, base::sort.list(x,method="radix") actually invokes a counting sort in R, not a radix sort. See do_radixsort in src/main/sort.c. A counting sort, however, is particularly suitable for sorting integers and factors, and we like it. In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R's internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).

I see you actually amended this SO post

I'd like to have a clearer picture of performance for factor vs character for cases where cardinality is small (which would be typical in say finance), is there such a document somewhere ? if not I can create a Rmd testing for J(), ==, pattern matching, merging if you guys think it could be useful (in that case I can commit the document somewhere).

MichaelChirico commented 4 years ago

I do think that would be quite useful. I'm not sure if vignette is the right medium for it (it would be our first such vignette) -- would a blog post be more appropriate?

statquant commented 1 year ago

Coming by chance from the NEWS.md 1.14.7 file where this is still referenced as a new feature in development it looks like this has been merged for quite some time. I hope it is fine for me to close it.

statquant commented 1 year ago

Actually I am confused as I checked https://github.com/Rdatatable/data.table/pull/4750/files#diff-88384aa2504222157297b908ad7dc25684c172e11a9da29bcf848283e05290d6 and although I understand the PR has been merged (a long time ago) it does not seem to be in the latest cran version (I checked like.R in the package source available on cran). Can I ask a genuine question (please don't take it the wrong way, I am just trying to understand how things work and although I checked the PR I'm sure I understand the workflow): why is this change that seem to have been accepted a long time ago has not been released as part of a stable (I mean on cran) version ?

shrektan commented 1 year ago

This is something that confused me as well. Some fixes that have been merged for more than 1 year are not in CRAN yet, as stated in https://github.com/Rdatatable/data.table/issues/5538#issuecomment-1330088746 .

Rdatatable / data.table

Pattern matching for character columns compared to Factor columns #4748