statquant opened 4 years ago
For problems with lots of duplicated strings, matching a regular expression on the unique levels and then mapping back is much faster, since the number of unique levels is fairly small. Under such circumstances, I suggest using the factor column.
(In fact, converting characters to factors takes time; that is, DT2 <- DT[fct_detect(factor(x), "SSA")] may cost significantly more time to run.)
In addition, the reason DT2 is much faster than DT3 is that your implementation subsets the result directly, saving the time spent coercing the factor values into integer values and matching again.
From ?`[`:
The index object i can be numeric, logical, character or empty. Indexing by factors is allowed and is equivalent to indexing by the numeric codes (see factor) and not by the character values which are printed (for which use [as.character(i)]).
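`fct_detect` is not a base R or data.table function; a minimal sketch of the idea being discussed (run the regex once over the unique levels, then fan the result back out through the factor's integer codes, as the ?`[` quote above describes) could look like this:

```r
# Hypothetical helper: detect a regex pattern in a factor by matching
# only the unique levels, then indexing by the factor's integer codes.
fct_detect <- function(f, pattern) {
  grepl(pattern, levels(f))[f]  # indexing by a factor uses its codes
}

f <- factor(c("SSATD", "ABCDE", "XSSAY", "ABCDE"))
fct_detect(f, "SSA")
# [1]  TRUE FALSE  TRUE FALSE
```

The regex cost here scales with the number of levels, not the length of the vector, which is why it wins when cardinality is low.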
I think we should improve this case based on your findings.
Hello, it seems that you beat me to the PR :\
Thanks for your answer. Can I digress a bit on this, please?
Let's run the same test as above but without the data.table subsetting:
library(data.table)        # for %like% and %chin%
library(microbenchmark)

set.seed(1L)
n_distinct <- 100L
n_letters <- 5L
x <- replicate(n_distinct, paste(sample(LETTERS, n_letters, replace = TRUE), collapse = ""))
n_length <- 1e6L
x <- sample(x = x, size = n_length, replace = TRUE)
f <- factor(x)
microbenchmark(
  DT1 <- x %like% "SSA",
  DT2 <- fct_detect(f, "SSA"),  # fct_detect(): match on levels, map back via codes
  DT3 <- f %like% "SSA",
  DT4 <- x %chin% "SSATD",
  DT5 <- x == "SSATD",
  DT6 <- f == "SSATD",
  DT7 <- (levels(f) == "SSATD")[f],
  times = 50L
) -> res
res
res
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT1 <- x %like% "SSA" 134.184019 134.799904 138.733458 135.376793 140.121385 164.333579 50
# DT2 <- fct_detect(f, "SSA") 2.530200 2.609741 2.779877 2.679097 2.853188 4.179983 50
# DT3 <- f %like% "SSA" 17.277338 17.505502 18.561721 17.905504 18.664402 25.701312 50
# DT4 <- x %chin% "SSATD" 8.996969 9.081639 9.731109 9.418635 9.816365 15.183587 50
# DT5 <- x == "SSATD" 4.909027 4.945915 5.174743 5.065182 5.291422 6.290251 50
# DT6 <- f == "SSATD" 2.469650 2.542999 2.640223 2.569298 2.665493 3.129144 50
# DT7 <- (levels(f) == "SSATD")[f] 2.443671 2.502199 2.569179 2.533719 2.600884 3.097417 50
You can see that the factor-based approaches (DT2, DT3, DT6, DT7) are consistently faster than their character-based equivalents (DT1, DT4, DT5).
In my work we typically have tables of a few tens of millions of rows, and the character columns are usually symbols (i.e. strings) with low cardinality (hundreds of unique values). The industry standard is to use some factor-like object. I read in different places (typically this well-known SO question) that I should prefer character. I feel that for pattern matching data.table is at a loss using strings vs factors, yet factors somehow seem less optimized overall, so I do not feel I can drop character for factors.
In the vignettes it is clearly explained that Character columns should be preferred over Factor columns.
Could you point to this? I would want to reword it, as I don't think this is the case, especially with all the internal parallelism in data.table that doesn't work on character columns. I don't think we can give blanket advice, but in general low-cardinality strings should be kept as factors IMO.
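To make that concrete (a sketch of the idea, not official data.table guidance, with made-up symbol names), keeping a low-cardinality symbol column as a factor means a pattern-based subset only has to run the regex over the handful of levels:

```r
library(data.table)

set.seed(1L)
# a low-cardinality "symbol" column, stored as a factor
DT <- data.table(sym = factor(sample(c("SSATD", "ABCDE", "XSSAY"),
                                     1e6L, replace = TRUE)),
                 px  = runif(1e6L))

# grepl() runs over only 3 levels; the factor codes do the row fan-out
DT2 <- DT[grepl("SSA", levels(sym))[sym]]
```

With a plain character column the regex would instead be evaluated against all one million strings.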
That's not in a vignette, that's in ?setkey:
Despite its name, base::sort.list(x,method="radix") actually invokes a counting sort in R, not a radix sort. See do_radixsort in src/main/sort.c. A counting sort, however, is particularly suitable for sorting integers and factors, and we like it. In fact we like it so much that data.table contains a counting sort algorithm for character vectors using R's internal global string cache. This is particularly fast for character vectors containing many duplicates, such as grouped data in a key column. This means that character is often preferred to factor. Factors are still fully supported, in particular ordered factors (where the levels are not in alphabetic order).
I see you actually amended this SO post
I'd like to have a clearer picture of factor vs character performance for cases where cardinality is small (which would be typical in, say, finance). Is there such a document somewhere? If not, I can create an Rmd testing J(), ==, pattern matching, and merging, if you think it could be useful (in that case I can commit the document somewhere).
I do think that would be quite useful. I'm not sure a vignette is the right medium for it (it would be our first such vignette) -- would a blog post be more appropriate?
Coming by chance from the NEWS.md 1.14.7 file, where this is still referenced as a new feature in development, it looks like this has been merged for quite some time. I hope it is fine for me to close it.
Actually I am confused: I checked https://github.com/Rdatatable/data.table/pull/4750/files#diff-88384aa2504222157297b908ad7dc25684c172e11a9da29bcf848283e05290d6 and although I understand the PR was merged (a long time ago), it does not seem to be in the latest CRAN version (I checked like.R in the package source available on CRAN). Can I ask a genuine question (please don't take it the wrong way, I am just trying to understand how things work, and although I checked the PR I'm not sure I understand the workflow): why has this change, which seems to have been accepted a long time ago, not been released as part of a stable (i.e. CRAN) version?
This is something that confused me as well. Some fixes that were merged more than a year ago are still not on CRAN, as stated in https://github.com/Rdatatable/data.table/issues/5538#issuecomment-1330088746.
Hello, I would like to know the appropriate way of matching patterns in character columns. In the vignettes it is clearly explained that character columns should be preferred over factor columns. I know of the like() function in the package, but its performance seems fairly low in the problem below, where I try to detect a pattern in a character column and in a factor column. Is there a recommended way to do pattern matching on character columns in data.table?