Thanks @gisler ! I'll have time over the weekend to review in detail. Please don't think I'm ignoring this...
No worries, we all have our duties, but I appreciate your message nonetheless.
Thanks for this @gisler, and thanks for writing clear tests as well!
I noticed that you are using if/else statements for implementing the caching strategy. This works if we are never going to support caching formats other than qs, which might not be the case. What do you think about implementing a strategy pattern with a little bit of metaprogramming? What I mean is a pattern like the one used for loading different data types: a simple list with the cache type prefix followed by a function that implements the caching strategy.
For example, in load.project.R, lines 228-232 could use a data structure like:
```r
cache.strategies <- list(
  "qs" = "assign(variable, qs::qread(cache.file), envir = .TargetEnv)",
  "R" = "load(cache.file, envir = .TargetEnv)"
)
```

Then the code could read:

```r
eval(parse(text = cache.strategies[[config$cache_file_format]]))
```
Of course, I mean this as pseudo-code and it is not guaranteed to work, etc.
What do you think?
Hi, you're welcome, glad you like it, @KentonWhite. And yes, I did include tests, although I must say that I more or less only copied and adapted existing tests.
Anyway, while I very much like metaprogramming (I use it quite often in my DTSg package), I don't think it is necessary here. One could either use an `if ... else if ... else` statement to include more formats in the future or, probably more elegantly, the rather unknown `switch()` function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/switch.html
It achieves about what you suggest, but without parsing strings. If you like, I can change the statements so that they make use of this function for future reference.
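For illustration, a `switch()`-based dispatch might look roughly like this (a sketch only; the function name is hypothetical, and `variable` and `cache.file` are assumed to be provided by the caller as in `load.project.R`):

```r
# Sketch of switch()-based dispatch on the cache file format. The function
# name is hypothetical; "variable" and "cache.file" mirror the objects used
# in ProjectTemplate's loading code.
load.cached.variable <- function(format, variable, cache.file, envir) {
  switch(
    format,
    "qs" = assign(variable, qs::qread(cache.file), envir = envir),
    "RData" = load(cache.file, envir = envir),
    stop("Unknown cache file format: ", format)
  )
}
```

An unmatched format falls through to the final unnamed argument, which here raises an error, so typos in the config fail loudly instead of silently.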
@gisler The reason I was thinking about metaprogramming was eliminating the need for someone to find all of the places where a cache file format is used. With metaprogramming, adding a new format is as easy as extending the table with a new function.
I didn't look for all of the places where there is a switch. If there is only one place in the code where this switch is made, then the current way is fine. If there are multiple places, metaprogramming might make maintenance easier.
Is there only the one place where a code switch is needed?
Ah, I see. Well, there is one for saving, one for loading and one defining the file extensions. If the config value equals the file extension (which it actually does as it is now), we can easily erase the need to intervene there. This is a good idea of yours.
Erasing the rest is a bit more complicated though, as there are two expressions involved and, for future formats, usually also the requirement to load a package. It should be possible nonetheless using `expression()`. I'll look into it next week.
Hi @KentonWhite,
I committed a rough implementation of your idea (at least I hope so) yesterday. It still requires some polishing and explanatory comments though: https://github.com/KentonWhite/ProjectTemplate/blob/88343b8a87f51c98e20f37371af095e04c070844/R/cache.R#L171-L184
So with this, cache file formats can in the future be specified by a named list of named expressions. One just has to keep in mind to add potentially required packages to the suggested packages list. One drawback, though, is that the expressions used for saving and loading depend on objects living in the respective functions only. This requires careful explanation.
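Purely for illustration, such a registry could be sketched like this (hypothetical names; the actual code is in `R/cache.R`). The expressions deliberately reference `variable`, `cache.file` and `.TargetEnv`, which must exist in the frame where they are evaluated:

```r
# Hypothetical sketch of a cache format registry as a named list of named
# expressions. The save/load expressions rely on objects ("variable",
# "cache.file", ".TargetEnv") existing where they are evaluated.
cache.formats <- list(
  RData = list(
    save = expression(save(list = variable, file = cache.file, envir = .TargetEnv)),
    load = expression(load(cache.file, envir = .TargetEnv))
  ),
  qs = list(
    save = expression(qs::qsave(get(variable, envir = .TargetEnv), cache.file)),
    load = expression(assign(variable, qs::qread(cache.file), envir = .TargetEnv))
  )
)

# A caching function would then evaluate the matching expression locally, e.g.:
# eval(cache.formats[[config$cache_file_format]]$load)
```

Adding a new format then only means adding a new list entry (plus its package to the suggested packages).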
Thanks @gisler! I'll take a more detailed look over the weekend.
Perfect, I finished the polishing and explanations in the meantime.
Sorry, had to "kick" in a final commit to this PR. The code should be "cleaner" now.
@KentonWhite Oh, before I forget it and since you are preparing a release: https://github.com/KentonWhite/ProjectTemplate/blob/f0f07ec386c70674647b55773cf0974905dd7603/R/project.config.R is missing `underscore_variables` in its documentation.
I don't want to include a fix in this PR and it's kind of not worth opening a separate issue or PR for it, but I can do so in case you prefer.
Thanks @gisler, looks good! One question and one comment before I merge:
Have you checked that `migrate.project()` works as expected? It should, and it is always nice to confirm 👍
Since I needed to bump the version to 0.9.3 to fix an error with the new R release, can you change the version to 0.10.0, please?
Okay, no problem, @KentonWhite. I tested migrating the project and it works, although I wonder why it prints `FALSE`:
```
> migrate.project()
Migrating project configuration
[1] FALSE
Your existing project configuration in globals.dcf does not contain up to
date configuration settings in this version 0.10.0 of ProjectTemplate. They will
be added automatically during migration, but you should review afterward.
Your config.dcf file was missing entries and defaults
have been used. The missing entries are:
cache_file_format
```
And it made me add a `suppressWarnings()` to `.load.config()`, leading to cleaner messages for users.
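For illustration only (`.load.config()` is internal to ProjectTemplate; this sketch merely shows the wrapping idea using base R's `read.dcf()` as a stand-in config reader):

```r
# Hypothetical sketch: suppressing warnings from a DCF config reader so that
# users only see the curated migration messages instead of raw warnings.
# The function name is made up; ProjectTemplate's real reader differs.
load.config.quietly <- function(path) {
  suppressWarnings(read.dcf(path))
}
```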
By the way, I just did a few non-representative benchmarks on my machine:
```r
library(arrow)
library(qs)
library(microbenchmark)
library(data.table)

data(diamonds, package = "ggplot2")
dim(diamonds)
```

```
[1] 53940    10
```
```r
microbenchmark(
  write_parquet(diamonds, "test.parquet", version = "2.0"),
  write_feather(diamonds, "test.feather"),
  qsave(diamonds, "test.qs"),
  save(diamonds, file = "test.RData")
)
```

```
Unit: milliseconds
                                                     expr       min        lq      mean    median        uq       max neval
 write_parquet(diamonds, "test.parquet", version = "2.0")  15.31879  15.87638  18.95168  16.46482  18.46235  60.54349   100
                  write_feather(diamonds, "test.feather")  11.07797  12.84449  24.74501  22.12534  30.57151  84.52629   100
                               qsave(diamonds, "test.qs")  25.79260  27.25619  33.19539  27.97628  34.34300  98.84864   100
                      save(diamonds, file = "test.RData") 119.54659 120.99080 122.60779 122.17242 122.56934 190.21427   100
```
```r
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData")
)
```

```
Unit: milliseconds
                         expr      min       lq      mean   median       uq       max neval
 read_parquet("test.parquet") 3.683474 3.829881  5.178889 3.917530 4.064670  63.03108   100
 read_feather("test.feather") 3.230476 3.459291  4.808098 3.613256 3.767588  62.98305   100
             qread("test.qs") 3.571077 3.724800  9.885440 3.804038 4.074666 247.98589   100
           load("test.RData") 8.170787 8.422276 13.941885 8.486032 8.703510  72.46114   100
```
```r
n <- 1e6
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12)
)
```
```r
microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 12
)
```

```
Unit: milliseconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  699.6421  731.7620  746.7013  751.3195  765.9379  777.2242    12
                  write_feather(DT, "test.feather")  689.9112  719.1069  761.8496  770.7676  798.0216  818.9127    12
                               qsave(DT, "test.qs")  384.6411  392.1583  404.9736  398.8102  420.8910  429.9794    12
                      save(DT, file = "test.RData") 1612.9240 1634.8561 1653.2237 1656.8916 1665.7625 1699.2989    12
```
```r
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData"),
  times = 12
)
```

```
Unit: milliseconds
                         expr      min       lq     mean   median       uq      max neval
 read_parquet("test.parquet") 240.2281 245.9147 258.9374 247.9435 250.3795 324.1385    12
 read_feather("test.feather") 222.1977 235.2760 325.8992 303.2075 425.3964 487.9860    12
             qread("test.qs") 161.7858 167.5878 213.3876 170.3831 238.4205 426.0841    12
           load("test.RData") 398.0150 401.0278 433.4332 409.4679 475.0042 481.9437    12
```
I generally used default settings. File sizes are in the case of `DT`:
* Parquet: 25.9 MB
* Feather: 26.7 MB
* qs: 15.9 MB
* RData: 19.2 MB
Scaling `DT` to 1e7 rows looks like this with three repetitions for writing
```
Unit: seconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  6.813148  6.888617  6.963231  6.964086  7.038272  7.112459     3
                  write_feather(DT, "test.feather")  7.336708  7.373148  7.465723  7.409587  7.530231  7.650874     3
                               qsave(DT, "test.qs")  4.061056  4.065581  4.068474  4.070105  4.072182  4.074259     3
                      save(DT, file = "test.RData") 15.569769 15.626296 15.741904 15.682823 15.827971 15.973119     3
```
and like this for reading
```
Unit: seconds
                         expr      min       lq     mean   median       uq      max neval
 read_parquet("test.parquet") 2.521096 2.525684 2.548139 2.530273 2.561661 2.593049     3
 read_feather("test.feather") 2.329032 2.332010 2.342300 2.334988 2.348934 2.362880     3
             qread("test.qs") 1.609163 1.615720 1.838861 1.622277 1.953710 2.285143     3
           load("test.RData") 3.944234 3.982078 4.316626 4.019922 4.502822 4.985722     3
```
File sizes scaled approximately linearly.
Leaving one column out at a time changes the timing, but not the ranking, and adding a `POSIXct` column also doesn't change much:
```r
n <- 1e7
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12),
  timestamp = as.POSIXct(1:n, origin = "1970-01-01")
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 3
)
```

```
Unit: seconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  7.837889  8.659331  8.979386  9.480773  9.550135  9.619496     3
                  write_feather(DT, "test.feather")  8.493490  8.580802  8.629774  8.668114  8.697916  8.727718     3
                               qsave(DT, "test.qs")  4.232069  4.236450  4.296098  4.240831  4.328113  4.415395     3
                      save(DT, file = "test.RData") 21.064341 21.112579 21.158563 21.160817 21.205675 21.250533     3
```
Maybe it's different with less random data and non-defaults. I only tried out `rbind`ing the diamonds dataset 200 times resulting in more than ten million rows with ten variables:
```
Unit: seconds
                                                     expr       min        lq      mean    median        uq       max neval
 write_parquet(diamonds, "test.parquet", version = "2.0")  2.989435  3.051021  3.072574  3.112606  3.114143  3.115681     3
                  write_feather(diamonds, "test.feather")  5.718497  5.860217  6.133362  6.001938  6.340794  6.679651     3
                               qsave(diamonds, "test.qs")  1.916572  1.928476  1.961913  1.940380  1.984583  2.028787     3
                      save(diamonds, file = "test.RData") 24.133917 24.273026 24.345237 24.412135 24.450898 24.489660     3
```
Anyway, I think `qs` definitely is a very, very good cache file format. Especially since Parquet and Feather are limited to tabular data only as far as I can see.
```
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] microbenchmark_1.4-7 qs_0.23.4            data.table_1.13.0    arrow_2.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          digest_0.6.25       assertthat_0.2.1    R6_2.4.1            formatR_1.7         magrittr_1.5        RcppParallel_5.0.2
 [8] stringi_1.4.6       rlang_0.4.7         vctrs_0.3.2         stringfish_0.14.2   RApiSerialize_0.1.0 tools_4.0.2         bit64_4.0.2
[15] glue_1.4.1          purrr_0.3.4         bit_4.0.4           compiler_4.0.2      tidyselect_1.1.0
```
P.S. I know, it's strange to do such tests once the work is done and not before it, but then, I knew `qs` would beat `RData`. Beating Parquet never was the goal.
EDIT: P.P.S. Did some more experimenting: The read and write functions from the `feather` package outperform those from the `arrow` package and together with the `fst` format show about the same performance as `qs`. But this is a very rough estimate.
The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.
It looks good! Merging after tests have passed.
> The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.
Seems like the function prints whether the cache directory is missing, which was `FALSE` in my case, as it was already there.
And cool, I'm looking forward to making use of it in production. Hopefully, there are no bugs.
Proposed changes
As briefly discussed in https://github.com/KentonWhite/ProjectTemplate/issues/225, I implemented "qs" as a possible cache file format (https://CRAN.R-project.org/package=qs). The default stays the "RData" file format for now though.
Design considerations:
* The `cache` function doesn't require a loaded project to work. I tried to keep it like this. This, however, has the drawback that using the `cache` function without a loaded project might not consider possible config overrides.
* If `.RData` files already exist and the file format is changed to `qs`, new `.qs` files are created upon loading the project. This includes the `.hash` files. The following list shows the file extensions of the different cache file formats:
  * RData: `.RData` and `.hash`
  * qs: `.qs` and `.qs.hash`
* The latter also implies that `clear.cache` only deletes the files of the format specified in the config and leaves the other files untouched (might need a test there, but first want to know your opinion about this).
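The per-format file extensions could, for instance, be kept in a small lookup table (hypothetical names; this is only a sketch of the idea, not the code in `R/cache.R`):

```r
# Hypothetical lookup of cache file extensions per format. A function like
# clear.cache could use it to build patterns that match only the files of
# the configured format, leaving other formats' files untouched.
cache.extensions <- list(
  RData = c(".RData", ".hash"),
  qs = c(".qs", ".qs.hash")
)

# e.g. build glob patterns for the configured format:
patterns <- paste0("*", cache.extensions[["qs"]])
```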