KentonWhite / ProjectTemplate

A template utility for R projects that provides a skeletal project.
http://projecttemplate.net
GNU General Public License v3.0

qs file format as cache file format #304

Closed gisler closed 3 years ago

gisler commented 3 years ago

Proposed changes

As briefly discussed in https://github.com/KentonWhite/ProjectTemplate/issues/225, I implemented "qs" as a possible cache file format (https://CRAN.R-project.org/package=qs). For now, though, the default stays the "RData" file format.

Design considerations:

The latter also implies that clear.cache only deletes the files of the format specified in the config and leaves the other files untouched (this might need a test, but I'd first like to know your opinion on it).
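
For reference, switching a project to the new format should then just be a matter of setting the `cache_file_format` entry (the setting name used later in this thread) in the project's `config/global.dcf`, along these lines:

cache_file_format: qs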

KentonWhite commented 3 years ago

Thanks @gisler ! I'll have time over the weekend to review in detail. Please don't think I'm ignoring this...

gisler commented 3 years ago

No worries, we all have our duties, but I appreciate your message nonetheless.

KentonWhite commented 3 years ago

Thanks for this @gisler, and thanks for writing clear tests as well!

I noticed that you are using ifelse statements for implementing the caching strategy. This works if we are never going to support caching formats other than qs, which might not be the case. What do you think about implementing a strategy pattern with a little bit of metaprogramming? What I mean is a pattern like the one used for loading different data types: a simple list with the cache type prefix followed by a function that implements the caching strategy.

For example, in load.project.R, lines 228-232 could use a data structure like:

cache.strategies <- list(
  "qs" = "assign(variable, qs::qread(cache.file), envir = .TargetEnv)",
  "R"  = "load(cache.file, envir = .TargetEnv)"
)

Then the code could read:

eval(parse(text = cache.strategies[[config$cache_file_format]]))

Of course, I mean this as pseudo-code; it is not guaranteed to work, etc.

What do you think?

gisler commented 3 years ago

Hi, you're welcome, glad you like it, @KentonWhite. And yes, I did include tests, although I must say that I more or less only copied and adapted existing tests.

Anyway, while I very much like metaprogramming (I use it quite often in my DTSg package), I don't think it is necessary here. One could either use an if ... else if ... else statement to include more formats in the future or, probably more elegantly, the rather little-known switch() function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/switch.html

It achieves about what you suggest, but without the parsing of strings. If you like, I can change the statements so that they make use of this function for future reference.
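
For illustration, a minimal sketch of what switch()-based dispatch could look like on the loading side (the function name and arguments here are made up for the example and are not the actual ProjectTemplate code):

# Illustrative sketch only: dispatch on the configured cache file format.
# `variable`, `cache.file` and `.TargetEnv` stand in for the objects
# available in the real loading code.
load.cached.variable <- function(variable, cache.file, format) {
  switch(
    format,
    "qs"    = assign(variable, qs::qread(cache.file), envir = .TargetEnv),
    "RData" = load(cache.file, envir = .TargetEnv),
    stop("Unsupported cache file format: ", format)  # default branch
  )
}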

KentonWhite commented 3 years ago

@gisler The reason I was thinking about metaprogramming was eliminating the need for someone to find all of the places where a cache file format is used. With metaprogramming, adding a new format is as easy as extending the table with a new function.

I didn't look for all of the places where there is a switch. If there is only one place in the code where this switch is made, then the current way is fine. If there are multiple places, metaprogramming might be easier to maintain.

Is there only the one place where a code switch is needed?


gisler commented 3 years ago

Ah, I see. Well, there is one for saving, one for loading and one defining the file extensions. If the config value equals the file extension (which it actually does as things stand), we can easily eliminate the need to intervene there. This is a good idea of yours.

Eliminating the rest is a bit more complicated though, as there are two expressions involved and, for future formats, usually also the requirement to load a package. It should be possible nonetheless using expression(). I'll look into it next week.

gisler commented 3 years ago

Hi @KentonWhite,

I committed a rough implementation of your idea (at least I hope so) yesterday. It still requires some polishing and explanatory comments though: https://github.com/KentonWhite/ProjectTemplate/blob/88343b8a87f51c98e20f37371af095e04c070844/R/cache.R#L171-L184

So with this, cache file formats can in the future be specified by a named list of named expressions. One just has to keep in mind to add potentially required packages to the list of suggested packages. One drawback, though, is that the expressions used for saving and loading depend on objects living only in the respective functions. This requires careful explanation.
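
In rough outline, the idea is something like this simplified sketch (the actual code is in R/cache.R, linked above; the names here are illustrative):

# Simplified sketch: one named list per format, holding unevaluated
# expressions for saving and loading.
cache.file.formats <- list(
  RData = list(
    save = expression(save(list = variable, envir = .TargetEnv, file = cache.file)),
    load = expression(load(cache.file, envir = .TargetEnv))
  ),
  qs = list(
    save = expression(qs::qsave(get(variable, envir = .TargetEnv), cache.file)),
    load = expression(assign(variable, qs::qread(cache.file), envir = .TargetEnv))
  )
)

# Evaluated inside the saving/loading functions, where `variable` and
# `cache.file` exist:
eval(cache.file.formats[[config$cache_file_format]]$load)

Because eval() runs the stored expression in the calling function's frame, the expressions may only reference objects guaranteed to exist there, which is the drawback mentioned above.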

KentonWhite commented 3 years ago

Thanks @gisler! I'll take a more detailed look over the weekend.

gisler commented 3 years ago

Perfect, I finished the polishing and explanations in the meantime.

gisler commented 3 years ago

Sorry, had to "kick" in a final commit to this PR. The code should be "cleaner" now.

gisler commented 3 years ago

@KentonWhite Oh, before I forget it and since you are preparing a release: https://github.com/KentonWhite/ProjectTemplate/blob/f0f07ec386c70674647b55773cf0974905dd7603/R/project.config.R is missing underscore_variables in its documentation.

I don't want to include a fix in this PR and it's hardly worth opening a separate issue or PR for it, but I can do so in case you prefer.

KentonWhite commented 3 years ago

Thanks @gisler, looks good! One question and one comment before I merge in:

Have you checked that migrate.project() works as expected? It should, and it is always nice to confirm 👍

Since I needed to bump the version to 0.9.3 to fix an error with the new R release, can you change the version to 0.10.0, please?
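
(That is, presumably the Version field in the package's DESCRIPTION file:)

Version: 0.10.0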

gisler commented 3 years ago

Okay, no problem, @KentonWhite. I tested migrating the project and it works, although I wonder why it prints FALSE:

> migrate.project()
Migrating project configuration
[1] FALSE
Your existing project configuration in globals.dcf does not contain up to
date configuration settings in this version 0.10.0 of ProjectTemplate.  They will
be added automatically during migration, but you should review afterward.
Your config.dcf file was missing entries and defaults
have been used.  The missing entries are:
 cache_file_format

And it made me add a suppressWarnings() to .load.config() leading to cleaner messages for users.

gisler commented 3 years ago

By the way, I just did a few non-representative benchmarks on my machine:

library(arrow)
library(qs)
library(microbenchmark)
library(data.table)

data(diamonds, package = "ggplot2")
dim(diamonds)

[1] 53940 10

microbenchmark(
  write_parquet(diamonds, "test.parquet", version = "2.0"),
  write_feather(diamonds, "test.feather"),
  qsave(diamonds, "test.qs"),
  save(diamonds, file = "test.RData")
)
Unit: milliseconds
expr       min        lq      mean    median        uq       max neval
write_parquet(diamonds, "test.parquet", version = "2.0")  15.31879  15.87638  18.95168  16.46482  18.46235  60.54349   100
write_feather(diamonds, "test.feather")  11.07797  12.84449  24.74501  22.12534  30.57151  84.52629   100
qsave(diamonds, "test.qs")  25.79260  27.25619  33.19539  27.97628  34.34300  98.84864   100
save(diamonds, file = "test.RData") 119.54659 120.99080 122.60779 122.17242 122.56934 190.21427   100
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData")
)
Unit: milliseconds
expr      min       lq      mean   median       uq       max neval
read_parquet("test.parquet") 3.683474 3.829881  5.178889 3.917530 4.064670  63.03108   100
read_feather("test.feather") 3.230476 3.459291  4.808098 3.613256 3.767588  62.98305   100
qread("test.qs") 3.571077 3.724800  9.885440 3.804038 4.074666 247.98589   100
load("test.RData") 8.170787 8.422276 13.941885 8.486032 8.703510  72.46114   100

n <- 1e6
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12)
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 12
)

Unit: milliseconds
expr                                                      min        lq      mean    median        uq       max neval
write_parquet(DT, "test.parquet", version = "2.0")   699.6421  731.7620  746.7013  751.3195  765.9379  777.2242    12
write_feather(DT, "test.feather")                    689.9112  719.1069  761.8496  770.7676  798.0216  818.9127    12
qsave(DT, "test.qs")                                 384.6411  392.1583  404.9736  398.8102  420.8910  429.9794    12
save(DT, file = "test.RData")                       1612.9240 1634.8561 1653.2237 1656.8916 1665.7625 1699.2989    12

microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData"),
  times = 12
)

Unit: milliseconds
expr                              min       lq     mean   median       uq      max neval
read_parquet("test.parquet") 240.2281 245.9147 258.9374 247.9435 250.3795 324.1385    12
read_feather("test.feather") 222.1977 235.2760 325.8992 303.2075 425.3964 487.9860    12
qread("test.qs")             161.7858 167.5878 213.3876 170.3831 238.4205 426.0841    12
load("test.RData")           398.0150 401.0278 433.4332 409.4679 475.0042 481.9437    12


I generally used default settings. In the case of `DT`, the file sizes are (one way to check them is shown after the list):
* Parquet: 25.9 MB
* Feather: 26.7 MB
* qs: 15.9 MB
* RData: 19.2 MB
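
One way to check these sizes, in MB, from base R (how they were actually measured isn't stated):

file.size(c("test.parquet", "test.feather", "test.qs", "test.RData")) / 1e6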

Scaling `DT` to 1e7 rows looks like this with three repetitions for writing:

Unit: seconds
expr                                                      min        lq      mean    median        uq       max neval
write_parquet(DT, "test.parquet", version = "2.0")   6.813148  6.888617  6.963231  6.964086  7.038272  7.112459     3
write_feather(DT, "test.feather")                    7.336708  7.373148  7.465723  7.409587  7.530231  7.650874     3
qsave(DT, "test.qs")                                 4.061056  4.065581  4.068474  4.070105  4.072182  4.074259     3
save(DT, file = "test.RData")                       15.569769 15.626296 15.741904 15.682823 15.827971 15.973119     3

and like this for reading:

Unit: seconds
expr                              min       lq     mean   median       uq      max neval
read_parquet("test.parquet") 2.521096 2.525684 2.548139 2.530273 2.561661 2.593049     3
read_feather("test.feather") 2.329032 2.332010 2.342300 2.334988 2.348934 2.362880     3
qread("test.qs")             1.609163 1.615720 1.838861 1.622277 1.953710 2.285143     3
load("test.RData")           3.944234 3.982078 4.316626 4.019922 4.502822 4.985722     3


File sizes scaled approximately linearly.

Leaving one column out at a time changes the timing, but not the ranking, and adding a `POSIXct` column also doesn't change much:

n <- 1e7
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12),
  timestamp = as.POSIXct(1:n, origin = "1970-01-01")
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 3
)

Unit: seconds
expr                                                      min        lq      mean    median        uq       max neval
write_parquet(DT, "test.parquet", version = "2.0")   7.837889  8.659331  8.979386  9.480773  9.550135  9.619496     3
write_feather(DT, "test.feather")                    8.493490  8.580802  8.629774  8.668114  8.697916  8.727718     3
qsave(DT, "test.qs")                                 4.232069  4.236450  4.296098  4.240831  4.328113  4.415395     3
save(DT, file = "test.RData")                       21.064341 21.112579 21.158563 21.160817 21.205675 21.250533     3


Maybe it's different with less random data and non-default settings. I only tried `rbind`ing the diamonds dataset 200 times, resulting in more than ten million rows with ten variables (a sketch of the construction follows the timings):

Unit: seconds
expr                                                            min        lq      mean    median        uq       max neval
write_parquet(diamonds, "test.parquet", version = "2.0")   2.989435  3.051021  3.072574  3.112606  3.114143  3.115681     3
write_feather(diamonds, "test.feather")                    5.718497  5.860217  6.133362  6.001938  6.340794  6.679651     3
qsave(diamonds, "test.qs")                                 1.916572  1.928476  1.961913  1.940380  1.984583  2.028787     3
save(diamonds, file = "test.RData")                       24.133917 24.273026 24.345237 24.412135 24.450898 24.489660     3
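
For reference, the enlarged dataset was presumably built along these lines (the exact construction code isn't shown above):

data(diamonds, package = "ggplot2")
# 200 stacked copies: 200 * 53940 = 10788000 rows, 10 variables
diamonds <- do.call(rbind, replicate(200, diamonds, simplify = FALSE))
dim(diamonds)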


Anyway, I think `qs` definitely is a very, very good cache file format, especially since Parquet and Feather are limited to tabular data only, as far as I can see.

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252 LC_NUMERIC=C
[5] LC_TIME=German_Austria.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] microbenchmark_1.4-7 qs_0.23.4            data.table_1.13.0    arrow_2.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          digest_0.6.25       assertthat_0.2.1    R6_2.4.1            formatR_1.7         magrittr_1.5        RcppParallel_5.0.2
 [8] stringi_1.4.6       rlang_0.4.7         vctrs_0.3.2         stringfish_0.14.2   RApiSerialize_0.1.0 tools_4.0.2         bit64_4.0.2
[15] glue_1.4.1          purrr_0.3.4         bit_4.0.4           compiler_4.0.2      tidyselect_1.1.0



P.S. I know it's strange to do such tests once the work is done rather than before, but then, I knew `qs` would beat `RData`. Beating Parquet was never the goal.

EDIT: P.P.S. I did some more experimenting: the read and write functions from the `feather` package outperform those from the `arrow` package and, together with the `fst` format, show about the same performance as `qs`. But this is a very rough estimate.

KentonWhite commented 3 years ago

The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.

KentonWhite commented 3 years ago

It looks good! merging after tests have passed.

gisler commented 3 years ago

> The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.

Seems like the function prints whether the cache directory is missing, which was FALSE in my case, as it was already there.

And cool, I'm looking forward to making use of it in production. Hopefully there are no bugs.