Feature Request - Feather format for cache files

rsangole commented 6 years ago

Report an Issue / Request a Feature

I'm submitting a (Check one with "x") :

[ ] bug report
[x] feature request

Issue Severity Classification -

(Check one with "x") :

[ ] 1 - Severe
[ ] 2 - Moderate
[x] 3 - Low

Expected Behavior

I recommend adding an option to use the feather file format when caching dataframe objects. Read about feather here and here.

Advantages of feather for dataframe objects:

BLAZING fast compared to RData in R or pickles in Python. I've been using feather at work, and it's been really really fast.
Cached files can be directly accessed from Python (or vice versa), given that feather was developed jointly by Wes McKinney & Hadley Wickham. This helps when collaborating on projects with Python aspects.

Possible Solution

Potential call:

cache('dataframe_object', feather = TRUE)

should save an object called dataframe_object.feather.

Hugovdberg commented 6 years ago

Since the cache files shouldn't be accessed by users on their own I would rather stick to the convention over configuration philosophy and just introduce it as the default if it provides all functionality we need. It sounds pretty awesome, and like something I could really use in my day to day work 👍 Also, we really need a reader for feather files in that case.

rsangole commented 6 years ago

@Hugovdberg the feather package has simple functions write_feather and read_feather which work great.

I don't disagree with the default options of cache (feather can only deal with dataframes anyway; it cannot deal with other R objects), but a choice for users of the package who have to deal with large datasets would be extremely useful. Thus I'm proposing that cache('name', feather = FALSE) would be the default.

Hugovdberg commented 6 years ago

Better performance shouldn't be an option, it should just be implemented right? ;-) Some checks for object types should be built in anyway to make this possible, so I would suggest we add a tryCatch to use write_feather first, and if it fails we can fallback to base::save. The cached objects are stored in a data.frame anyway so perhaps we can actually cache everything using feather.

I can try to implement this sometime soon, probably over the weekend.

rsangole commented 6 years ago

Always want better performance 👍

If cache always stores data.frame, then feather is an excellent alternative.

Here is my benchmark comparison between base::save and feather::write_feather on my Intel Xeon 2.6 GHz, 128 GB, with a Toshiba 2 TB 7200 RPM SATA3 64 MB hard drive. It shows that feather is about 18x faster than base for 10 million rows (10x faster for 1 million rows). You'll see similar performance on the read side as well.

nrows <- 1e7
fake_data <- dplyr::tibble(
    dates = as.POSIXct(lubridate::today()) + 1:nrows,
    random_numers = runif(nrows),
    booleans = sample(c(T, F), size = nrows, replace = T),
    strings = sample(letters, size = nrows, replace = T)
)
na_table <- dplyr::tibble(
    na_rows = sample(1:nrow(fake_data), 1e3, replace = F),
    na_cols = sample(1:ncol(fake_data), 1e3, replace = T)
)
for (i in 1:nrow(na_table)) {
    fake_data[na_table$na_rows[i],
              na_table$na_cols[i]] <- NA
}
head(fake_data)
#> # A tibble: 6 x 4
#>   dates               random_numers booleans strings
#>   <dttm>                      <dbl> <lgl>    <chr>  
#> 1 2018-03-14 20:00:01       0.00834 TRUE     k      
#> 2 2018-03-14 20:00:02       0.00219 FALSE    l      
#> 3 2018-03-14 20:00:03       0.652   FALSE    q      
#> 4 2018-03-14 20:00:04       0.0718  FALSE    w      
#> 5 2018-03-14 20:00:05       0.775   FALSE    z      
#> 6 2018-03-14 20:00:06       0.212   FALSE    t

result_of_timing_test <- microbenchmark::microbenchmark(
    base=save(fake_data, file = 'fake_data.RData'),
    feather=feather::write_feather(fake_data, 'fake_data.feather'),
    times = 10
)

print(result_of_timing_test,signif = 2)
#> Unit: seconds
#>     expr  min   lq mean median   uq  max neval
#>     base 24.0 24.0 24.0   24.0 24.0 26.0    10
#>  feather  1.1  1.1  1.3    1.2  1.4  1.6    10

microbenchmark::autoplot.microbenchmark(result_of_timing_test)
#> Loading required namespace: ggplot2

KentonWhite commented 6 years ago

feather is a good idea for caching. What I'm hearing is that making it a suggested package, and not a dependency, is the way forward? What do we do about migrating projects? Keep it .rdata unless a user runs migrate.project()? Update the cache to feather silently? Have a mix of .rdata and feather in the cache? Ask the user if they want to upgrade their project?

Hugovdberg commented 6 years ago

I was just looking into using feather for the cache, but my proposed tryCatch solution would create a problem with loading from the cache. The cache should return items exactly as they were written to disk, but read_feather always returns a tibble. Also, I was mistaken that all data is cached inside a data.frame, and even if it was it wouldn't help because the feather package only allows atomic column types.

I did some benchmarks comparing as.data.table(read_feather()) to load(), and feather is a lot faster even with the conversion, but I'm not sure if all uses of data.table are compatible.
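
The mismatch described above (a plain data.frame goes in, a tibble comes out) can be checked with a small round trip. This is only a sketch: feather_roundtrip_is_identical is a hypothetical helper name, not part of feather or ProjectTemplate, and the demo at the bottom runs only if the feather package is installed.

```r
# Hypothetical helper: write x to a temporary feather file, read it back,
# and report whether the restored object is identical to the original.
feather_roundtrip_is_identical <- function(x) {
    path <- tempfile(fileext = ".feather")
    on.exit(unlink(path))
    feather::write_feather(x, path)
    identical(x, feather::read_feather(path))
}

if (requireNamespace("feather", quietly = TRUE)) {
    df <- data.frame(id = 1:3, label = c("a", "b", "c"),
                     stringsAsFactors = FALSE)
    # read_feather returns a tibble, so the class no longer matches the
    # plain data.frame that was written.
    print(feather_roundtrip_is_identical(df))
}
```

A check like this only shows whether an object survives the round trip; it says nothing about whether downstream code tolerates the class change.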

My suggestion therefore would be to do the following (pseudo coded):

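# Write to cache: feather for plain data.frames and tibbles, RData otherwise
if (identical(class(x), 'data.frame') || tibble::is_tibble(x)) {
    feather::write_feather(x, file)
} else {
    save(x, file = file)
}

# Read from cache: choose the reader based on the file extension
if (tools::file_ext(file) == 'feather') {
    assign(varname, feather::read_feather(file))
} else {
    load(file)
}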

The major advantage of this is that there is no backward incompatibility. Files should not be manually saved to or read from the cache directory, so we can accept a hybrid state, even just writing new files to feather while keeping old .RData files until the variable is cleared from the cache at one point.

Regarding just implementing or making it optional: I don't feel bad about adding a dependency on feather, but if you guys do we should make it optional.

KentonWhite commented 6 years ago

This looks good. If feather is an optional dependency, we will need a way to select feather or rData in the config, with rData the default.

rsangole commented 6 years ago

@Hugovdberg just following up on this. Are you taking care of this feature, or would you like someone else to pick it up? cheers!

Hugovdberg commented 6 years ago

As mentioned in #191 we should investigate performance of feather and fst in relation to speed and compatibility with data types.

rsangole commented 6 years ago

Agreed, thanks for bringing fst to this discussion.

A quick look at these packages leads me to state...

fst is highly optimized for SSD disks. We should investigate if it has similar performance on non-SSD drives.
Both fst and feather are intended for dataframes only.
feather gives the user interoperability with Python code. Not sure if fst does as well.

Why can't we support both as an option to cache()? We can keep the default as whichever one is generally the faster option for most users, but users with specific needs can cache to the type they like.

KentonWhite commented 6 years ago

We should choose one. ProjectTemplate is meant to be opinionated. While fst is faster in some circumstances, it does still return a dataframe. We've made the decision to move towards tibbles, which is what feather returns. My opinion is we support feather. Thoughts?

rsangole commented 6 years ago

If we had to pick one, I would also pick feather. Apart from the tibble returned, I also enjoy the fact that it's cross-platform compatible, which enables data scientists to mix Python and R code efficiently. Since it's being developed by Wickham and McKinney, it'll enjoy long-term support too.

Hugovdberg commented 6 years ago

I agree feather is probably the nicest, although I disagree with your argument that interoperability with Python is a pro. The cache isn't meant to be read by other programs. I'm trying to get this to work but there are a few more hiccups. The is_tibble function also returns true on sf objects (spatial dataframes), but write_feather will issue a warning and output an incomplete file for them. How do you guys feel about a construction like this (again pseudocoded):

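old.warn <- options(warn = 2) # Promote all warnings to errors
tryCatch({
    feather::write_feather(variable, cache.file.feather)
    stopifnot(identical(variable, feather::read_feather(cache.file.feather)))
}, error = function(e) {
    # Remove the incomplete feather file and fall back to RData
    if (file.exists(cache.file.feather)) {
        file.remove(cache.file.feather)
    }
    save(variable, file = cache.file.rdata)
}, finally = {
    options(old.warn) # Restore the previous warning level
})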

This means that whenever feather does not give back exactly what we tried to write, we fall back to the standard save functionality.

KentonWhite commented 6 years ago

Has this issue been raised with the feather maintainers?

I like checking that the result is the same. A bit concerned that there will be a mixture of .RData and feather files in the cache. But I'm OK with this construction.

rsangole commented 6 years ago

although I disagree with your argument that interoperability with Python is a pro. The cache isn't meant to be read by other programs....

@Hugovdberg this is actually from a usecase I face everyday within the same project. When working with large dataframes (10m+, 100m+ rows), R's interactive visualization methods (plotly, ggplotly etc) are painfully slow. This is an example of a workflow like:

Save to feather in cache/
Use Python code saved in src/python/ to interact with data using glueviz

Another colleague does something similar, where Python's dictionaries and capability to use hashtables result in a combined R+Python approach. Thus, feather+cache has come in very handy.

Re: the pseudocode, that seems fine and it'll work well for smaller datasets. The stopifnot(identical(variable, read_feather(cache.file))) might (approximately) double execution time. Don't have a solution for you, but eventually we can figure something out.
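
One possible way to reduce how often the read-back check is needed, sketched here purely as an assumption and not something proposed in this thread, is a cheap pre-write filter built on the earlier observation that feather only allows atomic column types. The name is_feather_candidate is hypothetical. Passing the filter is necessary but not sufficient: it does not guarantee that feather round-trips the object exactly.

```r
# Hypothetical pre-write check; the function name is an assumption.
is_feather_candidate <- function(x) {
    # Accept only plain data.frames and plain tibbles. Subclasses such as
    # sf carry extra classes and are rejected here.
    plain_classes <- list("data.frame", c("tbl_df", "tbl", "data.frame"))
    has_plain_class <- any(vapply(plain_classes, identical, logical(1), class(x)))
    if (!has_plain_class) return(FALSE)
    # Every column must be an atomic vector: no list columns, no matrix
    # columns, no nested data.frames.
    all(vapply(x, function(col) is.atomic(col) && is.null(dim(col)), logical(1)))
}

plain <- data.frame(a = 1:3)
with_list <- data.frame(a = 1:2)
with_list$b <- list(1, "x")   # add a list column

is_feather_candidate(plain)       # TRUE
is_feather_candidate(with_list)   # FALSE
is_feather_candidate(list(a = 1)) # FALSE
```

Objects that fail the filter would go straight to base::save, so the expensive identical() comparison would only be paid for objects that look feather-compatible.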

gisler commented 4 years ago

Hi,

I would like to bring a rather new package to this discussion: qs.

According to its "Using qs" vignette, it seems to be fast and able to serialize almost all R objects.

gisler commented 3 years ago

Hi all,

I would like to implement qs (see my comment above) as a (possible) replacement for the RData format for cache files.

My questions now are:

Is this okay with you?
Given it is okay with you, how should we design compatibility to old projects? One possibility would be to add a cache file format option or so to the configuration file allowing one to select the format. Another would be to "upgrade" the RData files on project migration; saving them as qs files for future use, but leave the RData files untouched.

What do you think?

KentonWhite commented 3 years ago

Hi,

This sounds like a really great idea! What I would like to do for compatibility is a staged approach:

Start by making this an option in the config with the default set to the old format.
Then after it is stable, we set the default for new projects to this format.
Then offer a migration path for upgrading with the option to not upgrade if you don't want to.
Then if that is stable we remove the old way and use the new way in all cases.

I like these staged rollouts because they make it easier to find and fix errors. In Stage 1 we are getting bugs from people who know what they are doing. This helps us more easily debug problems with the qs format. In Stage 2 we start to get the newbie problems, since someone downloads ProjectTemplate, sets up a new project and then runs into issues with qs. Stage 3 lets us discover migration issues before rolling into Stage 4.

Does this plan work for you?

gisler commented 3 years ago

Sure, this plan works very well for me. And I'm really glad you like the idea.

I'll add cache_file_format: RData to the configs. The other option will be "qs".

Once I'm done, I'll open a pull request.
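
A minimal sketch of how a cache_file_format setting could drive reading and writing, assuming the two values named above ("RData" and "qs"). The helper names cache_write and cache_read, their arguments, and the use of the format name as the file extension are all assumptions for illustration, not the eventual pull request. The qs branch assumes the qs package is installed.

```r
# Hypothetical helpers; names, arguments and file naming are assumptions.
cache_write <- function(varname, dir, format = "RData", envir = parent.frame()) {
    force(envir)
    format <- match.arg(format, c("RData", "qs"))
    path <- file.path(dir, paste0(varname, ".", format))
    if (format == "qs") {
        # qs serializes a single object, so fetch it by name first
        qs::qsave(get(varname, envir = envir), path)
    } else {
        save(list = varname, envir = envir, file = path)
    }
    invisible(path)
}

cache_read <- function(varname, dir, format = "RData", envir = parent.frame()) {
    force(envir)
    format <- match.arg(format, c("RData", "qs"))
    path <- file.path(dir, paste0(varname, ".", format))
    if (format == "qs") {
        assign(varname, qs::qread(path), envir = envir)
    } else {
        load(path, envir = envir)
    }
    invisible(varname)
}

# Usage with the default (old) format
cache_dir <- tempfile("cache")
dir.create(cache_dir)
scores <- c(a = 1, b = 2)
cache_write("scores", cache_dir)
rm(scores)
cache_read("scores", cache_dir)
print(scores)
```

Keeping "RData" as the default argument mirrors Stage 1 of the rollout plan: nothing changes for existing projects unless the option is set explicitly.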

KentonWhite / ProjectTemplate