DOI-USGS / mda.streams

backend tools for powstreams
Creative Commons Zero v1.0 Universal
3 stars 8 forks source link

test compression options #229

Closed aappling-usgs closed 8 years ago

aappling-usgs commented 8 years ago

question raised in https://github.com/USGS-R/mda.streams/pull/222#discussion_r57036578: what is the most efficient compression for unitted 2-4-column data.frames, which is how we're storing timeseries datasets on SB?

we agree that fast reading should be the priority, though of course fast writing and small compressed file size are also desirable

aappling-usgs commented 8 years ago

good starting points: http://rpubs.com/hadley/saveRDS https://github.com/aappling-usgs/streamMetabolizer/blob/develop/tests/testthat/helper-save_load.R maybe even http://blog.rstudio.org/2016/03/29/feather/?

aappling-usgs commented 8 years ago

feather looks cool but even if it works for us, i'm thinking we shouldn't adopt it for this project this year, because 1) it's only on github (devtools::install_github("wesm/feather/R")) and (2) "Feather uses C++11, so if you’re on windows, you’ll need the new gcc 4.93 toolchain" which is a whole additional round of installations we'd need to make on every collaborator's computer.

that said, they're pushing hard - next line says, "(All going well this will be included in R 3.3.0, which is scheduled for release on April 14. We’ll aim for a CRAN release soon after that).", so maybe it wouldn't be as bad as all that for collaborators...IF the CRAN release goes through in time...

aappling-usgs commented 8 years ago
# mda.streams data test
library(mda.streams)
dat1 <- get_ts('doobs_nwis', 'nwis_08062500')
dat2 <- get_ts('sitetime_calcLon', 'nwis_01654000')
dat3 <- get_ts(c('dischdaily_calcDMean','er_estBest','doamp_calcDAmp'), 'nwis_07263296')

sl1 <- save_load_timing(dat1, reps=5)
sl2 <- save_load_timing(dat2, reps=5)
sl3 <- save_load_timing(dat3, reps=5)

# > sl1
# Source: local data frame [10 x 8]
# 
#     type level        save       load      size      total typelevel timesize
#    (chr) (dbl)       (dbl)      (dbl)     (dbl)      (dbl)    (fctr)    (dbl)
# 1     gz     1   48.158323  19.740499 0.7653599   67.89882       gz1 1.143106
# 2     gz     6  238.926692  19.830199 0.8620081  258.75689       gz6 1.262351
# 3     gz     9 2929.961255  18.798167 0.8573570 2948.75942       gz9 2.140506
# 4    raw     0    8.590187   9.084234 3.8016768   17.67442      raw0 2.336116
# 5     xz     1  221.396080  62.755205 0.4095116  284.15128       xz1 2.592687
# 6     xz     6  815.813783  60.899201 0.3982887  876.71298       xz6 2.721585
# 7     xz     9  846.573774  62.543268 0.3982887  909.11704       xz9 2.792383
# 8     bz     1  399.983579  80.742574 0.4715405  480.72615       bz1 3.345998
# 9     bz     6  486.986184 101.418807 0.6027946  588.40499       bz6 4.203089
# 10    bz     9  512.466602 109.059519 0.5891695  621.52612       bz9 4.484858
# > sl2
# Source: local data frame [10 x 8]
# 
#     type level      save      load       size      total typelevel timesize
#    (chr) (dbl)     (dbl)     (dbl)      (dbl)      (dbl)    (fctr)    (dbl)
# 1     gz     1   3.44088 1.1785474 0.05821514   4.619427       gz1 1.310645
# 2     gz     6  18.60831 1.0524404 0.07236099  19.660755       gz6 1.465403
# 3    raw     0   1.07508 0.3861819 0.17895031   1.461262      raw0 2.216045
# 4     gz     9 230.90109 0.9951914 0.07295895 231.896281       gz9 2.360159
# 5     xz     1  16.65863 5.0151763 0.03260422  21.673804       xz1 3.181752
# 6     xz     9  67.99456 4.7342939 0.03287125  72.728855       xz9 3.253315
# 7     xz     6  66.31027 5.0717108 0.03287125  71.381981       xz6 3.430716
# 8     bz     1  16.54748 6.2035989 0.03845215  22.751081       bz1 3.897149
# 9     bz     9  17.92943 6.7044844 0.03873825  24.633913       bz9 4.180506
# 10    bz     6  16.04497 7.3075258 0.03873825  23.352495       bz6 4.502438
# > sl3
# Source: local data frame [10 x 8]
# 
#     type level       save      load       size     total typelevel timesize
#    (chr) (dbl)      (dbl)     (dbl)      (dbl)     (dbl)    (fctr)    (dbl)
# 1     gz     1  3.0573434 0.7037264 0.06089783  3.761070       gz1 1.958087
# 2     gz     6  4.4814828 0.6484578 0.06143284  5.129941       gz6 1.969640
# 3    raw     0  0.8493294 0.2703229 0.08471489  1.119652      raw0 2.192119
# 4     gz     9 16.4542774 0.6708427 0.06138515 17.125120       gz9 2.269001
# 5     xz     1 15.7109240 4.8623052 0.05676651 20.573229       xz1 4.806208
# 6     xz     9 38.1641802 4.6907631 0.05625153 42.854943       xz9 5.221858
# 7     bz     1 13.9289436 5.8033907 0.06252384 19.732334       bz1 5.497663
# 8     xz     6 41.8250706 5.3759436 0.05625153 47.201014       xz6 5.744876
# 9     bz     9 14.1852932 6.2934393 0.06252384 20.478733       bz9 5.815258
# 10    bz     6 31.2639484 5.9386473 0.06252384 37.202596       bz6 5.998094
aappling-usgs commented 8 years ago

It looks to me like gz level 1 is a very good option: it reduces file size from raw by 35-80%, is faster than all other compression types and levels for saving, and is in the top few fastest options for loading.

That said, these results make me think we could also get speed gains if we cached files after decompressing them rather than decompressing on the fly when reading and re-reading cached files.