Improve performance issues of new functions

rabutler-usbr commented 6 years ago

The rdf_aggregate() and rw_scen_aggregate() functions are noticeably slower that getDataForAllScens().

Initial profiles and microbenchmark results indicate that creating the table from the rdf is slow, i.e., reading in the csv instead of converting the rdf to the tbl, results in a 30% increase in speed. Additionally, adding the year and month columns is quite slow. Finally, add_var_drop_objectslot() was doing a bunch of unnecessary string matching.

Initial results of creating the data.frame in C++ instead of R results in over a 30x decrease in time for that step.

We'll go ahead and fully re-implement rdf_to_rwtbl() in C++.

[x] initial C++ structure
[x] add in Year and Month columns
- [x] make sure Month is the full month name, and not the month number
[x] address add_var_drop_objectslot() issues see f96f3691
[x] ~~update rdf_to_rwtbl() so that it has a new argument that determines whether it uses the c++ or R versions~~
- added rdf_to_rwtbl2() instead
[x] update DataFrame returned by rdf_to_rwtbl_cpp() so that it returns the correct column types, i.e., strings and doubles, instead of factors
[x] add attributes to DataFrame returned by rdf_to_rwtbl_cpp()
[x] rdf_to_rwtbl_cpp() should use following parameters:
- [x] scenario
- [x] keep_cols
- [x] add_ym
[x] update rdf_aggregate() and rw_scen_aggregate() to use the c++ version
- [x] unit tests for the cpp argument, i.e., compare results of the two versions
[x] implement rdf_to_rwtbl2()
- [x] all checks similar to rdf_to_rwtbl()
- [x] unit tests
- [x] documentation
- [x] deprecate rdf_to_rwtbl()
[x] unit tests for read_rdf()

[x] verify performance is increased

[x] ensure c++ version handles scalar slots the same way as the R version
- [x] unit tests for the scalar slots
[x] remove original c++ version

rabutler-usbr commented 6 years ago

increases in speed from c++:

rdf_vec <- as.matrix(data.table::fread(
  "inst/extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.rdf", 
  sep = '\t', 
  header = FALSE, 
  data.table = FALSE
))

rdf <- read_rdf("inst/extdata/Scenario/ISM1988_2014,2007Dems,IG,Most/KeySlots.rdf")

mb <- microbenchmark::microbenchmark(
  "c++" = RWDataPlyr:::rdf_to_rwtbl_cpp(rdf_vec) %>% tibble::as_tibble(),
  "R" = rdf_to_rwtbl(rdf, keep_cols = TRUE),
  times = 50
)

Unit: milliseconds expr min lq mean median uq max neval c++ 2.855493 2.988783 3.681731 3.157157 3.689432 9.420164 50 R 91.439588 94.371598 104.210205 100.378253 104.839387 203.569684 50

rabutler-usbr commented 6 years ago

initial benchmarks. rwcsv_aggregate() code is in #56.

sal <- slot_agg_list(
  matrix(
    c("KeySlots.rdf", "Mead.Pool Elevation", "EOCY", NA, "meadPe"), 
    nrow = 1
  )
)

rwa <- rwd_agg(data.frame(
  file = "KeySlots.rdf",
  slot = "Mead.Pool Elevation",
  period = "December",
  summary = NA,
  eval = NA,
  t_s = NA,
  variable = "meadPe",
  stringsAsFactors = FALSE
))

scen <- "ISM1988_2014,2007Dems,IG,Most"

spath <- system.file("extdata", "Scenario/", package = "RWDataPlyr")

mb <- microbenchmark::microbenchmark(
  "rdfagg" = rdf_aggregate(rwa, file.path(spath, scen)),
  "rwcsvagg" = rwcsv_aggregate(rwa, file.path(spath, scen)), 
  "scenagg" = rw_scen_aggregate(scen, rwa, spath, scen_names = scen),
  "getallscens" = getDataForAllScens(scen, scen, sal, spath, "tmp.feather"),
  times = 10
)

Unit: milliseconds expr min lq mean median uq max neval rdfagg 142.88253 147.09701 164.82081 152.74294 164.9856 262.35517 10 rwcsvagg 115.72362 124.42268 145.17638 128.23835 145.0416 214.96062 10 scenagg 154.12925 157.36275 181.30090 165.38867 187.5771 257.58638 10 getallscens 22.58018 23.44445 25.08013 24.57688 25.4921 33.07653 10

rabutler-usbr commented 6 years ago

Testing the c++ and original versions yields speeds much closer (though still slower) to getDataForAllScens(). Results and call are below.

For multiple scenarios and multiple slots, the new version is still significantly slower.

sal <- slot_agg_list(
  matrix(
    c("KeySlots.rdf", "Mead.Pool Elevation", "EOCY", NA, "meadPe"), 
    nrow = 1
  )
)

rwa <- rwd_agg(data.frame(
  file = "KeySlots.rdf",
  slot = "Mead.Pool Elevation",
  period = "December",
  summary = NA,
  eval = NA,
  t_s = NA,
  variable = "meadPe",
  stringsAsFactors = FALSE
))

scen <- "ISM1988_2014,2007Dems,IG,Most"

spath <- system.file("extdata", "Scenario/", package = "RWDataPlyr")
mb <- microbenchmark::microbenchmark(
  "rdfagg cpp" = rdf_aggregate(rwa, file.path(spath, scen), cpp = TRUE),
  "rdfagg old" = rdf_aggregate(rwa, file.path(spath, scen), cpp = FALSE),
  "scenagg cpp" = rw_scen_aggregate(scen, rwa, spath, scen_names = scen, cpp = TRUE),
  "scenagg old" = rw_scen_aggregate(scen, rwa, spath, scen_names = scen, cpp = FALSE),
  "getallscens" = getDataForAllScens(scen, scen, sal, spath, "tmp.feather"),
  times = 50
)

Unit: milliseconds expr min lq mean median uq max neval rdfagg cpp 18.125926 22.11543 25.56790 24.08075 27.17055 50.06527 50 rdfagg old 59.040079 68.08118 82.27357 72.54778 81.52400 193.95311 50 scenagg cpp 19.442246 23.88592 27.78990 25.17192 27.07535 116.47187 50 scenagg old 56.164486 70.96946 90.60099 74.78337 84.11714 194.07794 50 getallscens 9.334126 11.36238 15.08190 12.05739 13.55107 124.40470 50

rwa <- rwd_agg(read.csv(
  system.file(
    "extdata/rwd_agg_files/passing_aggs.csv", 
    package = "RWDataPlyr"
  ), 
  stringsAsFactors = FALSE
))

scens <- c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002")

Unit: milliseconds expr min lq mean median uq max neval scenagg cpp 131.84174 154.83978 197.87294 167.7440 245.36185 437.18340 50 scenagg old 372.53776 395.34010 483.96448 422.5635 506.26424 1352.75665 50 getallscens 10.94382 12.04928 14.07541 13.1163 15.95522 24.06717 50

rabutler-usbr commented 6 years ago

The second performance check of rw_scen_aggregate() was not comparing to getting the same number of slots in getDataForAllScens(), when it does, it performs better, but is still ~ 4x slower.

rwa <- rwd_agg(read.csv(
  system.file(
    "extdata/rwd_agg_files/passing_aggs.csv", 
    package = "RWDataPlyr"
  ), 
  stringsAsFactors = FALSE
))

sal <- slot_agg_list(matrix(c(
  "KeySlots.rdf", "Mead.Pool Elevation", "AnnMinLTE", 1e+03, "peLt1000",
  "KeySlots.rdf", "Mead.Pool Elevation", "EOCY", NA, "peEocy",
  "KeySlots.rdf", "Powell.Outflow", "BOCY", NA, "julyRel",
  "KeySlots.rdf", "Powell.Outflow", "WYMaxLTE", 400000, "pMonthlyGt400k",
  "KeySlots.rdf", "Powell.Outflow", "AnnualSum", 1e-03, "pwyRel",
  "SystemConditions.rdf", "SummaryOutputData.LBShortageConditions", "AnnualRaw", NA, "short",
  "SystemConditions.rdf", "SummaryOutputData.UpperBalancingAbove823", "AnnualRaw", NA, "ueb823"
), nrow = 7, byrow = TRUE))

scens <- c("ISM1988_2014,2007Dems,IG,Most", "ISM1988_2014,2007Dems,IG,2002")

mb <- microbenchmark::microbenchmark(
  "scenagg cpp" = rw_scen_aggregate(scen, rwa, spath, scen_names = scen, cpp = TRUE),
  "scenagg old" = rw_scen_aggregate(scen, rwa, spath, scen_names = scen, cpp = FALSE),
  "getallscens" = getDataForAllScens(scen, scen, sal, spath, "tmp.feather"),
  times = 50
)

Unit: milliseconds
        expr      min        lq      mean   median        uq      max neval
 scenagg cpp 160.3029 170.61621 207.71346 183.2746 250.29920 350.6526    50
 scenagg old 294.8696 303.90086 341.17568 311.0191 385.55709 504.3767    50
 getallscens  43.4872  44.93857  51.58195  47.3949  50.56352 177.1446    50

BoulderCodeHub / RWDataPlyr

Improve performance issues of new functions #85