RGLab / flowWorkspace


Allow more efficient manipulation of individual keywords in cytoframes #351

Closed jacobpwagner closed 3 years ago

jacobpwagner commented 3 years ago

The cf_keyword_ methods currently rely on altering individual keywords at the R level in cytoframe objects before re-assigning the full set of keywords. For example, cf_keyword_delete:

1) Pulls the keywords to an R-level list
2) Removes the appropriate entry and reassigns the full list

The replacement is done by constructing an entirely new cytolib::KW_PAIR object and then replacing the full set of keywords using cytolib::CytoFrame::set_keywords.

These changes rely on additional direct keyword manipulation methods added by https://github.com/RGLab/cytolib/pull/43 and aim to:

1) Improve efficiency by manipulating individual key-value pairs at the cytolib level, avoiding a full construction and replacement via cytolib::CytoFrame::set_keywords.
2) Allow the cf_keyword_ methods to take vectors of keys and values, which makes the functions more convenient and flexible and pushes the loop iterations down to the cytolib level.
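As a rough illustration of the second point, the sketch below calls each method once with a vector of keys. It assumes the vectorised signatures described in this PR and that flowWorkspaceData is installed. The `demo_key_*` names and the `cf_demo` variable are made up for the example.

```r
library(flowCore)
library(flowWorkspace)

fcs_path <- system.file("extdata", "CytoTrol_CytoTrol_1.fcs", package = "flowWorkspaceData")
# Work on a realized copy so the loaded cytoframe is left untouched.
cf_demo <- realize_view(load_cytoframe_from_fcs(fcs_path, backend = "mem"))

# Hypothetical keyword names used only for this demonstration.
demo_keys <- c("demo_key_1", "demo_key_2")
renamed_keys <- c("demo_key_A", "demo_key_B")

# Insert two new keywords in one call (the keys must not already exist).
cf_keyword_insert(cf_demo, demo_keys, c("a", "b"))

# Overwrite both values in one call.
cf_keyword_set(cf_demo, demo_keys, c("x", "y"))

# Rename both keys in one call, then read the values back.
cf_keyword_rename(cf_demo, demo_keys, renamed_keys)
after_rename <- keyword(cf_demo)[renamed_keys]

# Delete both keys in one call and check that nothing is left behind.
cf_keyword_delete(cf_demo, renamed_keys)
remaining <- intersect(c(demo_keys, renamed_keys), names(keyword(cf_demo)))
```

Each call modifies `cf_demo` by reference, so no reassignment of the cytoframe is needed.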

A simple example testing/benchmarking script and its output follow:

library(flowCore)
library(flowWorkspace)
library(microbenchmark)

fcs_path <- system.file("extdata", "CytoTrol_CytoTrol_1.fcs", package = "flowWorkspaceData")
cf <- load_cytoframe_from_fcs(fcs_path, backend = "mem")
cf_test <- realize_view(cf)

# Setup for benchmarking and quick demonstration
test_keywords <- c("EXPERIMENT NAME", "WINDOW EXTENSION")
test_values <- c("C3_Bcell", "11.00")
keyword(cf_test)[test_keywords]

test_full <- keyword(cf)
test_full[test_keywords] <- test_values

cf_keyword_set(cf_test, test_keywords, test_values)
keyword(cf_test)[test_keywords]

## The function logic before the changes
cf_keyword_insert_old <- function(cf, keyword, value){
  kw <- keyword(cf)
  kn <- names(kw)
  idx <- match(keyword, kn)
  if(!is.na(idx))
    stop("keyword already exists:", keyword)
  kw[[keyword]] <- value
  keyword(cf) <- kw
}

cf_keyword_delete_old <- function(cf, keyword){
  kw <- keyword(cf)
  kn <- names(kw)
  idx <- match(keyword, kn)
  na_idx <- is.na(idx)
  if(any(na_idx))
    stop("keyword not found:", paste(keyword[na_idx], collapse = ", "))
  keyword(cf) <- kw[-idx]
}

cf_keyword_rename_old <- function(cf, from, to){
  kw <- keyword(cf)
  kn <- names(kw)
  idx <- match(from, kn)
  if(is.na(idx))
    stop("keyword not found:", from)
  names(keyword(cf))[idx] <- to
}

print("Benchmarking...")
for(backend_type in c("mem", "h5", "tile")){
  cat("**********\n")
  cf <- load_cytoframe_from_fcs(fcs_path, backend = backend_type)
  cf_test <- realize_view(cf)
  print(paste0("Verifying backend: ", flowWorkspace:::cf_backend_type(cf_test)))
  print("cf_keyword_insert")
  print(microbenchmark(old = {cf_keyword_insert_old(cf_test, "new_key", "new_val")},
                 new = cf_keyword_insert(cf_test, "new_key", "new_val"),
                 times = 10,
                 setup = cf_test <- realize_view(cf)))

  print("cf_keyword_delete")
  print(microbenchmark(old = cf_keyword_delete_old(cf_test, "EXPERIMENT NAME"),
                 new = cf_keyword_delete(cf_test, "EXPERIMENT NAME"),
                 times = 10,
                 setup = cf_test <- realize_view(cf)))

  print("cf_keyword_rename")
  print(microbenchmark(old = cf_keyword_rename_old(cf_test, "EXPERIMENT NAME", "EXPT NM"),
                 new = cf_keyword_rename(cf_test, "EXPERIMENT NAME", "EXPT NM"),
                 times = 10,
                 setup = cf_test <- realize_view(cf)))

  print("cf_keyword_set") # no old comparator other than keyword<-
  print(microbenchmark(full_replacement = keyword(cf_test)[test_keywords]<-test_values,
                 partial_replacement = cf_keyword_set(cf_test, test_keywords, test_values),
                 times = 10,
                 setup = cf_test <- realize_view(cf)))
  cat("**********\n")
}

The benchmarking output:

**********
[1] "Verifying backend: mem"
[1] "cf_keyword_insert"
Unit: microseconds
 expr      min       lq      mean   median       uq      max neval cld
  old 1800.873 1907.335 2925.3342 2090.085 2754.430 6635.035    10   b
  new  250.306  271.746  313.4167  300.784  335.581  459.588    10  a 
[1] "cf_keyword_delete"
Unit: microseconds
 expr      min       lq      mean   median       uq      max neval cld
  old 1824.475 1917.642 2617.2808 1966.323 2068.181 8435.488    10   b
  new  247.142  251.111  282.2386  267.337  306.658  366.803    10  a 
[1] "cf_keyword_rename"
Unit: microseconds
 expr      min       lq      mean    median       uq      max neval cld
  old 1945.592 2008.290 2513.2999 2020.5335 2175.649 6480.525    10   b
  new  244.523  247.253  296.1272  259.7325  297.966  504.266    10  a 
[1] "cf_keyword_set"
Unit: microseconds
                expr      min       lq      mean    median       uq      max neval cld
    full_replacement 1795.993 1868.284 1965.0380 1914.8370 1980.634 2493.492    10   b
 partial_replacement   23.446   24.628   32.2512   25.4405   41.665   55.462    10  a 
**********
**********
[1] "Verifying backend: h5"
[1] "cf_keyword_insert"
Unit: microseconds
 expr      min       lq      mean   median       uq      max neval cld
  old 1763.239 1872.059 1995.0506 2031.793 2070.011 2193.694    10   b
  new  254.134  264.404  333.4856  339.918  360.156  436.972    10  a 
[1] "cf_keyword_delete"
Unit: microseconds
 expr      min       lq      mean   median       uq      max neval cld
  old 1838.566 1930.412 1976.6836 1947.431 2004.548 2153.734    10   b
  new  273.895  286.799  317.0669  303.564  330.590  385.658    10  a 
[1] "cf_keyword_rename"
Unit: microseconds
 expr      min       lq      mean   median       uq      max neval cld
  old 1967.031 2029.698 2144.7114 2154.432 2181.240 2345.294    10   b
  new  268.429  287.707  320.3019  315.955  354.322  379.741    10  a 
[1] "cf_keyword_set"
Unit: microseconds
                expr      min       lq     mean   median       uq      max neval cld
    full_replacement 1751.977 1918.547 1993.517 1993.418 2115.405 2166.282    10   b
 partial_replacement   28.907   30.727   39.638   33.239   47.835   62.673    10  a 
**********
**********
[1] "Verifying backend: tile"
[1] "cf_keyword_insert"
Unit: microseconds
 expr      min       lq     mean    median       uq      max neval cld
  old 1589.161 1697.365 1744.378 1720.8110 1819.443 1916.139    10   b
  new  251.994  252.300  279.778  259.4335  290.371  357.807    10  a 
[1] "cf_keyword_delete"
Unit: microseconds
 expr      min       lq      mean    median       uq      max neval cld
  old 1543.048 1618.277 1709.1417 1725.5665 1815.820 1856.563    10   b
  new  266.460  269.846  279.1537  275.9585  285.833  297.836    10  a 
[1] "cf_keyword_rename"
Unit: microseconds
 expr      min       lq      mean   median       uq      max neval cld
  old 1709.765 1803.822 1892.0482 1851.283 2001.363 2195.599    10   b
  new  262.759  273.404  307.8543  293.800  338.686  395.405    10  a 
[1] "cf_keyword_set"
Unit: microseconds
                expr      min       lq      mean   median       uq      max neval cld
    full_replacement 1534.826 1629.764 1817.7551 1860.925 2005.054 2069.633    10   b
 partial_replacement   29.006   29.419   34.9146   30.355   32.422   74.375    10  a 
**********
jacobpwagner commented 3 years ago

Just quickly summarizing the approximate speed gains (old mean / new mean) for this single-cytoframe example:

cf_keyword_insert: mem: 9.35 h5: 5.99 tile: 6.25

cf_keyword_delete: mem: 9.28 h5: 6.23 tile: 6.13

cf_keyword_rename: mem: 8.49 h5: 6.7 tile: 6.16

cf_keyword_set: mem: 61.4 h5: 51.1 tile: 53.44

So the biggest gains are clearly in cf_keyword_set, compared with the best approach available (subset assignment via keyword<-) before it was added in https://github.com/RGLab/flowWorkspace/commit/79b4bf0f057c35611b21e1a56f8613af80c69e71. A 6-9x speedup for the other methods isn't bad either.
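For anyone who wants to re-derive the ratios, here is a small base-R sketch that recomputes old mean / new mean from the mean columns of the benchmark output above. The matrix names are illustrative. Recomputed values may differ slightly from the rounded figures quoted in this comment.

```r
# Mean timings in microseconds, copied from the microbenchmark output above.
# For cf_keyword_set, "old" is the full_replacement row and "new" is the
# partial_replacement row.
old_mean <- rbind(
  cf_keyword_insert = c(mem = 2925.3342, h5 = 1995.0506, tile = 1744.378),
  cf_keyword_delete = c(mem = 2617.2808, h5 = 1976.6836, tile = 1709.1417),
  cf_keyword_rename = c(mem = 2513.2999, h5 = 2144.7114, tile = 1892.0482),
  cf_keyword_set    = c(mem = 1965.0380, h5 = 1993.517,  tile = 1817.7551)
)
new_mean <- rbind(
  cf_keyword_insert = c(mem = 313.4167, h5 = 333.4856, tile = 279.778),
  cf_keyword_delete = c(mem = 282.2386, h5 = 317.0669, tile = 279.1537),
  cf_keyword_rename = c(mem = 296.1272, h5 = 320.3019, tile = 307.8543),
  cf_keyword_set    = c(mem = 32.2512,  h5 = 39.638,   tile = 34.9146)
)

# Element-wise ratio: rows are methods, columns are backends.
speedup <- round(old_mean / new_mean, 2)
print(speedup)
```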

If this all looks good, I'll go ahead and merge it in and then incorporate it into the analogous methods for GatingHierarchy, cytoset, and GatingSet.