DillonHammill / CytoExploreR

Interactive Cytometry Data Analysis

Speed Boost - Sampling & Coercion of cytoset to cytoframe #99

Closed DillonHammill closed 3 years ago

DillonHammill commented 3 years ago

Currently, CytoExploreR handles sampling and coercion of cytosets independently. This means that the data is first coerced to a single cytoframe and then sampled downstream using cyto_sample(). The initial coercion step is by far the most computationally taxing, so we need to consider sampling prior to merging to speed up this process.

Preparation of data:

library(CytoExploreR)
library(CytoExploreRData)

gs <- cyto_load(
  system.file(
    "extdata/Activation-GatingSet",
    package = "CytoExploreRData"
    )
  )

cs <- cyto_data_extract(gs)[[1]]

If we use the current approach of coercion and then sampling (total events = 50000):

system.time({
  cf <- flowFrame_to_cytoframe(as(cs, "flowFrame"))
  cyto_sample(cf,
              50000,
              seed = 56)
})
   user  system elapsed 
   6.24    1.68    8.14 

If we sample each cytoframe prior to merging (total events ~ 50000):

system.time({
  cs_new <- cyto_sample(cs,
                        1515, 
                        seed = 56)
  flowFrame_to_cytoframe(as(cs_new, "flowFrame"))
})
   user  system elapsed 
   2.17    0.90    3.17 

If we sample each cytoframe, merge and then sample again (total events = 50000):

system.time({
  cs_new <- cyto_sample(cs,
                        1516, 
                        seed = 56)
  cf_new <- flowFrame_to_cytoframe(as(cs_new, "flowFrame"))
  cyto_sample(cf_new,
              50000,
              seed = 56)
})
   user  system elapsed 
   2.07    0.75    2.89 

Sampling each cytoframe prior to merging does offer a significant speed boost, but there will not be exactly 50000 events in the merged sample due to rounding (1515 * 33 = 49995). Exact counts can be obtained by sampling slightly more events per cytoframe and then sampling the merged cytoframe down to the desired number of events. Interestingly, this double sampling approach is still significantly faster than merging first and sampling afterwards.
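The rounding arithmetic and the double sampling idea can be sketched in base R without any cytometry classes. This is only an illustration under stated assumptions: plain numeric matrices stand in for cytoframes, the figure of 33 samples is taken from the 1515 * 33 calculation above, and `sample_then_merge()` is a hypothetical helper, not the code used in CytoExploreR.

```r
# Illustrative values taken from the discussion: 33 cytoframes, 50000 events
n_samples <- 33
target <- 50000

per_sample_floor <- floor(target / n_samples)      # 1515 events per cytoframe
total_floor <- per_sample_floor * n_samples        # 49995, short of the target

per_sample_ceiling <- ceiling(target / n_samples)  # 1516 events per cytoframe
total_ceiling <- per_sample_ceiling * n_samples    # 50028, slightly over

# Hypothetical helper (an assumption, not a CytoExploreR function): each
# element of `samples` is a numeric matrix with events in rows.
sample_then_merge <- function(samples, display, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Round up so the merged total is at least `display`
  n <- ceiling(display / length(samples))
  sampled <- lapply(samples, function(m) {
    # Keep every event when a sample has fewer than n events
    m[sample.int(nrow(m), min(n, nrow(m))), , drop = FALSE]
  })
  merged <- do.call(rbind, sampled)
  # Second sampling step trims the merged matrix to `display` events
  merged[sample.int(nrow(merged), min(display, nrow(merged))), , drop = FALSE]
}

# Toy data: 33 samples of 5000 events and 2 channels each
set.seed(1)
samples <- replicate(33, matrix(rnorm(10000), ncol = 2), simplify = FALSE)
merged <- sample_then_merge(samples, display = 50000, seed = 56)
```

With this toy data, `nrow(merged)` comes out at exactly 50000, because rounding up per sample leaves 28 surplus events for the second sampling step to discard. The expensive merge only ever sees about 50000 rows rather than every event in the cytoset, which is presumably where the speed-up comes from.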

I have written a new cyto_coerce() function to implement the second approach for now.

system.time({
  cyto_coerce(cs, 
              display = 50000)
})
   user  system elapsed 
   2.65    1.19    3.94 

Perhaps a better approach would be to use the indices directly and remove sampleFilter() completely from cyto_sample(). I will give this a try and report back.

DillonHammill commented 3 years ago

Here is a comparison of cyto_sample() using a sampleFilter versus using sampled row indices:

cf <- flowFrame_to_cytoframe(as(cs, "flowFrame"))

# SAMPLE FILTER
system.time({
  cyto_sample_v1(cf,
                 200000,
                 seed = 56)
})
   user  system elapsed 
   0.22    0.28    0.50 
# SAMPLE INDICES
system.time({
  cyto_sample_v2(cf,
                 200000, 
                 seed = 56)
})
   user  system elapsed 
   0.25    0.24    0.48 

Looks like there are marginal benefits (if any) of using sampled row indices instead of sampleFilter. I will leave cyto_sample() alone for now.
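For anyone curious about the two strategies being compared, here is a rough base R analogue. It is a sketch under assumptions: a plain matrix stands in for the cytoframe, the logical mask only loosely mimics what a filter-based approach does, and none of this reflects how sampleFilter or cyto_sample() are actually implemented.

```r
# Toy stand-in for a cytoframe's expression matrix: 500000 events, 2 channels
set.seed(1)
events <- matrix(rnorm(1e6), ncol = 2)
n <- 200000

# Option 1: build a logical mask over all events, then subset by the mask
set.seed(56)
keep <- logical(nrow(events))
keep[sample.int(nrow(events), n)] <- TRUE
by_filter <- events[keep, , drop = FALSE]

# Option 2: subset directly with the sampled row indices
set.seed(56)
idx <- sample.int(nrow(events), n)
by_index <- events[idx, , drop = FALSE]
```

Both options select the same events when given the same seed. The only difference is ordering: the mask keeps events in their original order, while direct indexing returns them in the sampled order. Either way the work is dominated by copying the selected rows, which fits with the timings above being nearly identical.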

DillonHammill commented 3 years ago

I will close this for now as cyto_coerce() will now be used where possible, particularly in cyto_merge_by(). This should offer substantial speed improvements to cyto_plot() and cyto_gate_draw() once implemented.