Numeric groupBy not returning correct samples

DillonHammill commented 6 years ago

Hi @mikejiang,

I have been trying to use the groupBy argument to split samples prior to gating. My understanding is that a numeric groupBy (n) splits the data every n samples: with 6 samples and groupBy set to 2, there should be 3 groups of 2 samples each (group1 = files 1 & 2, group2 = files 3 & 4, group3 = files 5 & 6). This group assignment should be the same with every run of the gating pipeline, but that does not seem to be the case: the groups change each time. See troubleshooting below:

# fs is a flowSet containing 6 flowFrames labelled Sample1 to Sample6
fs 

# add treatment info to pData - 3 groups A, B and C
pData(fs)$Treatment <- c("A","A","B","C","B","C")

# add samples to GatingSet
gs <- GatingSet(fs)

# to report the grouping I wrote a preprocessing function which prints the sampleNames of the 
# groups (fs) to a global option called "samples"
ppTest <- function(fs, gs, gm, channels = NA, groupBy = NA, isCollapse = NA, ...) {
  # Append this group's sample names to the global "samples" option
  options("samples" = c(getOption("samples"), list(sampleNames(fs))))
}
registerPlugins(fun = ppTest, methodName = 'ppTest', dep=NA, "preprocessing")

# Set "samples" option to empty list()
options("samples" = list())

# gate using add_pop API
row <- add_pop(gs,
               alias = "Test",
               pop = "+",
               dims = "FSC-A,SSC-A",
               parent = "root",
               gating_method = "flowClust",
               gating_args = "K=2,target=c(50000,50000)" ,
               groupBy = 2,
               collapseDataForGating = TRUE,
               preprocessing_method = "ppTest")

# printing "samples" should return a list of length 3 (1 element per group) with 2 sample names
# each (e.g. Sample1 & Sample2, Sample3 & Sample4, Sample5 & Sample6) but this is not the case...
getOption("samples")

# Reset "samples" option
options("samples" = list())

# Remove gate
Rm("Test", gs)

# Re-run the add_pop() call above, then print "samples" again
getOption("samples")

# This output is different every time add_pop is run - grouping is not consistent...

Grouping using pData column names e.g. "Treatment" performs as expected with every run.

I think removing sample() from this line will fix the problem: https://github.com/RGLab/openCyto/blob/26b062b09db7a68544cd29bd1799a06e8773435f/R/preprocessing-method.R#L67
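
A minimal sketch, in base R only, of why a `sample()` call around the group labels would produce the behaviour described above. The variable names are hypothetical and this is not the openCyto source, just an assumed shape of the logic for 6 samples and groupBy = 2:

```r
# Hypothetical stand-ins for sampleNames(fs) and the numeric groupBy value
samples <- paste0("Sample", 1:6)
groupBy <- 2

# One label per sample: 1 1 2 2 3 3
labels <- rep(seq_len(length(samples) / groupBy), each = groupBy)

# Deterministic: consecutive samples share a group on every run
fixed_groups <- split(samples, labels)

# Shuffled: group sizes stay at 2, but membership depends on the RNG state
random_groups <- split(samples, sample(labels))

print(fixed_groups)
print(random_groups)
```

Without the shuffle, the first group is always Sample1 and Sample2; with it, only the group sizes are guaranteed.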

Dillon

mikejiang commented 6 years ago

Randomly grouping samples was the intended behavior as far as I can see, based on the code as written (not by me, btw). If you want consistency, just set the seed. If you want the grouping to be meaningful in the context of your data, assign the group column to pData as a study variable and group by that.
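
Both suggestions can be sketched in base R. The data frame below is a hypothetical stand-in for pData(fs), and the shuffled labels assume the grouping draws from R's random number generator via sample():

```r
# Hypothetical stand-in for pData(fs) from the example above
pd <- data.frame(
  name = paste0("Sample", 1:6),
  Treatment = c("A", "A", "B", "C", "B", "C"),
  stringsAsFactors = FALSE
)
labels <- rep(1:3, each = 2)

# Option 1: fix the seed so the shuffled labels repeat across runs
set.seed(1)
run1 <- sample(labels)
set.seed(1)
run2 <- sample(labels)
identical(run1, run2)  # TRUE

# Option 2: group by a study variable, which needs no randomness
by_treatment <- split(pd$name, pd$Treatment)
print(by_treatment)
```

With add_pop(), the second option corresponds to passing groupBy = "Treatment", which the original report found to behave consistently.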

gfinak commented 6 years ago

I can't see why that random splitting behavior would be desired. We should at least document it: @jacobpwagner

jacobpwagner commented 6 years ago

I'll definitely flesh out the doc, but should we change the default behavior to what most people would expect? The inline comment even says "split by every N samples" rather than "split into random samples of size N".

gfinak commented 6 years ago

Okay, so I'm not sure where that behavior came from but I'll go with the intent of the inline comment. If it says split every nth sample then I agree we should change it to do that.

jacobpwagner commented 6 years ago

Yeah, that is actually also what the documentation says indirectly through the doc for gtMethod, which ppMethod extends. Updated in d79f31408b06d9f2c99e171d8ce2c2df7de7180e.

DillonHammill commented 6 years ago

Thanks.

RGLab / openCyto

Numeric groupBy not returning correct samples #187