RGLab / flowCore

Core flow cytometry infrastructure
43 stars 25 forks source link

checking colnames order consistency in [[<- method of flowSet #153

Open mikejiang opened 5 years ago

mikejiang commented 5 years ago

Describe the bug It is somewhat related to the colnames swapping of flowFrame described in #152 Basically [[<- currently doesn't enforce the order of colnames of replacement flowFrame to be identical to the colnames slot of flowSet. This leads to the inconsistent colnames among frames and violates the expectations of flowSet class, in consequence will cause the erroneous operations (silently) on flowSet , e.g. subsetting columns by integers fs[, j], or updating channels colnames<-

This will have more complications for ncdfFlowSet where there is additional origColnames to keep track of the data layout in h5 in which col order can't be changed(efficiently through partial IO) physically once written. To Reproduce

 data(GvHD)
 fs <- GvHD[1:2]
 fr <- GvHD[[1]]
 chnls <- c("A", "B")
 colnames(fs)[c(1,3)] <- chnls
 #swap cols
 colnames(fr)[c(1,3)] <- rev(chnls)
 fs[[1]] <- fr #this should have failed
 colnames(fs)[1:3] #colnames slot remain the original order
[1] "A"     "SSC-H" "B"    
 colnames(fs[[1]])[1:3] #the individual frame has different order
[1] "B"     "SSC-H" "A"    

Expected behavior have the more strict order check in [[<- to ensure all the colnames are identical. The better solution is to eliminate colnames slot from flowSet (and ncdfFlowSet) entirely because this redundant slot serves no good purpose but only creating potential data integrity troubles (typically confusing and hard to troubleshoot) down the road.

jacobpwagner commented 5 years ago

I support eliminating the colnames slot for flowSet for the reasons you mentioned, even if it is a little bit of a hassle to make required changes to downstream methods.

mikejiang commented 5 years ago

Just a further thought, flowFrame stores colnames in both exprs (as colnames attribute) and parameters, and they are forced to be identical including the same order, which means they are redundant and could potentially cause discrepancy when not synced properly (e.g. https://github.com/RGLab/flowCore/blob/trunk/tests/testthat/test-colnames.R#L8-L10).

Right now it doesn't pose an immediate risk as long user code sticks to the official public APIs, which pretty much take care of syncing and guard against such discrepancy. But it is worth to think about the alternative cleaner data structure design by stripping colnames attribute from exprs to reduce the sources of colnames info (there is yet another copy residing at $PnN keyword ...)