drisso / SingleCellExperiment

Clone of the Bioconductor repository for the SingleCellExperiment package, see https://bioconductor.org/packages/devel/bioc/html/SingleCellExperiment.html for the official development version.

Methods for splitting SingleCellExperiment objects #55

Open jma1991 opened 4 years ago

jma1991 commented 4 years ago

Is there scope to define splitColData and splitRowData methods for the SingleCellExperiment class?

I am working with a rather large SingleCellExperiment object and I often find myself needing to split the object into a list of smaller objects for pre-processing based on either the column or row data.

This can obviously be done with the following:

# Split by column data
var <- colData(sce)$variable
sce <- lapply(var, function(x) sce[, colData(sce)$variable == x])

# Split by row data
var <- rowData(sce)$variable
sce <- lapply(var, function(x) sce[rowData(sce)$variable == x, ])

However, I've found this approach to be slower than using a for-loop with pre-allocation (e.g. similar to the code already in the splitAltExps function):

splitColData <- function(x, f) {
  # Column indices grouped by the levels of f
  i <- split(seq_along(f), f)
  # Pre-allocate the output list, one slot per level
  v <- vector(mode = "list", length = length(i))
  names(v) <- names(i)
  for (n in names(i)) {
    v[[n]] <- x[, i[[n]]]
  }
  return(v)
}

If there is a need for these methods I can submit a pull-request? If not, it would be super helpful if you could advise what is the most robust and efficient method for splitting SCE objects. Thank you.

LTLA commented 4 years ago

However, I've found this approach to be slower than using a for-loop with pre-allocation (e.g. similar to the code already in the splitAltExps function):

Well, yes, that's because you're looping over every element of var rather than its unique levels.
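To illustrate the difference, here is a minimal sketch in base R. It uses a plain matrix as a stand-in for an SCE (an assumption made only so the snippet runs without Bioconductor), and `mat` and `var` are made-up names and data. Looping over `var` itself produces one subset per cell, while looping over its unique levels produces one subset per group.

```r
# Toy stand-in for an SCE: a plain matrix with 6 "cells" as columns.
mat <- matrix(seq_len(12), nrow = 2,
              dimnames = list(c("g1", "g2"), paste0("cell", 1:6)))
var <- c("a", "b", "a", "c", "b", "a")  # hypothetical grouping variable

# Looping over every element: one subset per cell, with repeated groups.
per_element <- lapply(var, function(x) mat[, var == x, drop = FALSE])

# Looping over the unique levels: one subset per group.
lev <- unique(var)
per_level <- lapply(setNames(lev, lev),
                    function(x) mat[, var == x, drop = FALSE])

length(per_element)  # 6, one entry per cell
length(per_level)    # 3, one entry per level
```

With a real SCE, the number of redundant subsetting operations grows with the number of cells, which would likely account for most of the slowdown.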

If there is a need for these methods I can submit a pull-request?

Possibly, but this would likely go to the SummarizedExperiment repository rather than this one. Any such methods should benefit all SE subclasses, there isn't any reason that it would just be useful for SCEs.

Tagging @mtmorgan: does this functionality already exist in SE? S4Vectors::split() kind of works, but it's hard to remember that it splits by row instead of column in an SE. (Also, I just noticed SCE doesn't implement extractROWS properly: need to fix.)

LTLA commented 4 years ago

bc220cab41b7112347dda5e094ebb2a9c987fb23 fixes the split() issue, so a hypothetical splitByRow() would be as easy as:

split(sce, rowData(sce)$variable)

lambdamoses commented 2 months ago

Any update on this? Seurat has the SplitObject function. I'm asking because I'm writing a method to split a SpatialFeatureExperiment object by geometry, so that, for instance, cells in different pieces of tissue can be split into different SFE objects. I want to keep the style consistent with any existing split function in SCE and SpatialExperiment that splits by columns rather than rows.

LTLA commented 2 months ago

No, it seems I clobbered my own PR (linked above) and also no one cared about it.

Perhaps consider making a PR to the SummarizedExperiment repo with something like:

# Completely untested!
setGeneric("splitByCol", function(x, f, ...) standardGeneric("splitByCol"))

setMethod("splitByCol", "SummarizedExperiment", function(x, f, ...) {
    f <- as.factor(f)
    by.levels <- split(seq_along(f), f)
    for (i in seq_along(by.levels)) {
        by.levels[[i]] <- x[, by.levels[[i]], drop=FALSE]
    }
    by.levels
})

Don't have the time/will to do it myself but it seems useful enough that a PR would warrant some consideration.
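For anyone wanting to try the idea before a generic exists, a minimal usage sketch follows. It assumes the SummarizedExperiment package is installed. The toy data, the `batch` column, and the helper name `split_cols` are all hypothetical, and the helper is a plain function rather than the S4 method proposed above.

```r
library(SummarizedExperiment)

# Hypothetical toy data: 3 genes x 4 cells, with a made-up 'batch' column.
counts <- matrix(seq_len(12), nrow = 3,
                 dimnames = list(paste0("g", 1:3), paste0("cell", 1:4)))
se <- SummarizedExperiment(
    assays = list(counts = counts),
    colData = DataFrame(batch = c("a", "b", "a", "b"),
                        row.names = colnames(counts))
)

# Same idea as the method above, written as a plain function
# (split_cols is a hypothetical name, not an existing API).
split_cols <- function(x, f) {
    # Column indices grouped by the levels of f
    idx <- split(seq_len(ncol(x)), as.factor(f))
    # One column-subsetted object per level
    lapply(idx, function(j) x[, j, drop = FALSE])
}

by_batch <- split_cols(se, se$batch)
names(by_batch)
vapply(by_batch, ncol, integer(1))
```

Because the helper only relies on `ncol()` and two-dimensional subsetting, it should in principle work for SE subclasses such as SCE as well, though that is not tested here.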

lambdamoses commented 2 months ago

I renamed the split function for SFE to splitByCol and added a generic for it in the SFE package, to avoid confusion with split, which splits by row for SummarizedExperiment. I may do a PR to SummarizedExperiment later, but I don't have the time before the Bioc2024 conference.