drisso / SingleCellExperiment

Clone of the Bioconductor repository for the SingleCellExperiment package, see https://bioconductor.org/packages/devel/bioc/html/SingleCellExperiment.html for the official development version.

Secondary objects for alternative feature types #30

Closed: LTLA closed this issue 5 years ago

LTLA commented 5 years ago

Motivation

Our current treatment of spike-ins is to store them in the same matrix as the endogenous genes, with isSpike markings to indicate that a row is a spike-in and should not be used in certain downstream analyses. This set-up is a pain to write code for because developers need to remember to subset the matrix to remove these spike-ins prior to, e.g., dimensionality reduction and clustering. And I've finally had enough.

Proposal

I propose to add a secondary slot for objects with synchronized subsetting by column with the rest of the SCE. This secondary slot contains a list of... anythings, but most likely other S(C)Es, which have the same number of columns as the main SCE. This allows the main SCE assays to be reserved for endogenous genes, while preserving spike-in information elsewhere. Downstream methods can then use the primary gene expression matrix "as is" without modification.

Methods reliant on spike-ins can simply add another argument that specifies the secondary object to use. (Probably an "ERCC" default for such methods would be most appropriate in 99% of cases.) The advantage of this approach is that we can now add spike-in specific metadata fields to the rowData of these secondary SEs, e.g., concentration. This is not currently possible without padding the required fields with NAs for the endogenous genes.

Incidentally, this secondary slot can also be used for FACS intensities or antibody or CRISPR tags from CITE-seq-like technologies. If one needs to use this information, it should be as simple as doing something like secondary(sce, "Ab") and passing that to downstream methods.
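To make the synchronized-subsetting idea concrete, here is a minimal sketch in base R. It is not the package's implementation: the class name ToySCE is hypothetical, plain matrices stand in for S(C)Es, and the secondary() getter is only assumed to look like the secondary(sce, "Ab") call mentioned above.

```r
library(methods)

# Hypothetical toy container (illustration only): 'main' holds endogenous
# genes, 'secondary' is a named list of objects with matching columns.
setClass("ToySCE",
    slots=c(main="matrix", secondary="list"),
    validity=function(object) {
        ok <- vapply(object@secondary,
            function(s) ncol(s) == ncol(object@main), logical(1))
        if (all(ok)) TRUE else "secondary objects must match ncol(main)"
    }
)

# Hypothetical getter, named after the secondary(sce, "Ab") idea.
secondary <- function(x, name) x@secondary[[name]]

# Row subsetting only touches the main matrix; column subsetting is
# propagated to every secondary object so the cells stay in sync.
setMethod("[", "ToySCE", function(x, i, j, ..., drop=TRUE) {
    if (!missing(i)) x@main <- x@main[i, , drop=FALSE]
    if (!missing(j)) {
        x@main <- x@main[, j, drop=FALSE]
        x@secondary <- lapply(x@secondary, function(s) s[, j, drop=FALSE])
    }
    x
})

genes <- matrix(1:50, nrow=10, ncol=5)   # 10 genes x 5 cells
ercc <- matrix(1:15, nrow=3, ncol=5)     # 3 spike-ins x 5 cells
sce <- new("ToySCE", main=genes, secondary=list(ERCC=ercc))

sub <- sce[, 1:2]
dim(sub@main)
dim(secondary(sub, "ERCC"))
```

Downstream code can then use sub@main as is, without first filtering out spike-in rows.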

Transition

It should be simple to transition to this new scheme as we are not modifying existing slots.

Functions using spike-in information are now expected to extract this information from the secondary slot. AFAIK, this is mostly scran as other packages tend not to care about spike-ins.
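As a rough illustration of what such a function might look like after the transition, the sketch below takes an argument naming the secondary object, with "ERCC" as the default. Everything here is an assumption for illustration: a plain list stands in for the SCE, and spikeSizeFactors() is a hypothetical name, not a scran function.

```r
# Hypothetical stand-in for an SCE: a main count matrix plus a named list
# of secondary matrices sharing the same columns (cells).
sce <- list(
    main=matrix(1:20, nrow=4, ncol=5),
    secondary=list(ERCC=matrix(1:10, nrow=2, ncol=5))
)

# Hypothetical spike-in-based size factors. 'spikes' names the secondary
# object to use, so the main matrix is never touched.
spikeSizeFactors <- function(x, spikes="ERCC") {
    counts <- x$secondary[[spikes]]
    if (is.null(counts)) {
        stop("no secondary object named '", spikes, "'")
    }
    totals <- colSums(counts)
    totals / mean(totals)   # scale so the factors average to 1
}

sf <- spikeSizeFactors(sce)
```

A caller with a differently named spike-in set would pass, e.g., spikes="SIRV", assuming an entry with that name exists.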

Alternatives

We did consider using MAEs (MultiAssayExperiment objects), which provide a more symmetric schema to represent the different features (i.e., there is no "primary" SCE, all feature types are given equal billing). However, it's a big jump in the interface from SCEs, and it would require a lot of effort to bring downstream packages in line so that the MAE is easily usable. Currently, its use requires one to experiment() out the SCE of interest, apply the desired function on it, and then experiment<- it back into the MAE.

Our proposed change avoids this, at least for the genes in the primary SCE. The features in the secondary slot will be subject to this extract-operate-reinsert cycle until downstream functions are developed to explicitly support them.
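The extract-operate-reinsert cycle can be sketched as follows. This is only an illustration under stated assumptions: a plain named list stands in for the container, and normalize() is a hypothetical downstream function. It does not use the MultiAssayExperiment API.

```r
# A named list stands in for a multi-experiment container (assumption).
container <- list(
    genes=matrix(1:20, nrow=4, ncol=5),
    ERCC=matrix(1:10, nrow=2, ncol=5)
)

# Hypothetical downstream function: scale each cell (column) to sum to 1.
normalize <- function(m) t(t(m) / colSums(m))

tmp <- container[["ERCC"]]    # extract the experiment of interest
tmp <- normalize(tmp)         # operate on it
container[["ERCC"]] <- tmp    # reinsert the result
```

With the proposed layout, the first and last steps would only be needed for the secondary features, not for the primary gene matrix.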

Notifications

Tagging @amcdavid and @robertamezquita to continue our discussion on Slack; and @mtmorgan, @hpages and @lwaldron for suggestions on alternatives I might have missed.

LTLA commented 5 years ago

Nudge. Any comments? If none, I'm just going to go ahead and do it.

drisso commented 5 years ago

Aren’t all the assays supposed to have the same dimensions? That wouldn’t work for spike ins.. (Edit: this was in response to a comment that is not here anymore?)

But it does seem an overcomplicated solution... do we really need a secondary slot? I feel like SCE is already quite complex as is...

Couldn’t we use the internal colData for this as we do for size factors? That doesn’t allow us to store spike-in metadata, but who really needs it?

LTLA commented 5 years ago

Couldn’t we use the internal colData for this as we do for size factors? That doesn’t allow to store spike ins metadata, but who really needs them?

LTLA commented 5 years ago

We can get away with not adding an extra slot. Behold the dark arts of S4 hacking:

library(SummarizedExperiment)
constructor <- setClass("TransposedSE", slots=c(se="SummarizedExperiment"))

setMethod("length", "TransposedSE", function(x) ncol(x@se))

setMethod("[", "TransposedSE", function(x, i, j, ..., drop=TRUE) {
    x@se <- x@se[,i]
    x
})

# Probably missing a few methods, e.g., `[<-` and `c`.

We can then wrap our S(C)E inside this class:

example(SummarizedExperiment, echo=FALSE)
tse <- constructor(se=se)
tse[1:2]

And then, stuff them inside int_colData as an extra column:

int_coldata <- DataFrame(spikes=I(tse))
int_coldata[1:2,,drop=FALSE]

... and provide getters and setters to pull out the S(C)Es. And users would be none the wiser.

Cue evil cackling.

I guess @hpages would probably have some opinion on this abuse of the S4 machinery.

amcdavid commented 5 years ago

This is just a couple of quick thoughts but I'm wondering

1) how extensible this is? Ie, will it handle citeseq + spikeins?
2) how much of a problem transposition actually would be if we put into int_coldata? We can always make a class that abstracts operations that require knowledge of the shape of the array.
3) Should we extend/inherit from SingleCellExperiment rather than changing its API? This would be more elegant from a downstream pov, ie finding HVG would have one method if we are the spikeins subclass and another if we aren't. But see item 1) regarding extensibility.

LTLA commented 5 years ago

1) how extensible this is? Ie, will it handle citeseq + spikeins?

Presumably it would handle any number of secondary features. We just keep on adding new columns. If you did a mega experiment with cite-seq, CRISPR tags, spike-ins, we would just have multiple columns in the internal coldata with one corresponding to each new feature type.

2) how much of a problem transposition actually would be if we put into int_coldata?

It's not a problem for end-users, only for the SCE developers. But it's a symptom of other problems, namely that we don't consider assays that are shaped such that the rows are the samples. This would effectively require development of some kind of transposed Assays class to handle multiple assays in transposed form. At this point, it would be easier to just wrap around an SE as above.
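A base-R sketch of the "one column per feature type" layout, and of why transposition comes up: a per-cell table has one row per cell, so each feature matrix has to be stored with cells as rows. This is an assumption-laden illustration using a plain data.frame with I()-protected matrices, not the int_colData machinery or the TransposedSE wrapper above.

```r
ercc <- matrix(1:15, nrow=3, ncol=5)   # 3 spike-ins x 5 cells
ab <- matrix(1:10, nrow=2, ncol=5)     # 2 antibody tags x 5 cells

# One column per secondary feature type; each holds a transposed matrix
# (cells x features) so that its rows line up with the rows of the table.
cell_data <- data.frame(ERCC=I(t(ercc)), Ab=I(t(ab)))

# Subsetting cells (rows) subsets every feature type at once.
sub <- cell_data[1:2, , drop=FALSE]
dim(sub[["ERCC"]])
dim(sub[["Ab"]])

# Transposing back recovers the usual features x cells orientation.
t(unclass(sub[["ERCC"]]))
```

Adding another feature type is just another column, which is the extensibility argument from item 1.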

3) Should we extend/inherit from SingleCellExperiment rather than changing its API?

Not sure that's necessary. Even for datasets with spike-ins, you can always find HVGs using two methods (i.e., with the spike-ins or ignoring them). That seems to be a decision to be made during the HVG detection step rather than during the object construction and class choice step.

LTLA commented 5 years ago

Look, I'll do this and make a PR, and people can tell me if it helps or not.

LTLA commented 5 years ago

Check out #32.