Closed: LTLA closed this issue 5 years ago.
Nudge. Any comments? If none, I'm just going to go ahead and do it.
Aren’t all the assays supposed to have the same dimensions? That wouldn’t work for spike-ins. (Edit: this was in response to a comment that is not here anymore?)
But it does seem an overcomplicated solution... do we really need a secondary slot? I feel like SCE is already quite complex as is...
Couldn’t we use the internal colData for this, as we do for size factors? That doesn’t allow storing spike-in metadata, but who really needs it?
We can get away with not adding an extra slot. Behold the dark arts of S4 hacking:
library(SummarizedExperiment)
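# Wrap an SE so that its columns (cells) behave like the elements of a vector.
constructor <- setClass("TransposedSE", slots=c(se="SummarizedExperiment"))
# The length of the wrapper is the number of columns of the wrapped SE.
setMethod("length", "TransposedSE", function(x) ncol(x@se))
# Subsetting the wrapper subsets the columns of the wrapped SE.
setMethod("[", "TransposedSE", function(x, i, j, ..., drop=TRUE) {
    x@se <- x@se[,i]
    x
})
# Probably missing a few methods, e.g., `[<-` and `c`.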
We can then wrap our S(C)E inside this class:
example(SummarizedExperiment, echo=FALSE)
tse <- constructor(se=se)
tse[1:2]
And then stuff them inside int_colData as an extra column:
int_coldata <- DataFrame(spikes=I(tse))
int_coldata[1:2,,drop=FALSE]
... and provide getters and setters to pull out the S(C)Es. And users would be none the wiser.
Cue evil cackling.
I guess @hpages would probably have some opinion on this abuse of the S4 machinery.
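For what it's worth, the `c` and `[<-` methods flagged as missing above might look something like the sketch below. To keep it runnable without Bioconductor, it assumes a plain matrix in place of the SummarizedExperiment; the TransposedMatrix class and its mat slot are hypothetical stand-ins, not part of any package.

```r
library(methods)

# Hypothetical stand-in for TransposedSE: wraps a plain matrix and treats
# each column as one element of a vector-like object.
setClass("TransposedMatrix", slots=c(mat="matrix"))

# Length is the number of columns of the wrapped matrix.
setMethod("length", "TransposedMatrix", function(x) ncol(x@mat))

# Subsetting the wrapper subsets the columns of the wrapped matrix.
setMethod("[", "TransposedMatrix", function(x, i, j, ..., drop=TRUE) {
    x@mat <- x@mat[, i, drop=FALSE]
    x
})

# Combining wrappers column-binds the wrapped matrices.
setMethod("c", "TransposedMatrix", function(x, ...) {
    others <- lapply(list(...), function(y) y@mat)
    x@mat <- do.call(cbind, c(list(x@mat), others))
    x
})

# Subset-assignment overwrites the selected columns with those of `value`,
# which is assumed to be another TransposedMatrix with matching dimensions.
setReplaceMethod("[", "TransposedMatrix", function(x, i, j, ..., value) {
    x@mat[, i] <- value@mat
    x
})

# Usage: four "cells" stored as the columns of a 3 x 4 matrix.
m <- matrix(1:12, nrow=3)
tm <- new("TransposedMatrix", mat=m)
both <- c(tm, tm[1:2])   # six columns in total
tm2 <- tm
tm2[1] <- tm[4]          # first column now holds the values of the fourth
```

For the real TransposedSE, the same pattern would presumably apply with cbind() on the wrapped SEs, but that part is untested here.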
This is just a couple of quick thoughts, but I'm wondering:
1) How extensible is this? I.e., will it handle CITE-seq + spike-ins?
2) How much of a problem would transposition actually be if we put this into int_coldata? We can always make a class that abstracts operations that require knowledge of the shape of the array.
3) Should we extend/inherit from SingleCellExperiment rather than changing its API? This would be more elegant from a downstream point of view, i.e., finding HVGs would have one method if we are the spike-ins subclass and another if we aren't. But see item 1) regarding extensibility.
(Sent by email in reply to Aaron Lun's comment of Tue, Jul 9, 2019, 11:11 PM, which appears in full above.)
1) how extensible this is? Ie, will it handle citeseq + spikeins?
Presumably it would handle any number of secondary features. We just keep on adding new columns. If you did a mega experiment with CITE-seq, CRISPR tags, and spike-ins, we would just have multiple columns in the internal coldata, one for each new feature type.
2) how much of a problem transposition actually would be if we put into int_coldata?
It's not a problem for end-users, only for the SCE developers. But it's a symptom of other problems, namely that we don't consider assays that are shaped such that the rows are the samples. This would effectively require development of some kind of transposed Assays class to handle multiple assays in transposed form. At this point, it would be easier to just wrap around an SE as above.
3) Should we extend/inherit from SingleCellExperiment rather than changing its API?
Not sure that's necessary. Even for datasets with spike-ins, you can always find HVGs using two methods (i.e., with the spike-ins or ignoring them). That seems to be a decision to be made during the HVG detection step rather than during the object construction and class choice step.
Look, I'll do this and make a PR, and people can tell me if it helps or not.
Check out #32.
Motivation
Our current treatment of spike-ins is to store them in the same matrix as the endogenous genes, with isSpike markings to indicate that the row is a spike-in and should not be used in certain downstream analyses. This set-up is a pain to write code for because developers need to remember to subset the matrix to remove these spike-ins prior to, e.g., dimensionality reduction and clustering. And I've finally had enough.
Proposal
I propose to add a secondary slot for objects with synchronized subsetting by column with the rest of the SCE. This secondary slot contains a list of... anythings, but most likely other S(C)Es, which have the same number of columns as the main SCE. This allows the main SCE assays to be reserved for endogenous genes, while preserving spike-in information elsewhere. Downstream methods can then use the primary gene expression matrix "as is" without modification.
Methods reliant on spike-ins can simply add another argument that specifies the secondary object to use. (Probably an "ERCC" default for such methods would be most appropriate in 99% of cases.) The advantage of this approach is that we can now add spike-in specific metadata fields to the rowData of these secondary SEs, e.g., concentration. This is not currently possible without padding the required fields with NAs for the endogenous genes.
Incidentally, this secondary slot can also be used for FACS intensities or antibody or CRISPR tags from CITE-seq-like technologies. If one needs to use this information, it should be as simple as doing something like secondary(sce, "Ab") and passing that to downstream methods.
Transition
It should be simple to transition to this new scheme as we are not modifying existing slots.
- partitionBySecondaryFeatures(), which will split a single SCE into a main SCE and secondary SCEs.
- isSpike() and isSpike()<-.
- type= option in sizeFactors(), which is now largely unnecessary, as the only reason for its existence was to support alternative size factors for spike-ins.
Functions using spike-in information are now expected to extract this information from the secondary slot. AFAIK, this is mostly scran as other packages tend not to care about spike-ins.
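To make the proposal a bit more concrete, here is a rough base-R sketch of a secondary list whose entries are subsetted by column in sync with the main matrix, plus a partitioning helper in the spirit of partitionBySecondaryFeatures(). Everything in it is an assumption for illustration: ToySCE, partitionToy() and the function signatures are hypothetical, plain matrices stand in for S(C)Es, and the real implementation may look quite different.

```r
library(methods)

# Hypothetical toy class: `counts` stands in for the main SCE assay and
# `secondary` is a named list of matrices sharing the same columns (cells).
setClass("ToySCE", slots=c(counts="matrix", secondary="list"))

# Every secondary entry must have as many columns as the main matrix.
setValidity("ToySCE", function(object) {
    nc <- vapply(object@secondary, ncol, integer(1))
    if (any(nc != ncol(object@counts))) {
        return("secondary entries must have the same number of columns as 'counts'")
    }
    TRUE
})

# Row subsetting touches only the main matrix; column subsetting is applied
# to the main matrix and to every secondary entry.
setMethod("[", "ToySCE", function(x, i, j, ..., drop=TRUE) {
    if (!missing(i)) {
        x@counts <- x@counts[i, , drop=FALSE]
    }
    if (!missing(j)) {
        x@counts <- x@counts[, j, drop=FALSE]
        x@secondary <- lapply(x@secondary, function(s) s[, j, drop=FALSE])
    }
    x
})

# Hypothetical getter mirroring the secondary(sce, "Ab") idea.
secondary <- function(x, name) x@secondary[[name]]

# Hypothetical helper: rows labelled "endogenous" stay in the main matrix,
# all other rows are moved to one secondary entry per feature type.
partitionToy <- function(mat, feature_type) {
    is_main <- feature_type == "endogenous"
    sec <- lapply(split(which(!is_main), feature_type[!is_main]),
        function(idx) mat[idx, , drop=FALSE])
    new("ToySCE", counts=mat[is_main, , drop=FALSE], secondary=sec)
}

# Usage: three genes and two ERCC spike-ins measured in four cells.
mat <- matrix(1:20, nrow=5,
    dimnames=list(c("g1", "g2", "g3", "ERCC-1", "ERCC-2"), paste0("cell", 1:4)))
types <- c("endogenous", "endogenous", "endogenous", "ERCC", "ERCC")
sce <- partitionToy(mat, types)
sub <- sce[, 1:2]
dim(secondary(sub, "ERCC"))
```

Downstream code can then use the main matrix directly and only reaches for secondary(sub, "ERCC") when it actually needs the spike-ins.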
Alternatives
We did consider using MAEs, which provide a more symmetric schema to represent the different features (i.e., there is no "primary" SCE; all feature types are given equal billing). However, it's a big jump in the interface from SCEs, and it would require a lot of effort to bring downstream packages in line so that the MAE is easily usable. Currently, its use requires one to experiment() out the SCE of interest, apply the desired function on it, and then experiment<- it back into the MAE.
Our proposed change avoids this, at least for the genes in the primary SCE. The features in the secondary slot will be subject to this extract-operate-reinsert cycle until downstream functions are developed to explicitly support them.
Notifications
Tagging @amcdavid and @robertamezquita to continue our discussion on Slack; and @mtmorgan, @hpages and @lwaldron for suggestions on alternatives I might have missed.