Disentangling assays and rowData

lgeistlinger commented 6 years ago

Following up on previous comments (https://github.com/Bioconductor/RaggedExperiment/issues/17):

RaggedExperiment nicely pretends to be a SummarizedExperiment with regard to assays, rowRanges, and colData. However, RaggedExperiment does not seem to support a concept such as rowData for annotation to its rowRanges.

Essentially when you do mcols<- or rowData<-, you're adding to "number of assays". Since each column in mcols is an "assay".

I wonder whether this can be modified. When constructing a RaggedExperiment from a GRangesList, would it be possible to distinguish mcols that become assays and mcols that are annotation? (Currently this turns even non-numeric mcols into assays)

General motivation for being able to annotate additional row information is as for a SummarizedExperiment - however, in my specific use case I'm working with CNV calls where I only want the copy number to become an assay, while I want to keep additional columns as annotation to each individual CNV call (such as probe identifiers, quality measures, etc).

LiNk-NY commented 6 years ago

We could add another structure that will indicate what columns are to be converted to assays and what columns will be shown in rowData. Any thoughts on this @mtmorgan ?

mtmorgan commented 6 years ago

Can this issue be updated with an illustrative example and desired output? The verbal description is hard to follow and I do not know whether the example in the original issue is still relevant.

lgeistlinger commented 6 years ago

Let's consider a typical output from SNP-based CNV callers stored as a data.frame named my.cnv.calls:

> my.cnv.calls
chr  start  end     sample.id  state   num.probes  start.probe   end.probe
chr1 16947  45013   NE001423   3       53          AX-100224358  AX-100363929
chr1 36337  67130   NE001426   1       77          AX-100796878  AX-100422118
chr1 16947  36337   NE001428   3       34          AX-100224358  AX-100796878
chr1 36338  105963  NE001428   0       123         AX-100796878  AX-100828216
chr1 36337  83412   NE001534   3       101         AX-100796878  AX-100765659
chr1 36337  83412   NE001648   3       101         AX-100796878  AX-100765659

We group the calls by sample ID, resulting in a GRangesList. Each element of the list corresponds to a sample, and contains the genomic coordinates of the CNV calls for this sample (along with the copy number state in the state metadata column).

> grl <- makeGRangesListFromDataFrame(my.cnv.calls,
    split.field="sample.id", keep.extra.columns=TRUE)
> grl
## GRangesList object of length 5:
## $NE001423
## GRanges object with 1 range and 4 metadata columns:
##        seqnames  ranges      strand |     state  num.probes  start.probe  end.probe
##           <Rle>  <IRanges>    <Rle> | <integer>  <integer>   <character>  <character>
##    [1]     chr1  16947-45013      * |         3       53     AX-100224358 AX-100363929 
##
## ...
## <4 more elements>

Let's create some colData for the samples:

age <- sample(20:80, 5)
sex <- sample(c("M", "F"), 5, replace=TRUE)
cdat <- DataFrame(age=age, sex=sex)

Now when creating a RaggedExperiment from the GRangesList, all mcols become assays

> ra <- RaggedExperiment(grl, colData=cdat)
> ra
class: RaggedExperiment 
dim: 6 5 
assays(4): state num.probes start.probe end.probe
rownames: NULL
colnames(5): NE001423 NE001426 NE001428 NE001534 NE001648
colData names(2): age sex

However, I would like to determine which mcols become assays (and which are annotation and rather become rowData).

Desired output:

> ra <- RaggedExperiment(grl, colData=cdat, assay.columns="state")
> ra
class: RaggedExperiment 
dim: 6 5 
assays(1): state 
rownames: NULL
colnames(5): NE001423 NE001426 NE001428 NE001534 NE001648
colData names(2): age sex 
rowData names(3): num.probes start.probe end.probe

mtmorgan commented 6 years ago

I don't want to support this. The implication is either that the software is 'smart' enough to fill rowData across ranges that are present in some but not all samples, or that the data is not actually ragged after all and all rows are present in all samples. It seems too arbitrary to know how to solve the first case, and the second case does not represent a ragged experiment.

lgeistlinger commented 6 years ago

There is no need to fill rowData. Upon construction of a RaggedExperiment common or intersecting ranges across samples are not resolved. As in sparseAssay, the number of rows corresponds to the total number of ranges across all samples with duplicates being allowed. As such, there is a clear 1:1 correspondence between the rows of the assay and the rowData.

lwaldron commented 6 years ago

@lgeistlinger your desired specifies what the show method would return, but how would the proposed rowData be used in the API? @mtmorgan, I also don't understand why 'smart'-ness would be required.

lgeistlinger commented 6 years ago

Some instances:

filter all ranges that are not satisfying quality measures annotated in rowData, e.g. filter all cnv calls not supported by a sufficient number of probes:

ind <- rowData(ra)$num.probes > 10
ra <- ra[ind,]

extracting all ranges of a particular type annotated in rowData, e.g. subset the cnv calls that are an amplification

ind <- rowData(ra)$type == "amplification"
ra <- ra[ind,]

two-level queries, e.g. get data associated with the start and the end probe of the cnv calls:

gt <- my.snp.data[rowData(ra)$start.probe, "genotype"]
# now do a GWAS

That is useful for approaches that resolve CNVs on the probe level for CNV-phenotype association analysis as e.g. done here: http://zzz.bwh.harvard.edu/plink/cnv.shtml

lgeistlinger commented 6 years ago

Some more thoughts:

The full-blown RaggedExperiment is sparse and of short-living nature. Most of the times, either disjoinAssay or qreduceAssay will be used shortly after creation of the object to summarize assay values in overlapping ranges across samples. Subsequent analysis will then mostly focus on the resulting summarized matrices.

Imho, having rowData adds to the usefulness of the representation and, in effect, extends the object's life span (meaning here the time it's actively used within the analysis).

The two main aspects to this are:

preprocessing steps on the ranges such as filtering, subsetting, etc. based on criteria annotated in the rowData.
resolving particular ranges in more detail based on criteria annotated in the rowData - often summarization as carried out via simplifyDisjoin and simplifyReduce represents a compromise.

Once you have found a number of summarized ranges to be particularly interesting (eg associated with a phenotype) you might want to go back and resolve individual ranges within the summarized ranges in more detail to study the impact of the summarization.

As an example, we had doubt about a particular CNV in this study:

https://www.nature.com/articles/s41598-018-19782-4

Based on SNP data, the region was inferred to be completely deleted (0 copies). Contradictively, we found a gene contained in this region to be highly expressed.

We thus went back to the individual CNV calls and resolved them on SNP level to demonstrate that no SNP contributing to the deletion calls locates directly on the gene, see Figure S3 in here:

https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-018-19782-4/MediaObjects/41598_2018_19782_MOESM1_ESM.pdf

Long story short, often enough you are having characteristics describing your ranges in your full-blown RaggedExperiment that you would like to investigate in more detail and the rowData is the typical default for storing these characteristics given the SummarizedExperiment paradigm.

Bioconductor / RaggedExperiment

Disentangling assays and rowData #18