Open lgeistlinger opened 6 years ago
We could add another structure that will indicate what columns are to be converted to assays
and what columns will be shown in rowData
.
Any thoughts on this @mtmorgan ?
Can this issue be updated with an illustrative example and desired output? The verbal description is hard to follow and I do not know whether the example in the original issue is still relevant.
Let's consider a typical output from SNP-based CNV callers stored as a data.frame
named my.cnv.calls
:
> my.cnv.calls
chr start end sample.id state num.probes start.probe end.probe
chr1 16947 45013 NE001423 3 53 AX-100224358 AX-100363929
chr1 36337 67130 NE001426 1 77 AX-100796878 AX-100422118
chr1 16947 36337 NE001428 3 34 AX-100224358 AX-100796878
chr1 36338 105963 NE001428 0 123 AX-100796878 AX-100828216
chr1 36337 83412 NE001534 3 101 AX-100796878 AX-100765659
chr1 36337 83412 NE001648 3 101 AX-100796878 AX-100765659
We group the calls by sample ID, resulting in a GRangesList
.
Each element of the list corresponds to a sample, and contains the genomic coordinates of the CNV calls for this sample (along with the copy number state in the state
metadata column).
> grl <- makeGRangesListFromDataFrame(my.cnv.calls,
split.field="sample.id", keep.extra.columns=TRUE)
> grl
## GRangesList object of length 5:
## $NE001423
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | state num.probes start.probe end.probe
## <Rle> <IRanges> <Rle> | <integer> <integer> <character> <character>
## [1] chr1 16947-45013 * | 3 53 AX-100224358 AX-100363929
##
## ...
## <4 more elements>
Let's create some colData
for the samples:
age <- sample(20:80, 5)
sex <- sample(c("M", "F"), 5, replace=TRUE)
cdat <- DataFrame(age=age, sex=sex)
Now when creating a RaggedExperiment
from the GRangesList
, all mcols become assays
> ra <- RaggedExperiment(grl, colData=cdat)
> ra
class: RaggedExperiment
dim: 6 5
assays(4): state num.probes start.probe end.probe
rownames: NULL
colnames(5): NE001423 NE001426 NE001428 NE001534 NE001648
colData names(2): age sex
However, I would like to determine which mcols become assays (and which are annotation
and rather become rowData
).
Desired output:
> ra <- RaggedExperiment(grl, colData=cdat, assay.columns="state")
> ra
class: RaggedExperiment
dim: 6 5
assays(1): state
rownames: NULL
colnames(5): NE001423 NE001426 NE001428 NE001534 NE001648
colData names(2): age sex
rowData names(3): num.probes start.probe end.probe
I don't want to support this. The implication is either that the software is 'smart' enough to fill rowData across ranges that are present in some but not all samples, or that the data is not actually ragged after all and all rows are present in all samples. It seems too arbitrary to know how to solve the first case, and the second case does not represent a ragged experiment.
There is no need to fill rowData. Upon construction of a RaggedExperiment
common or intersecting ranges across samples are not resolved. As in sparseAssay
, the number of rows corresponds to the total number of ranges across all samples with duplicates being allowed. As such, there is a clear 1:1 correspondence between the rows of the assay and the rowData.
@lgeistlinger your desired specifies what the show
method would return, but how would the proposed rowData be used in the API? @mtmorgan, I also don't understand why 'smart'-ness would be required.
Some instances:
rowData
,
e.g. filter all cnv calls not supported by a sufficient number of probes:ind <- rowData(ra)$num.probes > 10
ra <- ra[ind,]
rowData
, e.g. subset the
cnv calls that are an amplificationind <- rowData(ra)$type == "amplification"
ra <- ra[ind,]
gt <- my.snp.data[rowData(ra)$start.probe, "genotype"]
# now do a GWAS
That is useful for approaches that resolve CNVs on the probe level for CNV-phenotype association analysis as e.g. done here: http://zzz.bwh.harvard.edu/plink/cnv.shtml
Some more thoughts:
The full-blown RaggedExperiment
is sparse and of short-living nature.
Most of the times, either disjoinAssay
or qreduceAssay
will be used shortly
after creation of the object to summarize assay values in overlapping ranges across
samples.
Subsequent analysis will then mostly focus on the resulting summarized matrices.
Imho, having rowData
adds to the usefulness of the representation and, in effect,
extends the object's life span (meaning here the time it's actively used within the analysis).
The two main aspects to this are:
preprocessing steps on the ranges such as filtering, subsetting, etc. based on criteria
annotated in the rowData
.
resolving particular ranges in more detail based on criteria annotated in
the rowData
- often summarization as carried out via simplifyDisjoin
and
simplifyReduce
represents a compromise.
Once you have found a number of summarized ranges to be particularly interesting (eg associated with a phenotype) you might want to go back and resolve individual ranges within the summarized ranges in more detail to study the impact of the summarization.
As an example, we had doubt about a particular CNV in this study:
https://www.nature.com/articles/s41598-018-19782-4
Based on SNP data, the region was inferred to be completely deleted (0 copies). Contradictively, we found a gene contained in this region to be highly expressed.
We thus went back to the individual CNV calls and resolved them on SNP level to demonstrate that no SNP contributing to the deletion calls locates directly on the gene, see Figure S3 in here:
Long story short, often enough you are having characteristics describing your
ranges in your full-blown RaggedExperiment that you would like to investigate
in more detail and the rowData
is the typical default for storing these
characteristics given the SummarizedExperiment
paradigm.
Following up on previous comments (https://github.com/Bioconductor/RaggedExperiment/issues/17):
RaggedExperiment
nicely pretends to be aSummarizedExperiment
with regard toassays
,rowRanges
, andcolData
. However,RaggedExperiment
does not seem to support a concept such asrowData
for annotation to itsrowRanges
.I wonder whether this can be modified. When constructing a
RaggedExperiment
from aGRangesList
, would it be possible to distinguish mcols that become assays and mcols that are annotation? (Currently this turns even non-numeric mcols into assays)General motivation for being able to annotate additional row information is as for a
SummarizedExperiment
- however, in my specific use case I'm working with CNV calls where I only want the copy number to become an assay, while I want to keep additional columns as annotation to each individual CNV call (such as probe identifiers, quality measures, etc).