Closed lgeistlinger closed 6 years ago
@lgeistlinger Can you confirm your expectations for density=.1?
@mtmorgan Thanks for inquiring. I have clarified on the above example for density=.1
As I understand the procedure, it will first merge individual calls resulting in the regions
[1] chr1 [ 1, 12] [2] chr2 [11, 36]
where the second region consists of 11 individual calls. It will then check on a base-by-base level on regional density, finding the region
chr2 [19, 24]
present only in 1 of 11 particpating calls, and as 1/11 ~ 0.09 < 0.1 (density threshold) it will be accordingly trimmed.
@mtmorgan Please be informed that there is also (at least) another frequently chosen approach to summarize individual calls across the population based on reciprocal overlap (and I guess it would be nice to have both available).
reciprocal overlap (RO) approach (e.g. Conrad et al, Nature, 2010)
reciprocal overlap of 0.51 between two CNV calls A and B:
requires that B overlap at least 51% of A, and that A also overlaps at least 51% of B
This is analogous to bedtools intersect -r -f http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html#r-and-f-requiring-reciprocal-minimal-overlap-fraction
(and I wonder whether GRanges might already support such a functionality)
Approach:
At the top level of the hierarchy, all contiguous bases overlapping at least 1bp of a CNV call are merged into a “CNV region” (CNVR). Within each CNVR we further define CNVs with the following algorithm:
Calculate reciprocal overlap (RO) between all remaining calls.
Identify pair of calls with greatest RO. If RO > threshold, merge and create a new CNV (CNV). If not, exit.
Continue adding unclustered calls to the CNV, in order of best overlap. In order to add a call, the new call must have > threshold to all calls within CNV to be added. When no additional calls may be added, move to next step.
If calls remain, return to 1. Otherwise exit.
A typical reciprocal overlap threshold value is 0.5 for constructing CNVs
This seems to do most of the trick, except for your expectation when density = 1, which seems to be inconsistent with the definition of returning ranges >= density (there is only one range satisfying that)
populationRanges <-
function(grl, density)
{
gr <- unlist(grl)
cover <- reduce(gr)
disjoint <- disjoin(gr)
dj_covered_hits <- subjectHits(findOverlaps(disjoint, cover))
cover_support <- countOverlaps(cover, gr)[dj_covered_hits]
ppn <- countOverlaps(disjoint, gr) / cover_support
reduce(disjoint[ppn >= density])
}
It could be used with RaggedExperiment as populationRanges(rowRanges(re), density = 0.1)
. I think these could be used to summarize other assays as a 'query' range in qreduceAssay()
.
Hm, after thinking that over, it makes indeed more sense that it consistently returns ranges >= density - also for density=1. I just tested it on a cattle population of 723 animals with ~50k individual calls. It works nicely and efficiently. I guess it would be great to have in RaggedExperiment as for the suggested usage on rowRanges
and in qreduceAssay
.
For the second approach (merging calls with a reciprocal overlap > xx%), we might be able to recycle some of the thoughts of @lawremi and @hpages on a related problem: https://support.bioconductor.org/p/68021/
Does HelloRanges::bedtools_intersect("-a A.bed -b B.bed -r -f .51")
generate code that is useful for the second approach?
We don't think these really belong in RaggedExperiment, which is trying to provide infrastructure for representation & manipulation, rather than specific applications. Not really sure where these belong...
Well, to me these are cousins of the *Assay functions in RaggedExperiment and are among the things that users typically would like to do with it.
But I see the point that these are going a bit further as eg disjoinAssay, and might thus be considered better placed in an application / package building on RaggedExperiment. At least, it helped me a lot for further dealing with RaggedExperiment, so thanks for that!
For the sake of completeness, I am also putting here an implementation of the RO approach (although this might not be the appropriate place). There might be more elegant / efficient ways of implementing that, but I found the following to work.
First, I needed a function that, given a set individual calls, returns overlaps (hits) between them that satisfy the RO threshold.
getROHits <- function(calls, ro.thresh=0.5)
{
# calculate pairwise ro
hits <- findOverlaps(calls, drop.self=TRUE, drop.redundant=TRUE)
x <- calls[queryHits(hits)]
y <- calls[subjectHits(hits)]
pint <- pintersect(x, y)
rovlp1 <- width(pint) / width(x)
rovlp2 <- width(pint) / width(y)
# keep only hits with ro > threshold
ind <- rovlp1 > ro.thresh & rovlp2 > ro.thresh
hits <- hits[ind]
# exit here if not 2 or more hits
if(length(hits) < 2) return(hits)
rovlp1 <- rovlp1[ind]
rovlp2 <- rovlp2[ind]
# order hits by RO
pmins <- pmin(rovlp1, rovlp2)
ind <- order(pmins, decreasing=TRUE)
qh <- queryHits(hits)
sh <- subjectHits(hits)
hits <- Hits(qh[ind], sh[ind], queryLength(hits), subjectLength(hits))
mcols(hits)$RO1 <- rovlp1[ind]
mcols(hits)$RO2 <- rovlp2[ind]
return(hits)
}
Second, I needed a function that decides whether a given hit can be merged to an already existing cluster - where mergeability requires that all cluster members satisfy the pairwise RO threshold.
isMergeable <- function(hit, cluster, full.blown.hits)
{
# (1) check whether query / subject of hit is part of cluster
curr.qh <- queryHits(hit)
curr.sh <- subjectHits(hit)
prev.qh <- queryHits(cluster)
prev.sh <- subjectHits(cluster)
prev.members <- union(prev.qh, prev.sh)
is.part <- c(curr.qh, curr.sh) %in% prev.members
# (2) can it be merged?
mergeable <- FALSE
# (2a) query *and* subject of hit are part of cluster
if(all(is.part)) mergeable <- TRUE
# (2b) query *or* subject of hit are part of cluster
else if(any(is.part))
{
# check whether the call which is not part of the cluster
# has sufficient RO with all others in the cluster
npart <- c(curr.qh, curr.sh)[!is.part]
req.hits <- Hits( rep(npart, length(prev.members)),
prev.members,
queryLength(full.blown.hits),
subjectLength(full.blown.hits))
if(all(req.hits %in% full.blown.hits)) mergeable <- TRUE
}
return(mergeable)
}
As the outlined procedure can assign a call to multiple clusters (in the most basic case a call A that has sufficient RO with a call B and a call C, but B and C do not have sufficient RO), I also wanted to optionally strip away such multi-assignments.
pruneMultiAssign <- function(clusters)
{
cid <- seq_along(clusters)
times <- sapply(clusters, length)
cid <- rep(cid, times)
ind <- unlist(clusters)
ndup <- !duplicated(ind)
ind <- ind[ndup]
cid <- cid[ndup]
pruned.clusters <- split(ind, cid)
return(pruned.clusters)
}
Building on these helper functions, the clustering itself then goes sequentially through the identified RO hits, touching each hit once, and checks whether this hit could be merged to already existing clusters.
clusterCalls <- function(calls, ro.thresh=0.5, multi.assign=FALSE)
{
hits <- getROHits(calls, ro.thresh)
# exit here if not 2 or more hits
if(length(hits) < 2) return(hits)
# worst case: there as many clusters as hits
cid <- seq_along(hits)
full.blown.hits <- union(hits, t(hits))
# touch each hit once and check whether ...
# ... it could be merged to a previous cluster
for(i in 2:length(hits))
{
curr.hit <- hits[i]
# check each previous cluster
for(j in seq_len(i-1))
{
prev.cluster <- hits[cid == j]
# has this hit already been merged?
if(!length(prev.cluster)) next
# if not, check it
mergeable <- isMergeable(curr.hit, prev.cluster, full.blown.hits)
if(mergeable)
{
cid[i] <- j
break
}
}
}
# compile hit clusters
hit.clusters <- unname(splitAsList(hits, cid))
# extract call clusters
call.clusters <- lapply(hit.clusters,
function(h) union(queryHits(h), subjectHits(h)))
# can calls be assigned to more than one cluster?
if(!multi.assign) call.clusters <- pruneMultiAssign(call.clusters)
return(call.clusters)
}
This clustering procedure is then invoked on each initial cluster, which are, according to the procedure outlined by Conrad et al., constructed by merging all calls with >= 1 bp overlap.
populationRangesRO <- function(grl, ro.thresh=0.5, multi.assign=FALSE)
{
gr <- unlist(grl)
# build initial clusters
init.clusters <- reduce(gr)
# cluster within each initial cluster
cl.per.iclust <- sapply(init.clusters,
function(ic)
{
# get calls of cluster, cluster them, and merge clustered calls
ccalls <- subsetByOverlaps(gr, ic)
clusters <- clusterCalls(ccalls, ro.thresh, multi.assign)
clusters <- range(extractList(ccalls, clusters))
clusters <- sort(unlist(clusters))
})
ro.ranges <- unname(unlist(GRangesList(cl.per.iclust)))
return(ro.ranges)
}
Applying this function to the above example grl
accordingly returns:
> populationRangesRO(grl)
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 1, 12] *
[2] chr2 [11, 18] *
[3] chr2 [15, 18] *
[4] chr2 [18, 35] *
[5] chr2 [25, 36] *
If these don't go in the package itself, perhaps they could go in the vignette?
This issue was moved to waldronlab/cnvAnalyzeR#1
(In CNV analysis, I assume also for similar data types such as somatic mutation), it is often of interest to summarize invidual calls across the population, (i.e. define CNV regions), for subsequent association analysis with e.g. phenotype data.
https://www.ncbi.nlm.nih.gov/pubmed/19812545 https://www.ncbi.nlm.nih.gov/pubmed/22539667 (Figure 1 for a sketch of the concept)
In the simplest case, this just merges overlapping individual calls into summarized regions. However, this typically inflates CNVR size and trimming low-density areas (usually <10% of the total contributing individual calls within a summarized region) is advisable. Alternatively, a threshold for reciprocal overlap between any two individual calls can also be applied.
That means such a function would take a RaggedExperiment as Input and would return a GRanges object storing the summarized genomic regions.
an example:
grl <- GRangesList( sample1 = GRanges( c("chr1:1-10", "chr2:15-18", "chr2:25-34") ), sample2 = GRanges( c("chr1:1-10", "chr2:11-18" , "chr2:25-36") ), sample3 = GRanges( c("chr1:2-11", "chr2:14-18", "chr2:26-36") ), sample4 = GRanges( c("chr1:1-12", "chr2:18-35" ) ), sample5 = GRanges( c("chr1:1-12", "chr2:11-17" , "chr2:26-34") ) , sample6 = GRanges( c("chr1:1-12", "chr2:12-18" , "chr2:25-35") ) )
re <- RaggedExperiment(grl)
Default (as you would typically like to have it in CNV analysis)
GRanges object with 3 ranges and 0 metadata columns: seqnames ranges strand [1] chr1 [ 1, 12] [2] chr2 [11, 18] [3] chr2 [25, 36] *
density=0 merges all overlapping regions, equivalent to: reduce(unlist(grl))
GRanges object with 2 ranges and 0 metadata columns: seqnames ranges strand