summarization of individual calls/ranges across population

lgeistlinger commented 7 years ago

(In CNV analysis, I assume also for similar data types such as somatic mutation), it is often of interest to summarize invidual calls across the population, (i.e. define CNV regions), for subsequent association analysis with e.g. phenotype data.

https://www.ncbi.nlm.nih.gov/pubmed/19812545 https://www.ncbi.nlm.nih.gov/pubmed/22539667 (Figure 1 for a sketch of the concept)

In the simplest case, this just merges overlapping individual calls into summarized regions. However, this typically inflates CNVR size and trimming low-density areas (usually <10% of the total contributing individual calls within a summarized region) is advisable. Alternatively, a threshold for reciprocal overlap between any two individual calls can also be applied.

That means such a function would take a RaggedExperiment as Input and would return a GRanges object storing the summarized genomic regions.

an example:

grl <- GRangesList( sample1 = GRanges( c("chr1:1-10", "chr2:15-18", "chr2:25-34") ), sample2 = GRanges( c("chr1:1-10", "chr2:11-18" , "chr2:25-36") ), sample3 = GRanges( c("chr1:2-11", "chr2:14-18", "chr2:26-36") ), sample4 = GRanges( c("chr1:1-12", "chr2:18-35" ) ), sample5 = GRanges( c("chr1:1-12", "chr2:11-17" , "chr2:26-34") ) , sample6 = GRanges( c("chr1:1-12", "chr2:12-18" , "chr2:25-35") ) )

re <- RaggedExperiment(grl)

Default (as you would typically like to have it in CNV analysis)

populationRanges(re, density=0.1)

GRanges object with 3 ranges and 0 metadata columns: seqnames ranges strand [1] chr1 [ 1, 12] [2] chr2 [11, 18] [3] chr2 [25, 36] *

density=0 merges all overlapping regions, equivalent to: reduce(unlist(grl))

populationRanges(re, density=0)

GRanges object with 2 ranges and 0 metadata columns: seqnames ranges strand

[1] chr1 [ 1, 12] * [2] chr2 [11, 36] * ### density=1 disjoins all overlapping regions, equivalent to: disjoin(unlist(grl)) > populationRanges(re, density=1) GRanges object with 13 ranges and 0 metadata columns: seqnames ranges strand [1] chr1 [ 1, 1] * [2] chr1 [ 2, 10] * [3] chr1 [11, 11] * [4] chr1 [12, 12] * [5] chr2 [11, 13] * ... ... ... ... [9] chr2 [19, 24] * [10] chr2 [25, 25] * [11] chr2 [26, 34] * [12] chr2 [35, 35] * [13] chr2 [36, 36] * Where as merging and disjoining are suited for the ideal case of matching calls between individuals, technical and biological variation typically results in data with all kinds of overlaps and fuzzy region borders. That would be a way to account for that.

mtmorgan commented 7 years ago

@lgeistlinger Can you confirm your expectations for density=.1?

lgeistlinger commented 7 years ago

@mtmorgan Thanks for inquiring. I have clarified on the above example for density=.1

As I understand the procedure, it will first merge individual calls resulting in the regions

[1] chr1 [ 1, 12] [2] chr2 [11, 36]

where the second region consists of 11 individual calls. It will then check on a base-by-base level on regional density, finding the region

chr2 [19, 24]

present only in 1 of 11 particpating calls, and as 1/11 ~ 0.09 < 0.1 (density threshold) it will be accordingly trimmed.

lgeistlinger commented 7 years ago

@mtmorgan Please be informed that there is also (at least) another frequently chosen approach to summarize individual calls across the population based on reciprocal overlap (and I guess it would be nice to have both available).

reciprocal overlap (RO) approach (e.g. Conrad et al, Nature, 2010)

reciprocal overlap of 0.51 between two CNV calls A and B:

requires that B overlap at least 51% of A, and that A also overlaps at least 51% of B

This is analogous to bedtools intersect -r -f http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html#r-and-f-requiring-reciprocal-minimal-overlap-fraction

(and I wonder whether GRanges might already support such a functionality)

Approach:

At the top level of the hierarchy, all contiguous bases overlapping at least 1bp of a CNV call are merged into a “CNV region” (CNVR). Within each CNVR we further define CNVs with the following algorithm:

Calculate reciprocal overlap (RO) between all remaining calls.
Identify pair of calls with greatest RO. If RO > threshold, merge and create a new CNV (CNV). If not, exit.
Continue adding unclustered calls to the CNV, in order of best overlap. In order to add a call, the new call must have > threshold to all calls within CNV to be added. When no additional calls may be added, move to next step.
If calls remain, return to 1. Otherwise exit.

A typical reciprocal overlap threshold value is 0.5 for constructing CNVs

lgeistlinger commented 7 years ago

RO.pdf

mtmorgan commented 7 years ago

This seems to do most of the trick, except for your expectation when density = 1, which seems to be inconsistent with the definition of returning ranges >= density (there is only one range satisfying that)

populationRanges <-
    function(grl, density)
{
    gr <- unlist(grl)
    cover <- reduce(gr)
    disjoint <- disjoin(gr)

    dj_covered_hits <- subjectHits(findOverlaps(disjoint, cover))
    cover_support <- countOverlaps(cover, gr)[dj_covered_hits]
    ppn <- countOverlaps(disjoint, gr) / cover_support
    reduce(disjoint[ppn >= density])
}

It could be used with RaggedExperiment as populationRanges(rowRanges(re), density = 0.1). I think these could be used to summarize other assays as a 'query' range in qreduceAssay().

lgeistlinger commented 7 years ago

Hm, after thinking that over, it makes indeed more sense that it consistently returns ranges >= density - also for density=1. I just tested it on a cattle population of 723 animals with ~50k individual calls. It works nicely and efficiently. I guess it would be great to have in RaggedExperiment as for the suggested usage on rowRanges and in qreduceAssay.

lgeistlinger commented 7 years ago

For the second approach (merging calls with a reciprocal overlap > xx%), we might be able to recycle some of the thoughts of @lawremi and @hpages on a related problem: https://support.bioconductor.org/p/68021/

mtmorgan commented 7 years ago

Does HelloRanges::bedtools_intersect("-a A.bed -b B.bed -r -f .51") generate code that is useful for the second approach?

We don't think these really belong in RaggedExperiment, which is trying to provide infrastructure for representation & manipulation, rather than specific applications. Not really sure where these belong...

lgeistlinger commented 7 years ago

Well, to me these are cousins of the *Assay functions in RaggedExperiment and are among the things that users typically would like to do with it.

But I see the point that these are going a bit further as eg disjoinAssay, and might thus be considered better placed in an application / package building on RaggedExperiment. At least, it helped me a lot for further dealing with RaggedExperiment, so thanks for that!

lgeistlinger commented 7 years ago

For the sake of completeness, I am also putting here an implementation of the RO approach (although this might not be the appropriate place). There might be more elegant / efficient ways of implementing that, but I found the following to work.

First, I needed a function that, given a set individual calls, returns overlaps (hits) between them that satisfy the RO threshold.

getROHits <- function(calls, ro.thresh=0.5)
{
    # calculate pairwise ro
    hits <- findOverlaps(calls, drop.self=TRUE, drop.redundant=TRUE)

    x <- calls[queryHits(hits)]
    y <- calls[subjectHits(hits)]
    pint <- pintersect(x, y)
    rovlp1 <- width(pint) / width(x)
    rovlp2 <- width(pint) / width(y)

    # keep only hits with ro > threshold
    ind <- rovlp1 > ro.thresh & rovlp2 > ro.thresh
    hits <- hits[ind]

    # exit here if not 2 or more hits
    if(length(hits) < 2) return(hits)    

    rovlp1 <- rovlp1[ind]
    rovlp2 <- rovlp2[ind]

    # order hits by RO 
    pmins <- pmin(rovlp1, rovlp2)
    ind <- order(pmins, decreasing=TRUE)

    qh <- queryHits(hits)
    sh <- subjectHits(hits)    
    hits <- Hits(qh[ind], sh[ind], queryLength(hits), subjectLength(hits))
    mcols(hits)$RO1 <- rovlp1[ind]
    mcols(hits)$RO2 <- rovlp2[ind]

    return(hits)
}

Second, I needed a function that decides whether a given hit can be merged to an already existing cluster - where mergeability requires that all cluster members satisfy the pairwise RO threshold.

isMergeable <- function(hit, cluster, full.blown.hits)
{   
    # (1) check whether query / subject of hit is part of cluster
    curr.qh <- queryHits(hit)
    curr.sh <- subjectHits(hit)

    prev.qh <- queryHits(cluster)
    prev.sh <- subjectHits(cluster)
    prev.members <- union(prev.qh, prev.sh)
    is.part <- c(curr.qh, curr.sh) %in% prev.members

    # (2) can it be merged?    
    mergeable <- FALSE

    # (2a) query *and* subject of hit are part of cluster
    if(all(is.part)) mergeable <- TRUE

    # (2b) query *or* subject of hit are part of cluster
    else if(any(is.part))
    {   
        # check whether the call which is not part of the cluster
        # has sufficient RO with all others in the cluster 
        npart <- c(curr.qh, curr.sh)[!is.part]
        req.hits <- Hits(   rep(npart, length(prev.members)),
                            prev.members, 
                            queryLength(full.blown.hits), 
                            subjectLength(full.blown.hits))

        if(all(req.hits %in% full.blown.hits)) mergeable <- TRUE
    }

    return(mergeable)
}

As the outlined procedure can assign a call to multiple clusters (in the most basic case a call A that has sufficient RO with a call B and a call C, but B and C do not have sufficient RO), I also wanted to optionally strip away such multi-assignments.

pruneMultiAssign <- function(clusters)
{
    cid <- seq_along(clusters)
    times <- sapply(clusters, length)
    cid <- rep(cid, times)
    ind <- unlist(clusters)
    ndup <- !duplicated(ind)
    ind <- ind[ndup]
    cid <- cid[ndup]

    pruned.clusters <- split(ind, cid)
    return(pruned.clusters)
}

Building on these helper functions, the clustering itself then goes sequentially through the identified RO hits, touching each hit once, and checks whether this hit could be merged to already existing clusters.

clusterCalls <- function(calls, ro.thresh=0.5, multi.assign=FALSE)
{
    hits <- getROHits(calls, ro.thresh)

    # exit here if not 2 or more hits
    if(length(hits) < 2) return(hits)

    # worst case: there as many clusters as hits
    cid <- seq_along(hits)
    full.blown.hits <- union(hits, t(hits))

    # touch each hit once and check whether ... 
    # ... it could be merged to a previous cluster
    for(i in 2:length(hits))
    {
        curr.hit <- hits[i]

        # check each previous cluster
        for(j in seq_len(i-1))
        {
            prev.cluster <- hits[cid == j]

            # has this hit already been merged?
            if(!length(prev.cluster)) next

            # if not, check it
            mergeable <- isMergeable(curr.hit, prev.cluster, full.blown.hits)

            if(mergeable)
            {
                cid[i] <- j
                break
            }
        }
    }

    # compile hit clusters 
    hit.clusters <- unname(splitAsList(hits, cid))

    # extract call clusters
    call.clusters <- lapply(hit.clusters,
        function(h) union(queryHits(h), subjectHits(h)))

    # can calls be assigned to more than one cluster?
    if(!multi.assign) call.clusters <- pruneMultiAssign(call.clusters)

    return(call.clusters)
}

This clustering procedure is then invoked on each initial cluster, which are, according to the procedure outlined by Conrad et al., constructed by merging all calls with >= 1 bp overlap.

populationRangesRO <- function(grl, ro.thresh=0.5, multi.assign=FALSE)
{
    gr <- unlist(grl)

    # build initial clusters
    init.clusters <- reduce(gr)

    # cluster within each initial cluster
    cl.per.iclust <- sapply(init.clusters,
        function(ic)
        {
            # get calls of cluster, cluster them, and merge clustered calls
            ccalls <- subsetByOverlaps(gr, ic)
            clusters <- clusterCalls(ccalls, ro.thresh, multi.assign)
            clusters <- range(extractList(ccalls, clusters))
            clusters <- sort(unlist(clusters))
    })
    ro.ranges <- unname(unlist(GRangesList(cl.per.iclust)))
    return(ro.ranges)
}

Applying this function to the above example grl accordingly returns:

> populationRangesRO(grl)
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1  [ 1, 12]      *
  [2]     chr2  [11, 18]      *
  [3]     chr2  [15, 18]      *
  [4]     chr2  [18, 35]      *
  [5]     chr2  [25, 36]      *

lwaldron commented 7 years ago

If these don't go in the package itself, perhaps they could go in the vignette?

LiNk-NY commented 6 years ago

This issue was moved to waldronlab/cnvAnalyzeR#1

Bioconductor / RaggedExperiment