PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0
68 stars 15 forks source link

Add more vignette examples explaining `findBreakpointOverlaps()` parameters. #23

Closed oalavijeh closed 4 years ago

oalavijeh commented 4 years ago

Dear Daniel, I was trying to use SVA and had a few questions (that I'm sure are very simple so apologies):

suppressPackageStartupMessages(require(StructuralVariantAnnotation)) suppressPackageStartupMessages(require(VariantAnnotation)) vcf.file <- system.file("extdata", "gridss.vcf", package = "StructuralVariantAnnotation") vcf <- VariantAnnotation::readVcf(vcf.file, "hg19") gr <- breakpointRanges(vcf)

at "vcf <- VariantAnnotation::readVcf(vcf.file, "hg19")" I get "file does not exist" message. Is the hg19 a file from the VariantAnnotation package or a different reference file we are meant to provide?

Many thanks for your help

cbrueffer commented 4 years ago

Hi @oalavijeh , I can at least answer your last question. You need the BSgenome.Hsapiens.UCSC.hg19 bioconductor package to make this work, that's what the hg19 refers to. See https://bioconductor.org/packages/release/bioc/manuals/VariantAnnotation/man/VariantAnnotation.pdf for details.

d-cameron commented 4 years ago

or just use vcf <- VariantAnnotation::readVcf(vcf.file) without specifying a genome.

Can SVA find those that overlap by a certain percentage?

Yes. This is what findBreakpointOverlaps() does. Essentially, it does what findOverlaps() does but it calculates overlap for the two breakends. In practice, this means that for simple events the terminology is different. Since the comparison is between the intervals of uncertainty around the start and end of the SVs, maxgap means the maximum distance the start|end can be away from each other before they are considered a match, and minoverlap is kind of useless and just kept for API consistency with findOverlaps().

Since SVA can handle arbitrary SVs, it represents them not as {type, chr, start, end}, but as {breakend1, breakend2}. That is, {(chr, lower_pos_bound, upper_pos_bound, orientation), (chr, lower_pos_bound, upper_pos_bound, orientation)}. lower_pos_bound and upper_pos_bound for a breakend can be different either due to the call itself being IMPRECISE (e.g. RP based callers), or due to a precise variant having a microhomology at the breakpoint position (e.g. flank polyA on both sides of both breakends)

In summary:

Can SVA then merge these?

No. Merging identical events is trivial but there's too many choices to make when merging SVs that do not match exactly. For example, if you have a 50bp maxgap you have have event A overlap B, B overlap C, but A not overlap C. There's no right answer about how to merge these events.

Example: A: DEL [100, 150] B: DEL [110, 160] C: DEL [120, 170] D: DEL [130, 180] E: DEL [140, 190] F: DEL [150, 200] G: DEL [160, 210]

With maxgap=10, each event overlaps the next event, but by the time you get to G, there's no common deleted bases at all. There's no right answer about what to do in such scenario so I have intentionally excluded such functionality from SVA.

oalavijeh commented 4 years ago

Thanks Daniel and Christian for the very helpful answers!