PapenfussLab / StructuralVariantAnnotation

R package designed to simplify structural variant analysis
GNU General Public License v3.0
68 stars 15 forks source link

Subset by multiple seqnames #43

Closed nuttynutmore closed 9 months ago

nuttynutmore commented 10 months ago

Hi there,

I'm interested in using the StructuralVariantAnnotation package to find overlapping SVs within selected regions of interest, across different SV callers (namely GRIDSS and Manta). In particular, it is the ability to retain the single breakends from GRIDSS that I am interested in.

I have read the StructuralVariantAnnotation and VariantAnnotation package documentation carefully and understand that in order to use countBreakpointOverlaps, the single breakends must be self-partnered. The vignettes and example code in the documentation illustrates how to subset a Breakpoint Ranges object by a single chromosome/ seqname, but not how to achieve this with multiple regions of interest.

I have tried several ways of doing this (illustrated below), however, I always get the following error: Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Breakpoint GRanges contains single breakends

Code using a single chromosome/ seqname (which works, as per vignette example):

gridss_vcf <- VariantAnnotation::readVcf("input.vcf")

gridss_gr <- sort(c(breakpointRanges(gridss_vcf), breakendRanges(gridss_vcf)))

gridss_breakpoints <- gridss_gr[seqnames(gridss_gr) == "chr3" |
                                seqnames(partner(gridss_gr, selfPartnerSingleBreakends=TRUE)) == "chr3"]

combined_callers <- c(gridss_breakpoints, manta_breakpoints)
combined_callers$In_GRIDSS <- countBreakpointOverlaps(combined_callers, gridss_breakpoints,
                                                      maxgap=100,
                                                      sizemargin=0.25,
                                                      restrictMarginToSizeMultiple=0.5)

This is what I have tried to subset gridss_gr to include all our regions of interest:

  1. Using a vector of the regions of interest
roi_chr <- c("chr1", "chr2", "chr3")

gridss_breakpoints <- gridss_gr[seqnames(gridss_gr) %in% roi_chr |
                                seqnames(partner(gridss_gr, selfPartnerSingleBreakends=TRUE)) %in% roi_chr]
  1. Using a character string
gridss_breakpoints <- gridss_gr[seqnames(gridss_gr) %in% c("chr1", "chr2", "chr3") |
                                seqnames(partner(gridss_gr, selfPartnerSingleBreakends=TRUE)) %in% c("chr1", "chr2", "chr3")]
  1. Using GenomicRanges subsetByOverlaps
roi_gr <- GRanges(seqnames = c("chr1", "chr2", "chr3"),
                  ranges = IRanges(start = 0, end = 1000000000),
                  strand = "+")

gridss_breakpoints <- subsetByOverlaps(partner(gridss_gr, selfPartnerSingleBreakends = TRUE), roi_gr)
  1. Not subsetting data at all
gridss_gr <- partner(gridss_gr, selfPartnerSingleBreakends = TRUE)

I am using R version 4.1.1 and StructuralVariantAnnotation version 1.8.2

None of these methods seem to work, as I always get the error telling me that there are single breakends. I don't know what else to try and can't seem to find anyone else who has had the same issue. Any advice would be greatly appreciated. Many thanks, Kind regards, Natasha (nuttynumore)

nuttynutmore commented 10 months ago

Update: I have also tried this with R version 4.3.0 and StructuralVarianAnnotation version 1.16.0 and have the same results.

Upon further investigation, it appears the selfPartnerSingleBreakends = TRUE does not work. I have examined the output setting this to TRUE and FALSE and either way, names(bpgr) and bpgr$partner are not equal. bpgr$partner still contains NAs and names(bpgr) does not contain the NA partners.

I have been trying to solve this a different way- from our data we have good reason to believe the BEALN field for many of our single breakends may be the legitimate structural variant partner. Unfortunately, my method involves converting the breakpointRanges object into a data frame and there does not seem to be a way to convert a data frame to a breakpointRanges object (the function only seems to accept VCFs). I also don't seem to be able to convert the data frame to a vcf.

Any insight and advice on the issue is greatly appreciated, Kind regards, Natasha (nuttynutmore)

d-cameron commented 9 months ago

You need to process the breakpoints and single breakends separately. That is,

gridss_be_gr <- breakendRanges(gridss_vcf)
gridss_bp_gr <- breakpointRanges(gridss_vcf)

And then perform StructuralVariantAnnotation breakpoint matching logic on the breakpoints, and GRanges countOverlaps() on the single breakends. For example, this is the relevant benchmarking code for the GRIDSS2 manuscript:

https://github.com/PapenfussLab/gridss/blob/master/scripts/gridss2_manuscript/libbenchmark.R#L135-L184

It distinguishes between a breakpoint match, a breakpoint matching a breakend (at either or both sides) and handles duplicate calls by favoring the highest QUAL breakpoint match.

Unfortunately, there are many knobs one can dial when matching variants and there's no one clearly best option. For example, how closely do the sequences have to match? How much margin do you allow in event size? How much margin do you allow in start/end breakend position? Do you match INS with DUP if they represent the same rearrangement? Do you require sequence matching of the INS and DUP or is an INS of the correct length anywhere in the DUP region considered a match? Does you matching logic correctly match left and right aligned SVs? What about centre-aligned? Do you score precise and imprecise variants in the same manner (i.e. do you favour the caller that perfectly reconstructed the sequence over a caller that just called an imprecise breakpoint with +-100bp margin? How do you handle a breakend match - is it worth 1/2 a TP? What do you two when a caller makes multiple calls for the same TP? What do you do about transitive breakpoints (i.e. if you have A-B-C and a caller reports A-C with inserted sequence of B do you consider that a match to A-B & B-C?)? Do you match a mobile element INS with breakpoints to/from the donor site? What if the caller only called one breakpoint between the donor and insertion site?

All of the above are issues I have personally encountered when SV benchmarking. Unfortunately, there's not always a clearly correct answer. For many of these, it really depends on the purpose of your comparison. SV benchmarking can get very complicated.

nuttynutmore commented 9 months ago

Thank you for your really thorough reply.

Lots to think about!