fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
344 stars 46 forks source link

GRIDSS support #92

Open d-cameron opened 5 years ago

d-cameron commented 5 years ago

Hello

I am looking at adding GRIDSS to parliament, but it appears that survivor does not support GRIDSS output. AFAIK, GRIDSS is fully compliant with VCFv4.3 but there are a number of issues I've encountered:

Is there any chance SURVIVOR could be updated to handle GRIDSS VCFs?

fritzsedlazeck commented 5 years ago

Hi, sorry for the late reply. So SURVIVOR currently tries to translate a BND event either in an INV or in a TRA event internally. This is determined if the two breakpoints are on the same chr or not.

So only the first point of your list is basically relevant and also only if you require a type save merge. This is because a DEL is for me not the same as a BND although I know that a BND can be everything according to the standard.

I dont want to open up this constraint to be type save especially since it is a parameter already. I hope that makes sense. Thanks Fritz

zlye commented 4 years ago

@d-cameron - did you ever find a solution to this? I am trying to merge SVs from a population discovered via GRIDSS and would be helpful to use survivor

d-cameron commented 4 years ago

No I didn't get around to doing this. I did a lot of benchmarking (PMID: 31324872) and came to the conclusion that a) GRIDSS and manta (the ones that incorporate assembly) are the only tools really worth running, b) X of Y ensembles are pretty bad, and c) the value in ensemble calling is in the additional work that is done (e.g. parliament running SVTyper).

Good ensemble callers seem to effectively turn into (targetted) callers in their own right, so my time was better spent making GRIDSS better, and improving the downstream interpretation (e.g. see https://www.biorxiv.org/content/10.1101/781013v1 for our work on complex event interpretation)

d-cameron commented 4 years ago

If it's just GRIDSS calls from multiple samples, merging should be relatively straight-forward since you don't have to deal with the mess of merging from multiple callers each of which uses their own dialect. I use BioConductor StructuralVariantAnnotation package for this, although if you're dealing with hundreds or thousands of samples R would probably run out of memory when jointly processing full GRIDSS unfiltered call sets (since the full VCFs are ~1GB per sample)

fritzsedlazeck commented 4 years ago

Thanks @d-cameron . I think my only problem with GRIDSS is that I find it hard to interpret that all events are BND. It is true that this is standard confirm, but how do I distinguish if a gene is e.g. deleted or inverted? I never got around to see what information GRIDSS provides to distinguish that.

As you can see from the parliament paper we did also a bunch of benchmarks. You are right Manta does a good job, but the other SV caller also improves above that for certain SV that are missed initially.

I will read up your paper, but would be great if you can extend ob your points that ensembles methods are not worth the time. Does not seem to be the case for everyone else, but maybe we are overlooking something.

Thanks Fritz

zlye commented 4 years ago

@d-cameron I know a population can be merged using structural variant annotation R, but it would be nice to have them as a VCF for a few reasons - to have the VAF calculations that you suggested for genotyping and to use the VCF format without downstream analysis that require a VCF format. Indeed I am working with 100s of samples, (that expect to be prohibitive to R memory) and that is why I'm hoping to merge them as VCF using an existing tools (like SURVIVOR). Do you have any recommendation? Can this actually be done with StructuralVariantAnnotation - I have had problems trying to use the "breakpointGRangesToVCF" function? I appreciate your input before I launch into writing some custom scripts.

fritzsedlazeck commented 4 years ago

@zlye are you just using GRIDSS? I believe the merge across samples should still work. Can you try a subset and let me know if not? Please pay attention if SURVIVOR wrongly converts BND to INV. Thanks Fritz

d-cameron commented 4 years ago

but the other SV caller also improves above that for certain SV that are missed initially.

We also found you could improve performance with a simple X of Y ensemble. The reason that I'm so pessimistic about simple ensemble calling is that that a) all the best ensembles had both manta and GRIDSS, b) the top ensembles had too many callers (e.g 7 of 10, or 6 of 9), and c) the top ensembles were not stable - the 'best' for HG002 could be worse than GRIDSS alone for NA12878.

I think my only problem with GRIDSS is that I find it hard to interpret that all events are BND.

By design, it doesn't. Fundamentally, GRIDSS is a breakpoint caller and since all it does it detect breakpoints. My choice of BND notation was intentional as it makes it quite clear what the actual claim is. Calling a deletion from breakpoint support alone is misleading because you haven't actually checked that the segment is actually deleted - it could be part of a copy number neutral complex rearrangement. For pretty much every NGS caller (lumpy being an exception), DEL just means a an intra-chromosomal breakpoint in +- orientation.

how do I distinguish if a gene is e.g. deleted or inverted?

You can't and shouldn't do that by just iteratively looking individual breakpoints in isolation. Here is an example of a complex event that GRIDSS/PURPLE/LINX fully resolves (the outer line is the reconstructed derivative chromosome and the inner tracks indicate CN and LOH): image

Want to know what happened to you gene of interest? You need to know everything that happens in that genomic region.

Fundamentally, the problem is due to a lack of separation of concerns. Primitive (breakpoint, CN) identification and event interpretation have been confounded and treated as the same thing when in fact they are two very different things. We're actually discussing this in the GA4GH file formats group and I'm currently drafting up some big changes to SV representation for VCFv4.4. If you have time, let me know what you think of the proposed changes (https://github.com/samtools/hts-specs/pull/465) as it'd be great to have eyeballs familar with the mess that is SV calling looking at it.

I will read up your paper, but would be great if you can extend ob your points that ensembles methods are not worth the time. Does not seem to be the case for everyone else, but maybe we are overlooking something.

One of the PhD students in our lab is doing ML ensemble calling and the getting really good results out of that. The thing is, she's also getting really good results with single-caller ML quality score recalibration. Ensemble-ness itself doesn't actually give you that much - it's all about running a sensitive caller and controlling your FP rate. I wouldn't be surprised if she was able to outperform all traditional ensembles with a single-caller ML qual recalibration using manta, GRIDSS, or pindel. Yes, you'll get slightly better performance by doing an ensemble on top of that, but you'll get most of the way there off a single caller.

A good example of why tuning callers to your data is the PCAWG consensus somatic call set vs somatic GRIDSS/PURPLE/LINX on the Hartwig metastatic cancer cohort. PCAWG spent a lot of time and effort on the ensemble meta-caller - we spent a lot of effort tuning our single caller pipeline for our cohort. Using unexplained CN transition as a proxy for sensitivity, we found that we missed 2% of somatic SVs, whereas PCAWG missed 20%. Subsetting to PCAWG clonal calls for a more fair comparison (we did deeper sequencing), they still missed 13%. Better algorithms beat ensemble calling every time. Why did we do so well? Well, when 7% of your calls are to repetative sequence (e.g. centromeres), ensemble calling doesn't help for those calls - you need a caller that reports single breakends (hence why GRIDSS2 does single breakend calling). 7% deeper sequencing, 7% novel reporting, 1% other novel features, 3% tuning & better calling. Ensemble calling only addresses that last 3%.

All that said, where ensemble calling does shine is in merging disparate callers. Take an SVMerge style approach and merging the results from a general purpose SV caller, a retrotransposon caller, and a microsatellite caller, a viral/novel insertion caller and an indel caller makes a lot of sense. Combining multiple callers that use the same evidence and perform the same steps gives you a much lower return on your investment. These callers will tend to have similar results, including sharing the same false positives (PMID: 31324872 Fig 3b). It's simpler to just run a good caller and spend your time identifying and filtering systematic false positives.

d-cameron commented 4 years ago

I get the distinct feeling that ensemble calling is popular amougst the large consortia for political reasons. The last time I attempting to convince a consortium I was involved in to switch pipelines I got so much pushback that I abandoned that attempt and it turned into just one more caller for their ensemble.

ohofmann commented 4 years ago

Nice writeup. I still value in ensemble calls for cases where you need an all purpose method and are not able to tweak parameter settings to your specific library prep / sample quality / sequencing method, but that's an aside. Are you planning to put some of those examples on a blog? Would be really good to have a writeup somewhere!

d-cameron commented 4 years ago

@d-cameron I know a population can be merged using structural variant annotation R, but it would be nice to have them as a VCF for a few reasons - to have the VAF calculations that you suggested for genotyping and to use the VCF format without downstream analysis that require a VCF format. Indeed I am working with 100s of samples, (that expect to be prohibitive to R memory) and that is why I'm hoping to merge them as VCF using an existing tools (like SURVIVOR).

Do you have any recommendation? Can this actually be done with StructuralVariantAnnotation - I have had problems trying to use the "breakpointGRangesToVCF" function? I appreciate your input before I launch into writing some custom scripts.

breakpointGRangesToVCF doesn't work (yet).

If I had your cohort, I'd do, GRIDSS joint multisample calling on any related samples then single-sample filtering to remove the low VAF noise and get the files to a managable size. From there you can either run SURVIVOR (assuming it runs happily), or use StructuralVariantAnnotation to calculate mutual overlap. You might get away with just renaming the shared variants to have the same name in all samples then something like vcfools to merge into a single VCF (disclaimer: do not know if this works - have not tried it myself). Depending on what you want to do, it might even be easiest not to bother with merging and just use the ID field for uniqness (same ID = same variant) in much the same manner dbSNP as unique variant identifiers.

VCF doesn't scale well to many samples. Since both # variants and # FORMAT columns scale with the number of sample, space used scales super-linearly (mostly full of lots of zeros for rare variants). It's something that's being looked at for VCF currently looking at.

zlye commented 4 years ago

Thanks for your insight - I am not totally sure what you meant when you said:

"Depending on what you want to do, it might even be easiest not both with merging and just use the ID field for uniqness (same ID = same variant) in much the same manner dbSNP as unique variant identifiers."

On Sun, Mar 22, 2020 at 5:26 AM Daniel Cameron notifications@github.com wrote:

@d-cameron https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_d-2Dcameron&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=163oVAfqYW8rLwmeRfPScQ&m=8wwrSkCzVHJAYdDY47d-scV0JK34rXouhcO6twfdg3Y&s=64Wk7SRcOcuITXMdh_BITRpmxpKO5aMFuu-dGaVEiPI&e= I know a population can be merged using structural variant annotation R, but it would be nice to have them as a VCF for a few reasons - to have the VAF calculations that you suggested for genotyping and to use the VCF format without downstream analysis that require a VCF format. Indeed I am working with 100s of samples, (that expect to be prohibitive to R memory) and that is why I'm hoping to merge them as VCF using an existing tools (like SURVIVOR).

Do you have any recommendation? Can this actually be done with StructuralVariantAnnotation - I have had problems trying to use the "breakpointGRangesToVCF" function? I appreciate your input before I launch into writing some custom scripts.

breakpointGRangesToVCF doesn't work (yet).

If I had your cohort, I'd do, GRIDSS joint multisample calling on any related samples then single-sample filtering to remove the low VAF noise and get the files to a managable size. From there you can either run SURVIVOR (assuming it runs happily), or use StructuralVariantAnnotation to calculate mutual overlap. You might get away with just renaming the shared variants to have the same name in all samples then something like vcfools to merge into a single VCF (disclaimer: do not know if this works - have not tried it myself). Depending on what you want to do, it might even be easiest not both with merging and just use the ID field for uniqness (same ID = same variant) in much the same manner dbSNP as unique variant identifiers.

VCF doesn't scale well to many samples. Since both # variants and # FORMAT columns scale with the number of sample, space used scales super-linearly (mostly full of lots of zeros for rare variants). It's something that's being looked at for VCF currently looking at.

SURIVIOR

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_fritzsedlazeck_SURVIVOR_issues_92-23issuecomment-2D602170352&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=163oVAfqYW8rLwmeRfPScQ&m=8wwrSkCzVHJAYdDY47d-scV0JK34rXouhcO6twfdg3Y&s=Zb6VwasND_6pLSBOPuYNTRTt6_AF7tdjiHbneJazOLE&e=, or unsubscribe https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AKFUYU4DWETS2HY42W6SSUDRIXKWBANCNFSM4HXEWEUA&d=DwMCaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=163oVAfqYW8rLwmeRfPScQ&m=8wwrSkCzVHJAYdDY47d-scV0JK34rXouhcO6twfdg3Y&s=P8Lsn5QLaX70alIiVB4gQBKqblFiHtIsw-Vj64ZqQvY&e= .

-- Zoe Lye P.h.D. candidate 12 Waverly Place New York University New York, NY 10003

fritzsedlazeck commented 4 years ago

Sorry just catching up. Thanks for the insights @d-cameron

@DEL vs. INV: you can detect if a region is deleted or not. Just based on coverage. Manta etc. has that included. It is true that some of the larger variations get tricky...

@Ensamble caller: I think the real strength of the ensemble calling is maybe not sensitivity (I think you are focusing on that and many other callers (MetaSV)) but precision. Not to use a union but a consensus. If two caller agree on the same SV its more likely to be correct. Exceptions are alignment artifacts, but that is always problematic.

Franky, I think you are quite deep in cancer calling these breakpoints. Which is fair and important. General SV calling also needs to detect SV that plays importance in population settings etc. You cannot just say there are no deletions and duplications. This is simply not true. Sorry if I misunderstood.

With VCF scaling: that is true. The same problem goes for SNV. There are some techniques but hard to implement for SVs (eg. gVCF).

@GRIDSS: when you say by default it does not let you detect del vs. bnd: Do you have a parameter?

Thanks Fritz

d-cameron commented 4 years ago

If two caller agree on the same SV its more likely to be correct.

This issue with this approach is that similar callers make the same FP calls (https://www.nature.com/articles/s41467-019-11146-4/figures/3). The tail of TPs called by few callers is smaller than the head of FP calls called by many callers. That said, those results are NA12878 so if those FPs are actually TP omissions from the true set used in the benchmark, this will be less of a problem. Ensemble calling works well when your input caller call different calls. GRIDSS and manta are already close to the pareto-optimial curve for simple ensembles so the value of a simple ensemble over those callers is much less than for callers such as breakdancer, delly and lumpy.

when you say by default it does not let you detect del vs. bnd: Do you have a parameter?

GRIDSS reports all breakpoints in BND notation.

del vs. INV: you can detect if a region is deleted or not. Just based on coverage. Manta etc. has that included. It is true that some of the larger variations get tricky...

I was under the impression that it was only lumpy that actually looked at the depth of the deleted segment. There's nothing about coverage/read depth filtering in either than manta user manual or manuscript. It has a high-depth filter and reports coverage at both breakends, but it doesn't actually check a DEL has a CN loss or a DUP has a CN gain over the segment in question.

Franky, I think you are quite deep in cancer calling these breakpoints. Which is fair and important. General SV calling also needs to detect SV that plays importance in population settings etc. You cannot just say there are no deletions and duplications. This is simply not true. Sorry if I misunderstood.

Even in the cancer setting most variants are simple DEL or DUP. Assuming a +- breakpoint is a DEL, and a -+ breakpoint is a DUP is an assumption that is probably correct for somatic analysis and almost always correct for germline analysis. The reason GRIDSS reports in BND notation is because that's what the GRIDSS evidence supports. A DEL call is an implicit claim of both a +- breakpoint and a copy number loss. If a caller hasn't check both of these, it shouldn't make a DEL call.