Provide better explanation of SV caller results, especially for cnvkit and ensemble results

schelhorn commented 9 years ago

At the moment the two svcaller lumpy and delly export VCF files which are easily interpretable. However, this is less so for the post-processed cnvkit results and the SV ensemble results. I would therefore like to ask for a brief documentation of these result, even if the corresponding bcbio modules are not finalized yet.

In particular, for cnvkit, bcbio generates a bed file SAMPLENAME-cnvkit.bed of the row format:

chr start end sample_name comma_delimited_integers comma_delimited_gene_symbols

This file can be considered to be the primary output of cnvkit in this context since it is the only one with gene symbol annotation. However, of these fields, the comma-delimited integers are not easily interpretable - they may be absolute integer CNV calls, number of probes, or anything. A short documentation would be very helpful here.

Second, for the ensemble results, again the output of delly and lumpy are easily interpretable (deletions, inversions, duplications etc). However, the cnvkit results of the form cnv1_cnvkit are harder to interpret. Are these absolute integer CNV calls?

Last, the ensemble voting scheme and notation is not totally clear to me. For instance, for a file SAMPLENAME-sv-ensemble.bed:

11  47644328    47647227    DEL_delly,DEL_lumpy MTCH2
12  6933235 6933351 DEL_delly,DEL_lumpy GPR162
12  104379503   104380725   DEL_delly,DEL_lumpy TDG
19  55450664    55494642    INV_lumpy   CTC-550B14.1,CTC-550B14.8,NLRP2,NLRP7
2   179315136   179315692   DEL_delly,DEL_lumpy PRKRA
22  21562926    21983679    cnv1_cnvkit BCRP6,FAM230C,GGT2,HIC2,KB-1183D5.14,KB-1183D5.16,KB-1183D5.9,PI4KAP2,POM121L8P,PPP1R26P5,RIMBP3B,RIMBP3C,RN7SKP221,RN7SKP63,SCARNA17,SCARNA18,TMEM191C,UBE2L3,YDJC,snoU13
5   140563540   140573621   DEL_lumpy   PCDHB10,PCDHB16,PCDHB9
5   178139360   178311022   INV_lumpy   AACSP1,RP11-21I4.2,ZNF354A,ZNF354B
6   31963925    32012367    cnv1_cnvkit AL645922.1,C4A,C4A-AS1,C4B,C4B-AS1,CYP21A1P,CYP21A2,STK19P,TNXA,TNXB

I see some regions being a consensus of two callers while other regions only have one caller to speak for. Since the original results of the single callers seemed to include more regions, how is the selection shown here being made if not by voting (eg., 2 of 3)? And if by voting, how are integer CNV calls (if that is what they are) compared to structural events like deletions, inversions, duplications - are all duplications to expect to increase the integer ploidy call and all deletions expected to decrease that call, and regions are only reported if these calls are consistent across a majority of callers? If so, is the zygosity of SV events in the VCFs considered when doing these computations?

chapmanb commented 9 years ago

Sven-Eric -- sorry for any confusion with this. I added additional documentation about structural variant calling and details about the ensemble method:

https://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#structural-variant-calling

To answer your specific questions:

The integers reported for CNVs are the predicted copy numbers. These also get passed through to the ensemble representation.
The ensemble caller is really simple and is meant as a flat representation that is easy to interpret. The algorithm is to pick calls with support from 2 or more callers for small (<2kb) events and keep larger events. This validated best in our methods without getting into anything more complicated. This does not consider genotyping from the SV events and treats CNVs < 2 as deletions and CNVs > 2 as insertions/duplications.

Hope this helps with understanding the implementation. Let us know if anything else needs additional clarification.

schelhorn commented 9 years ago

Great, that helps a lot! I really appreciate the effort. However, I still do not understand the list of comma-delimited integers in the cnvkit output SAMPLENAME-cnvkit.bed. The entries are of the form:

11  54778990    62299974    SAMPLENAME  3,9 CCDC86,CD5,CD6,CLP1,CNTF,CPSF7,CTD-2331C18.5,CTD-2531D15.4,CTD-2531D15.5,CTD-3051L14.13 (...)

Could you clarify this last bit (the 3,9) for me, Brad? Thanks.

chapmanb commented 9 years ago

Sven-Eric -- definitely, I added this to the docs as well. These are regions where there are multiple overlapping calls that get merged into one region. In this case you have amplifications of 3 and 9 predicted within the overall region. Typically these are more complex regions with a lot going on so it's hard to resolve clean breakpoints for events. Hope this helps.

schelhorn commented 9 years ago

Alright, in that case: bonus question! How come that I have 51 regions in SAMPLENAME-cnvkit.bed, each of which longer than 2kbp, and only 2 of them made it into SAMPLENAME-sv-ensemble.bed? Is thery any additional filtering going on? Sorry for being that nitpicky - I'm fine with it not being the perfect, gold-standard do-it-all SV ensemble tool, I just want to understand your reasoning.

chapmanb commented 9 years ago

Sven-Eric -- not at all, I really appreciate the feedback. Structural variant calling is a lot more difficult than small variant calling, so documenting and improving what we have is exactly the goal of making this available.

I added some additional documentation about the filtering done during ensemble calling. We also remove calls that fall into low complexity regions or high depth regions. The latter are a significant source of false positives for CNV calling, since coverage does look out of whack there but it it more likely due to mapping issues rather than a real event. Tumor/normal comparisons help filter some of these since both will have the spurious event, but if this is a single sample these might be what you see removing them. You can look in the structural/*/ensemble directories to get a sense of what filters remove your calls. If you notice anything confusing/worrying at all happy to dig into it more. Thanks again.

schelhorn commented 9 years ago

As a late request, it would be very useful if the zygosity of the SV calls as determined by the non-CNV callers could also be reported in SAMPLENAME-sv-ensemble.bed. For CNV-callers, this is implicitly done by the copy number value itself. For the other SV-callers such as lumpy and delly, these seem to report the GT of the SV in their respective VCF files (0/0: hom ref, 0/1: het, 1/1: hom alt) . While homozygous SVs are rare, it does happen, and it would be relevant if the SV callers agree not only on the type and position of the SV but also on the zygosity, IMO.

chapmanb commented 9 years ago

Sven-Eric; We haven't started to explore genotype calling as part of the Ensemble/summarization work in bcbio. We could report this through but we don't have any benchmark sets to evaluate how good/bad we're doing. Additionally, given the current state of SV benchmarks, which I'd like to be more sensitive and specific, I'm not sure adding an extra layer of complexity on top is the best idea. The ensemble/summarization is more to identify regions of interest, then you can dig in and compare genotype calls for potential regions of interest. Hope this helps explains my current thinking on this after the validation work.

schelhorn commented 9 years ago

Thank you, that was very insightful. I fully agree that it only makes sense to include information if there is some measure of reliability of the data, so excluding ploidy seems to be the safest way for now.

bcbio / bcbio-nextgen

Provide better explanation of SV caller results, especially for cnvkit and ensemble results #872