Closed schelhorn closed 9 years ago
Sven-Eric -- sorry for any confusion with this. I added additional documentation about structural variant calling and details about the ensemble method:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/pipelines.html#structural-variant-calling
To answer your specific questions:
Hope this helps with understanding the implementation. Let us know if anything else needs additional clarification.
Great, that helps a lot! I really appreciate the effort. However, I still do not understand the list of comma-delimited integers in the cnvkit
output SAMPLENAME-cnvkit.bed
. The entries are of the form:
11 54778990 62299974 SAMPLENAME 3,9 CCDC86,CD5,CD6,CLP1,CNTF,CPSF7,CTD-2331C18.5,CTD-2531D15.4,CTD-2531D15.5,CTD-3051L14.13 (...)
Could you clarify this last bit (the 3,9
) for me, Brad? Thanks.
Sven-Eric -- definitely, I added this to the docs as well. These are regions where there are multiple overlapping calls that get merged into one region. In this case you have amplifications of 3 and 9 predicted within the overall region. Typically these are more complex regions with a lot going on so it's hard to resolve clean breakpoints for events. Hope this helps.
Alright, in that case: bonus question! How come that I have 51 regions in SAMPLENAME-cnvkit.bed
, each of which longer than 2kbp, and only 2 of them made it into SAMPLENAME-sv-ensemble.bed
? Is thery any additional filtering going on? Sorry for being that nitpicky - I'm fine with it not being the perfect, gold-standard do-it-all SV ensemble tool, I just want to understand your reasoning.
Sven-Eric -- not at all, I really appreciate the feedback. Structural variant calling is a lot more difficult than small variant calling, so documenting and improving what we have is exactly the goal of making this available.
I added some additional documentation about the filtering done during ensemble calling. We also remove calls that fall into low complexity regions or high depth regions. The latter are a significant source of false positives for CNV calling, since coverage does look out of whack there but it it more likely due to mapping issues rather than a real event. Tumor/normal comparisons help filter some of these since both will have the spurious event, but if this is a single sample these might be what you see removing them. You can look in the structural/*/ensemble
directories to get a sense of what filters remove your calls. If you notice anything confusing/worrying at all happy to dig into it more. Thanks again.
As a late request, it would be very useful if the zygosity of the SV calls as determined by the non-CNV callers could also be reported in SAMPLENAME-sv-ensemble.bed
. For CNV-callers, this is implicitly done by the copy number value itself. For the other SV-callers such as lumpy
and delly
, these seem to report the GT
of the SV in their respective VCF files (0/0: hom ref, 0/1: het, 1/1: hom alt) . While homozygous SVs are rare, it does happen, and it would be relevant if the SV callers agree not only on the type and position of the SV but also on the zygosity, IMO.
Sven-Eric; We haven't started to explore genotype calling as part of the Ensemble/summarization work in bcbio. We could report this through but we don't have any benchmark sets to evaluate how good/bad we're doing. Additionally, given the current state of SV benchmarks, which I'd like to be more sensitive and specific, I'm not sure adding an extra layer of complexity on top is the best idea. The ensemble/summarization is more to identify regions of interest, then you can dig in and compare genotype calls for potential regions of interest. Hope this helps explains my current thinking on this after the validation work.
Thank you, that was very insightful. I fully agree that it only makes sense to include information if there is some measure of reliability of the data, so excluding ploidy seems to be the safest way for now.
At the moment the two
svcaller
lumpy and delly export VCF files which are easily interpretable. However, this is less so for the post-processed cnvkit results and the SV ensemble results. I would therefore like to ask for a brief documentation of these result, even if the corresponding bcbio modules are not finalized yet.In particular, for cnvkit, bcbio generates a bed file
SAMPLENAME-cnvkit.bed
of the row format:This file can be considered to be the primary output of cnvkit in this context since it is the only one with gene symbol annotation. However, of these fields, the comma-delimited integers are not easily interpretable - they may be absolute integer CNV calls, number of probes, or anything. A short documentation would be very helpful here.
Second, for the ensemble results, again the output of delly and lumpy are easily interpretable (deletions, inversions, duplications etc). However, the cnvkit results of the form
cnv1_cnvkit
are harder to interpret. Are these absolute integer CNV calls?Last, the ensemble voting scheme and notation is not totally clear to me. For instance, for a file
SAMPLENAME-sv-ensemble.bed
:I see some regions being a consensus of two callers while other regions only have one caller to speak for. Since the original results of the single callers seemed to include more regions, how is the selection shown here being made if not by voting (eg., 2 of 3)? And if by voting, how are integer CNV calls (if that is what they are) compared to structural events like deletions, inversions, duplications - are all duplications to expect to increase the integer ploidy call and all deletions expected to decrease that call, and regions are only reported if these calls are consistent across a majority of callers? If so, is the zygosity of SV events in the VCFs considered when doing these computations?