Closed: gabeng closed this issue 5 years ago.
Ben; Brilliant experiment, I haven't tried varying read lengths like this but would also like to copy in @naumenko-sa who has been doing some incredibly helpful experimentation with decoys (#2489) and might have looked at depth/read-size as well.
In what way do the longer-read GATK4 indel calls suffer: both sensitivity and specificity, or just one of the two metrics? A nice way to move forward would be to identify some cases where shorter-read GATK4 and longer-read FreeBayes call NA12878 TPs that longer-read GATK4 misses. If we could put together some small example cases, the GATK4 team might be able to explain what we're seeing and confirm/deny your ideas about read length and filtering. If GATK4 is detecting the variants but they're soft filtered, maybe we could use this as an approach to be less strict and tweak filters at higher average depth.
Thanks again for this work and definitely happy to help figure out how we can make bcbio do better by default in these cases.
Hi Ben!
Those are interesting validation results! Could you please post your measurements (FDR and FNR%) and your bcbio config file? That will help to discuss them. I think you are right that the particular hard filters might influence precision/sensitivity - they are different for gatk4 and freebayes.
I have not done any validations with different read length and different coverage, but one of my colleagues did some validations varying coverage in gatk4. So if you post your results, we may try to review and give some feedback.
gatk4 is really fast, so you can go for WGS, or limit your analysis to one chromosome to get a first approximation. freebayes is not :). Correct me if I'm wrong, but it is probably not a practical choice for WGS variant calling.
Sergey
Hi Sergey, hi Brad,
I am happy to share my data with you: I started with a single WES data set from NA12878 sequenced at 2x151 and 139 avg coverage (sample name 151). The average insert size is 242 bases in this data set (a broad distribution).
I intersected the callable.bed files of 151, 101, and 76, used this as validate_regions, and performed alignment and variant calling with GATK4 and FreeBayes. Just a side note: is there any way to better control in which regions the validation is performed? In this experiment it would have been very useful to keep these regions constant no matter what the coverage.
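Keeping the evaluation regions constant comes down to intersecting the per-dataset callable BED intervals (bedtools intersect does this on files); a minimal pure-Python sketch of the same interval-intersection logic, using hypothetical coordinates on one contig:

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping interval lists [(start, end), ...]."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical callable regions from the 2x151 and 2x76 runs on one contig
callable_151 = [(100, 500), (800, 1200)]
callable_76 = [(50, 300), (450, 900)]
print(intersect_intervals(callable_151, callable_76))  # [(100, 300), (450, 500), (800, 900)]
```

Chaining this across all three read-length datasets yields a single constant BED to pass as validate_regions.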
I also tried more aggressive filtering of 151 for adapter content and low-quality bases at the tail. This only decreased sensitivity.
These are the results for GATK4:
Strikingly, the number of TP Indels decreases with longer reads. The absolute numbers are small, so I really need to repeat this with WGS data.
Sample name | 151 | 101_sub | 101 | 76_sub | 76 |
---|---|---|---|---|---|
Cycles | 2x151 | 2x151 | 2x101 | 2x151 | 2x76 |
Avg coverage | 139 | 105 | 103 | 81 | 80 |
Variant Caller | GATK4 | GATK4 | GATK4 | GATK4 | GATK4 |
SNV TP | 23041 | 23034 | 23025 | 23022 | 23006 |
SNV FP | 83 | 84 | 94 | 90 | 123 |
SNV FN | 43 | 50 | 59 | 62 | 78 |
Indel TP | 967 | 965 | 978 | 966 | 979 |
Indel FP | 190 | 189 | 193 | 180 | 200 |
Indel FN | 105 | 107 | 94 | 106 | 93 |
The fn.vcf.gz from vcf-eval shows 21 soft-filtered variants (9 soft-filtered SNVs). I was mistaken in my earlier statement that 50% of the indel FNs were soft-filtered.
These are the results for Freebayes:
Freebayes shows a nice gradient towards longer reads and higher coverage.
Sample name | 151 | 101_sub | 101 | 76_sub | 76 |
---|---|---|---|---|---|
Cycles | 2x151 | 2x151 | 2x101 | 2x151 | 2x76 |
Avg coverage | 139 | 105 | 103 | 81 | 80 |
Variant Caller | Freebayes | Freebayes | Freebayes | Freebayes | Freebayes |
SNV TP | 23066 | 23062 | 23062 | 23050 | 23046 |
SNV FP | 251 | 250 | 222 | 251 | 208 |
SNV FN | 18 | 22 | 22 | 34 | 38 |
Indel TP | 950 | 942 | 919 | 937 | 871 |
Indel FP | 67 | 72 | 61 | 80 | 61 |
Indel FN | 122 | 130 | 153 | 135 | 201 |
It's clearer in a picture:
This is the bcbio.yaml:

```yaml
---
details:
  - analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: false
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: [gatk-haplotype, freebayes]
      tools_off: [gemini]
      effects: false
```
Software versions are:
bcbio-nextgen,1.1.3a0-848e8de8
freebayes,1.1.0.46
gatk4,4.0.11.0
There is a public data set on basespace (NovaSeq S2: TruSeq DNA Enrichment (IDT xGen Exome Research Panel)) which shows the same behaviour but which has a quite short insert size of 156 bases.
Do you know of any high-coverage 2x151 WGS data sets of GiaB samples other than NA12878 and having a large insert size at the same time? I do not feel comfortable always using NA12878.
Ben; This is really great work, thank you for sharing. In my mind the next step would be to look in depth at the 11 and 12 indels you capture with trimmed reads and miss in the longer original reads. It's small enough that you might be able to identify patterns and determine if this is due to coverage, detection or filtering. That could give us an idea of how to tweak to improve sensitivity.
For your questions: you can keep the evaluated regions constant by supplying a fixed BED file as variant_regions for the run. Thank you again for this discussion.
Brad, I am attaching the VCF file that contains the FN variants that are FN in 2x151, but not FN in 2x101. It is those where a) the genotype is wrong or b) the soft-filter is set.
Many of them are in repetitive regions. My understanding of the GATK annotations is limited. The GATKCutoffIndel all seem to be filtered because of low AF. Two of the SNVs seem to be filtered by MQRankSum. Otherwise I do not see a pattern there. Any other ideas? 151_fn_annotated_not_101.vcf.gz
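To look for patterns in filter annotations like these, one way is to tally the FILTER column of the FN records. A minimal sketch using hypothetical VCF lines (a real fn.vcf.gz would be read with gzip.open; the record contents here are invented for illustration):

```python
from collections import Counter

def tally_filters(vcf_lines):
    """Count FILTER values (column 7) across VCF records, skipping header lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        counts[fields[6]] += 1
    return counts

# Hypothetical records mimicking the soft-filtered FNs discussed above
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tAT\t50\tGATKCutoffIndel\t.",
    "1\t2000\t.\tC\tG\t60\tGATKCutoffSNP\t.",
    "1\t3000\t.\tG\tGA\t40\tGATKCutoffIndel\t.",
]
print(tally_filters(records))
```

Sorting the resulting counter quickly shows whether one hard filter (e.g. GATKCutoffIndel) dominates the soft-filtered false negatives.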
Thanks for the link to the data descriptions. I should have read this more carefully. I will try a portion of the NA24631 300x genome 2x250 next. I could then do an additional step in trimming (2x250, 2x150,...).
Hi Ben and Brad!
Ben, thanks for sharing your results! It motivated me to update my validations of WES!
Data:
Sample name | NA12878 |
---|---|
Cycles | 2x100 |
Avg coverage for Ensembl protein coding exons | 156 |
Median bp coverage for Ensembl protein coding exons | 162 |
Reads (single) | 142,641,214 |
Insert size | 190 |
Methods: I intersect callable regions from bcbio x giab bed file x capture kit bed file = 50,436,223 bp in my case. It includes some non-coding sites, coding only should be 30M. Then I use RTG to report baseline-TP, FP, FN. My bcbio config:
```yaml
details:
  - algorithm:
      aligner: bwa
      effects: false
      mark_duplicates: true
      realign: false
      recalibrate: false
      remove_lcr: false
      save_diskspace: true
      tools_off:
        - vqsr
      tools_on:
        - noalt_calling
      variantcaller:
        - gatk-haplotype
        - freebayes
      validate: giab-NA12878/truth_small_variants.vcf.gz
      validate_regions: giab-NA12878/truth_regions.bed
```
Results (I combined with your results for better comparison):
SNPs:
Dataset | NA12878_100 | NA12878_100 | 101_sub | 101 | NA12878_100 | 101_sub | 101 |
---|---|---|---|---|---|---|---|
variant caller | gatk 4.0.12.0 | gatk 4.0.1.2 | gatk 4.0.11.0 | gatk 4.0.11.0 | freebayes 1.1.0.46 | freebayes 1.1.0.46 | freebayes 1.1.0.46 |
tp | 38,614 | 38,627 | 23,034 | 23,025 | 38,649 | 23062 | 23062 |
fp | 203 | 211 | 84 | 94 | 285 | 250 | 222 |
fn | 129 | 116 | 50 | 59 | 94 | 22 | 22 |
FDR | 0.52% | 0.54% | 0.36% | 0.41% | 0.73% | 1.07% | 0.95% |
FNR | 0.33% | 0.30% | 0.22% | 0.26% | 0.24% | 0.10% | 0.10% |
Indels:
Dataset | NA12878_100 | NA12878_100 | 101_sub | 101 | NA12878_100 | 101_sub | 101 |
---|---|---|---|---|---|---|---|
variant caller | gatk 4.0.12.0 | gatk 4.0.1.2 | gatk 4.0.11.0 | gatk 4.0.11.0 | freebayes 1.1.0.46 | freebayes 1.1.0.46 | freebayes 1.1.0.46 |
tp | 2874 | 2883 | 965 | 966 | 2776 | 942 | 919 |
fp | 364 | 384 | 189 | 180 | 238 | 72 | 61 |
fn | 170 | 161 | 107 | 106 | 268 | 130 | 153 |
FDR | 11.24% | 11.75% | 16.38% | 15.71% | 7.90% | 7.10% | 6.22% |
FNR | 5.58% | 5.29% | 9.98% | 9.89% | 8.80% | 12.13% | 14.27% |
Target (SNPs, TP+FN) | 38743 | 38743 | 23084 | 23084 | 38743 | 23084 | 23084 |
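The FDR and FNR rows in the tables follow directly from the counts: FDR = FP/(TP+FP) and FNR = FN/(TP+FN). A quick sketch that reproduces the percentages from a table column:

```python
def fdr_fnr(tp, fp, fn):
    """False discovery rate and false negative rate, as percentages rounded to 2 decimals."""
    fdr = 100.0 * fp / (tp + fp)
    fnr = 100.0 * fn / (tp + fn)
    return round(fdr, 2), round(fnr, 2)

# Indel counts for the 101_sub GATK4 column above
print(fdr_fnr(tp=965, fp=189, fn=107))  # matches the 16.38% / 9.98% in the table
```

Note that TP+FN also recovers the "Target" row (baseline variants in the evaluated regions), e.g. 23034 + 50 = 23084 for the 101_sub SNPs.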
Conclusions (IMHO):
Sergey
Sergey,
the difference for 2x100 indel calling is most peculiar. However, I have seen stranger things when using subsets of NA12878. I don't feel comfortable staring at the same 100 indels all the time. I am getting ready with some chr20 WGS data from other GiaB samples.
I thought tools_off: vqsr was the default. Is that not the case? Sure, I will try that if it's not off! These are my datasets, which I can make available for download for a few days:
https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_S1_R1_001.fastq.gz
https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_S1_R2_001.fastq.gz
This is the manufacturer's bed file: https://s3-eu-west-1.amazonaws.com/gatk-validation/enrichment_kit_3.bed
This is the BED file used as validate_regions, which was created from the intersection of the callable_regions of the 2x151, 2x101, and 2x76 data sets:
https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_intersection.bed
Are you using the 1.1.2 release version? I will check that out, too.
Sergey,
I repeated the experiment using NA12878 300x WGS data from GIAB cut down to chr20:
Variant Caller | gatk-haplotype | gatk-haplotype | freebayes | freebayes |
---|---|---|---|---|
Cycles | 2x148 | 2x101 | 2x148 | 2x101 |
Panel | WGS chr20 | WGS chr20 | WGS chr20 | WGS chr20 |
Sample | NA12878 | NA12878 | NA12878 | NA12878 |
SNV TP | 66115 | 66171 | 66256 | 66224 |
SNV FP | 42 | 50 | 610 | 568 |
SNV FN | 182 | 173 | 41 | 120 |
SNV FDR | 0.06% | 0.08% | 0.91% | 0.85% |
SNV FNR | 0.27% | 0.26% | 0.06% | 0.18% |
Indel TP | 9989 | 9972 | 9663 | 9355 |
Indel FP | 48 | 71 | 316 | 321 |
Indel FN | 25 | 53 | 351 | 670 |
Indel FDR | 0.48% | 0.71% | 3.17% | 3.32% |
Indel FNR | 0.25% | 0.53% | 3.51% | 6.68% |
It is remarkable how much the homogeneous, unbiased WGS coverage is worth (compared to targeted sequencing). Still, @naumenko-sa, you achieve better performance on your exome data. Would you be able to provide me with the raw data from your exome? I would like to make sure that this really is an issue of target selection.
@chapmanb: Was your Exome comparison derived from real WES data? Could you provide some details about the data sets (coverage, insert size etc.)?
-- Ben
Sergey,
tools_off: vqsr
did not change a single digit in my results. I therefore assume that it was off all along.
Hi Ben!
Thanks for sharing WGS results. Yes, it is encouraging that WGS improves indel calling that much.
I'm using bcbio1.1.2.
Unfortunately, I'm not allowed to share the raw WES data of NA12878. We ruled out vqsr, and you have good sequencing quality. The only explanation for the lower FNR/FDR in your experiment that I can think of is coverage: you have 100X, and in my experiment it was 150X.
Sergey
Sergey,
the avg coverage in my data set was 148x. So I guess it is the selection of the target region (the IDT kit is focused almost exclusively on CCDS). I was wondering how @chapmanb was able to obtain such excellent results with the IDT exome kit. I guess it must have been much higher coverage. Anyhow, thank you for sharing your results for comparison.
Ben and Sergey; Thanks for this great discussion. You're definitely right that evenness of coverage can help with resolution, especially when you drill in on individual variants. Looking at the variants Ben posted above, they appear to be in trickier regions and also right around the cutoffs. I often see this kind of small fluctuation dependent on depth, strand bias or other metrics, where you're passing right near the cutoff on one sample set and then failing on the other. This comes from a standard set of filters, either soft cutoffs or general VQSR training approaches, not being as good as filters customized per sample set. On the other hand, having a good way to customize and tune for each sample set is not easy either.
For the IDT exome capture, I'm afraid I don't have all the details in terms of coverage. That was an internal dataset, but I do know that IDT capture kits generally do quite well in validations, primarily due to having more even coverage across target regions. We investigated some, and this is due to bait distribution and design, but we focused more on what is captured, as kits often have variable targets and we wanted to identify specific regions where we needed to see good coverage.
Sorry this is all more general, but there are a lot of rabbit holes here where I've ended up making practical decisions rather than going all the way down. Hope this is helpful and thanks again for the discussion.
Brad,
The other day I was looking at WES data of different read lengths and evaluated the performance of GATK4 and FreeBayes using NA12878. When doing a series of analyses with data trimmed to different read lengths, simulating 2x75, 2x100 and 2x150 cycles, it turned out that GATK4 indel calling performance does not improve with longer reads (and higher coverage). In fact, it seemed to decline slightly. FreeBayes, on the other hand, got better with longer reads and higher coverage, as one would expect.
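Trimming reads to a fixed length (to simulate shorter cycles from the same library) can be sketched as below. In practice a dedicated trimmer or seqtk trimfq would operate on gzipped FASTQ files; the plain four-line record handling here is a simplified assumption for illustration:

```python
def trim_fastq_records(lines, length):
    """Trim the sequence and quality strings of 4-line FASTQ records to a fixed length."""
    out = []
    for i, line in enumerate(lines):
        if i % 4 in (1, 3):  # sequence (line 2) and quality (line 4) of each record
            out.append(line[:length])
        else:  # header and separator lines pass through unchanged
            out.append(line)
    return out

# Hypothetical record, trimmed to 10 bases (shortened here for display)
record = ["@read1", "ACGT" * 5, "+", "I" * 20]
print(trim_fastq_records(record, 10))
```

Trimming only from the 3' end keeps the same read starts, so the simulated shorter-cycle runs stay comparable to the original library apart from length and total yield.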
I can think of two reasons why this might be happening:
My question to you is: have you ever made performance measurements of GATK4 germline calling while varying read length and/or depth? No. 1 may be an issue with WES data exclusively, since insert sizes tend to be much smaller for WES than for WGS. Or do you have any other ideas? Since the number of indels in WES data is not that high, I am trying to decide whether it is worthwhile to check this out with WGS data (when everything takes 100 times longer).
Your feedback is highly appreciated, Ben