bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
985 stars 353 forks source link

GATK4 performance at higher depth/longer reads / Low freq allele filter #2623

Closed gabeng closed 5 years ago

gabeng commented 5 years ago

Brad,

The other day I was looking at WES data of different read length and evaluated the performance of GATK4 and freebayes using NA12878. When doing a series of analyses with data trimmed to different read lengths simulating 2x75, 2x100 and 2x150 cycles it turned out that GATK4 Indel calling performance does not improve with longer reads (and higher coverage). In fact, it seemed to slightly decline. Freebayes on the other hand got better with longer reads and higher coverage, as one would expect.

I can think of two reasons why this might be happening:

  1. At 2x150 cycles the fraction of overlapping forward and reverse reads is highest. Freebayes and GATK4 deal with these overlapping reads differently, and Freebayes' approach may be working better.
  2. The low freq allele filter applied to the GATK4 exclusively may be optimized for lower depth and may decrease performance at higher depth (or longer reads). Half of the FN indels were soft-filtered.

My question to you is: Have you ever made performance measurements of GATK4 germline calling while varying read length and/or depth? No. 1 may be an issue with WES data exclusively, since insert sizes tend to be much smaller for WES than for WGS. Or do you have any other ideas? Since the number of Indels in WES data is not that high, I am trying to decide if it is worthwhile to check this out with WGS data (when everything takes 100 times longer).

Your feedback is highly appreciated, Ben

chapmanb commented 5 years ago

Ben; Brilliant experiment, I haven't tried varying read lengths like this but would also like to copy in @naumenko-sa who has been doing some incredibly helpful experimentation with decoys (#2489) and might have looked at depth/read-size as well.

In what way to longer read GATK4 indels suffer: both sensitivity and specificity, or just in one metric? A nice way to move forward would be to identify some cases where shorter GATK4 and longer FreeBayes call NA12878 TPs that longer GATK4 misses. If we could put together some small example cases, the GATK4 team might be able to explain what we're seeing and confirm/deny your ideas about read length and filtering. If GATK4 is detecting the variants but they're soft filtered, maybe we could use this as an approach to be less strict and tweak filters with higher average depth.

Thanks again for this work and definitely happy to help figure out how we can make bcbio do better by default in these cases.

naumenko-sa commented 5 years ago

Hi Ben!

Those are interesting validation results! Could you please post your measurements (FDR and FNR%) and your bcbio config file? That will help to discuss them. I think you are right that the particular hard filters might influence precision/sensitivity - they are different for gatk4 and freebayes.

I have not done any validations with different read length and different coverage, but one of my colleagues did some validations varying coverage in gatk4. So if you post your results, we may try to review and give some feedback.

gatk4 is really fast - so you can go for WGS, or limit your analysis to one chromosome to get the first approximation. freebayes is not :), correct me, if I'm wrong, but, probably, it is not a practical choice for WGS variant calling.

Sergey

gabeng commented 5 years ago

Hi Sergey, hi Brad,

I am happy to share my data with you: I started with a single WES data set from NA12878 sequenced at 2x151 and 139 avg coverage (sample name 151). The average insert size is 242 bases in this data set (a broad distribution).

  1. I created two additional data sets from the original one by trimming the reads to 101 and 76 bases, respectively (sample names 101 and 76). Average coverage decreases in these samples accordingly.
  2. Subsequently, I subsampled 151 to match the average coverage of 101 and 76, respectively (sample names 76_sub and 101_sub).
  3. Variant calling was limited to the BED file of the manufacturer (no padding, hence the low number of variants). After a first run I created the intersection of the callable.bed of 151, 101, 76, used this as validate_regions and performed alignment and variant calling with GATK4 and Freebayes.

Just a side note: Is there any way to better control in which regions the validation is performed? In this experiment it would have been very useful to keep these regions constant no matter what the coverage.

I also tried more aggressive filtering of 151 for adapter content and low-quality bases at the tail. This only decreased sensitivity.

These are the results for GATK4:

Strikingly, the number of TP Indels decreases with longer reads. The absolute numbers are small, so I really need to repeat this with WGS data.

Sample name 151 101_sub 101 76_sub 76
Cycles 2x151 2x151 2x101 2x151 2x76
Avg coverage 139 105 103 81 80
Variant Caller GATK4 GATK4 GATK4 GATK4 GATK4
SNV TP 23041 23034 23025 23022 23006
SNV FP 83 84 94 90 123
SNV FN 43 50 59 62 78
Indel TP 967 965 978 966 979
Indel FP 190 189 193 180 200
Indel FN 105 107 94 106 93

The fn.vcf.gz from vcf-eval shows 21 soft-filtered variants (9 soft-filtered SNV). I was mistaken in my earlier statement that 50% of Indel FN were soft-filtered.

These are the results for Freebayes:

Freebayes shows a nice gradient towards longer reads and higher coverage.

Sample name 151 101_sub 101 76_sub 76
Cycles 2x151 2x151 2x101 2x151 2x76
Avg coverage 139 105 103 81 80
Variant Caller Freebayes Freebayes Freebayes Freebayes Freebayes
SNV TP 23066 23062 23062 23050 23046
SNV FP 251 250 222 251 208
SNV FN 18 22 22 34 38
Indel TP 950 942 919 937 871
Indel FP 67 72 61 80 61
Indel FN 122 130 153 135 201

It's clearer in a picture:

fdr-fnr

This is the bcbio.yaml:

---
details:
  - analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: false
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: [gatk-haplotype,freebayes]
      tools_off: [gemini]
      effects: false

Software versions are:

bcbio-nextgen,1.1.3a0-848e8de8
freebayes,1.1.0.46
gatk4,4.0.11.0

There is a public data set on basespace (NovaSeq S2: TruSeq DNA Enrichment (IDT xGen Exome Research Panel)) which shows the same behaviour but which has a quite short insert size of 156 bases.

Do you know of any high-coverage 2x151 WGS data sets of GiaB samples other than NA12878 and having a large insert size at the same time? I do not feel comfortable always using NA12878.

chapmanb commented 5 years ago

Ben; This is really great work, thank you for sharing. In my mind the next step would be to look in depth at the 11 and 12 indels you capture with trimmed reads and miss in the longer original reads. It's small enough that you might be able to identify patterns and determine if this is due to coverage, detection or filtering. That could give us an idea of how to tweak to improve sensitivity.

For your questions:

Thank you again for this discussion.

gabeng commented 5 years ago

Brad, I am attaching the VCF file that contains the FN variants that are FN in 2x151, but not FN in 2x101. It is those where a) the genotype is wrong or b) the soft-filter is set.

Many of them are in repetitive regions. My understanding of the GATK annotations is limited. The GATKCutoffIndel all seem to be filtered because of low AF. Two of the SNVs seem to be filtered by MQRankSum. Otherwise I do not see a pattern there. Any other ideas? 151_fn_annotated_not_101.vcf.gz

Thanks for the link to the data descriptions. I should have read this more carefully. I will try a portion of the NA24631 300x genome 2x250 next. I could then do an additional step in trimming (2x250, 2x150,...).

naumenko-sa commented 5 years ago

Hi Ben and Brad!

Ben, thanks for sharing your results! It motivated me to update my validations of WES!

Data:

Sample name NA12878
Cycles 2x100
Avg coverage for Ensembl protein coding exons 156
Median bp coverage for Ensembl protein coding exons 162
Reads (single) 142,641,214
Insert size 190

Methods: I intersect callable regions from bcbio x giab bed file x capture kit bed file = 50,436,223 bp in my case. It includes some non-coding sites, coding only should be 30M. Then I use RTG to report baseline-TP, FP, FN. My bcbio config:

details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    realign: false
    recalibrate: false
    remove_lcr: false
    save_diskspace: true
    tools_off:
    - vqsr
    tools_on:
    - noalt_calling
    variantcaller:
    - gatk-haplotype
    - freebayes
    validate: giab-NA12878/truth_small_variants.vcf.gz
    validate_regions: giab-NA12878/truth_regions.bed

Results (I combined with your results for better comparison):

SNPs:

Dataset NA12878_100 NA12878_100 101_sub 101 NA12878_100 101_sub 101
variant caller gatk 4.0.12.0 gatk 4.0.1.2 gatk 4.0.11.0 gatk 4.0.11.0 freebayes 1.1.0.46 freebayes 1.1.0.46 freebayes 1.1.0.46
tp 38,614 38,627 23,034 23,025 38,649 23062 23062
fp 203 211 84 94 285 250 222
fn 129 116 50 59 94 22 22
FDR 0.52% 0.54% 0.36% 0.41% 0.73% 1.07% 0.95%
FNR 0.33% 0.30% 0.22% 0.26% 0.24% 0.10% 0.10%

Indels

Dataset NA12878_100 NA12878_100 101_sub 101 NA12878_100 101_sub 101
variant caller gatk 4.0.12.0 gatk 4.0.1.2 gatk 4.0.11.0 gatk 4.0.11.0 freebayes 1.1.0.46 freebayes 1.1.0.46 freebayes 1.1.0.46
tp 2874 2883 965 966 2776 942 919
fp 364 384 189 180 238 72 61
fn 170 161 107 106 268 130 153
FDR 11.24% 11.75% 16.38% 15.71% 7.90% 7.10% 6.22%
FNR 5.58% 5.29% 9.98% 9.89% 8.80% 12.13% 14.27%
Target 38743 38743 23084 23084 38743 23084 23084

Conclusions (IMHO):

Sergey

gabeng commented 5 years ago

Sergey,

the difference for 2x100 Indel calling is most peculiar. However, I have seen stranger things when using subsets of NA12878. I don't have a good feeling staring at the same 100 Indels all the time. I am getting ready with some chr20 WGS data from other giab samples.

These are my datasets that I can make available for download for a few days: https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_S1_R1_001.fastq.gz https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_S1_R2_001.fastq.gz

This is the manufacturer's bed file: https://s3-eu-west-1.amazonaws.com/gatk-validation/enrichment_kit_3.bed

This is the bed file used as validate_regions which was created from the intersection of callable_regions of the 2x151, 2x101 and 2x76 data sets. https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_intersection.bed

Are you using the 1.1.2 release version? I will check that out, too.

gabeng commented 5 years ago

Sergey,

I repeated the experiment using NA12878 300x WGS data from GIAB cut down to chr20:

Variant Caller gatk-haplotype gatk-haplotype freebayes freebayes
Cycles 2x148 2x101 2x148 2x101
Panel WGS chr20 WGS chr20 WGS chr20 WGS chr20
Sample NA12878 NA12878 NA12878 NA12878
SNV TP 66115 66171 66256 66224
SNV FP 42 50 610 568
SNV FN 182 173 41 120
SNV FDR 0.06% 0.08% 0.91% 0.85%
SNV FNR 0.27% 0.26% 0.06% 0.18%
Indel TP 9989 9972 9663 9355
Indel FP 48 71 316 321
Indel FN 25 53 351 670
Indel FDR 0.48% 0.71% 3.17% 3.32%
Indel FNR 0.25% 0.53% 3.51% 6.68%

It is amazing what the homogeneous, unbiased WGS coverage is worth (compared to targeted sequencing). Still, @naumenko-sa, you achieve better performance on your exome data. Would you be able to provide me with the raw data from your exome? I would like to make sure that this really is an issue of target selection.

@chapmanb: Was your Exome comparison derived from real WES data? Could you provide some details about the data sets (coverage, insert size etc.)?

-- Ben

gabeng commented 5 years ago

Sergey,

tools_off: vqsr did not change a single digit in my results. I therefore assume that it was off all along.

naumenko-sa commented 5 years ago

Hi Ben!

Thanks for sharing WGS results. Yes, it is encouraging that WGS improves indel calling that much.

I'm using bcbio1.1.2.

Unfortunately, I'm not allowed to share raw WES data of NA12878. We ruled out vqsr, and you have good sequencing quality. The only explanation of lower FNR/FDR in your experiment, that I might think of, is coverage: you have 100X and in my experiment it was 150X.

Sergey

gabeng commented 5 years ago

Sergey,

the avg coverage in my data set was 148x. So I guess it is the selection of target region (the IDT kit is focused on CCDS almost exclusivley). I was wondering how @chapmanb was able to obtain so excellent results with the IDT Exome kit. I guess it must have been much higher coverage. Anyhow, thank you for sharint your results in comparison.

chapmanb commented 5 years ago

Ben and Sergey; Thanks for this great discussion. You're definitely right that evenness of coverage can help with resolution, especially when you drill in on individual variants. Looking at the variants Ben posted above they appear to be in trickier regions and also right around the cutoffs. I often see this kind of small fluctuation dependent on depth, strand bias or other metrics where you're passing right near the cutoff on one sample set and then failing on the other. This comes from having a standard set of filters, either soft cutoffs or general VQSR training approaches, being not as good as customizing per sample set. On the other side, having a good way to customizing and tuning for each sample set is not easy either.

For the IDT exome capture, I'm afraid I don't have all the details for that in terms of coverage. That was an internal dataset, but I do know that generally IDT capture kits do quite well in validations, primarily due to having more even coverage across target regions. We investigated some and this is due to bait distribution and design, but we focused more on what is captured as kits often have variable targets and we wanted to identify specific regions we needed to see good coverage on.

Sorry this is all more general, but there are a lot of rabbit holes here where I've ended up making practical decisions rather than going all the way down. Hope this is helpful and thanks again for the discussion.