GATK4 performance at higher depth/longer reads / Low freq allele filter

gabeng commented 5 years ago

Brad,

The other day I was looking at WES data of different read length and evaluated the performance of GATK4 and freebayes using NA12878. When doing a series of analyses with data trimmed to different read lengths simulating 2x75, 2x100 and 2x150 cycles it turned out that GATK4 Indel calling performance does not improve with longer reads (and higher coverage). In fact, it seemed to slightly decline. Freebayes on the other hand got better with longer reads and higher coverage, as one would expect.

I can think of two reasons why this might be happening:

At 2x150 cycles the fraction of overlapping forward and reverse reads is highest. Freebayes and GATK4 deal with these overlapping reads differently, and Freebayes' approach may be working better.
The low freq allele filter applied to the GATK4 exclusively may be optimized for lower depth and may decrease performance at higher depth (or longer reads). Half of the FN indels were soft-filtered.

My question to you is: Have you ever made performance measurements of GATK4 germline calling while varying read length and/or depth? No. 1 may be an issue with WES data exclusively, since insert sizes tend to be much smaller for WES than for WGS. Or do you have any other ideas? Since the number of Indels in WES data is not that high, I am trying to decide if it is worthwhile to check this out with WGS data (when everything takes 100 times longer).

Your feedback is highly appreciated, Ben

chapmanb commented 5 years ago

Ben; Brilliant experiment, I haven't tried varying read lengths like this but would also like to copy in @naumenko-sa who has been doing some incredibly helpful experimentation with decoys (#2489) and might have looked at depth/read-size as well.

In what way to longer read GATK4 indels suffer: both sensitivity and specificity, or just in one metric? A nice way to move forward would be to identify some cases where shorter GATK4 and longer FreeBayes call NA12878 TPs that longer GATK4 misses. If we could put together some small example cases, the GATK4 team might be able to explain what we're seeing and confirm/deny your ideas about read length and filtering. If GATK4 is detecting the variants but they're soft filtered, maybe we could use this as an approach to be less strict and tweak filters with higher average depth.

Thanks again for this work and definitely happy to help figure out how we can make bcbio do better by default in these cases.

naumenko-sa commented 5 years ago

Hi Ben!

Those are interesting validation results! Could you please post your measurements (FDR and FNR%) and your bcbio config file? That will help to discuss them. I think you are right that the particular hard filters might influence precision/sensitivity - they are different for gatk4 and freebayes.

I have not done any validations with different read length and different coverage, but one of my colleagues did some validations varying coverage in gatk4. So if you post your results, we may try to review and give some feedback.

gatk4 is really fast - so you can go for WGS, or limit your analysis to one chromosome to get the first approximation. freebayes is not :), correct me, if I'm wrong, but, probably, it is not a practical choice for WGS variant calling.

Sergey

gabeng commented 5 years ago

Hi Sergey, hi Brad,

I am happy to share my data with you: I started with a single WES data set from NA12878 sequenced at 2x151 and 139 avg coverage (sample name 151). The average insert size is 242 bases in this data set (a broad distribution).

I created two additional data sets from the original one by trimming the reads to 101 and 76 bases, respectively (sample names 101 and 76). Average coverage decreases in these samples accordingly.
Subsequently, I subsampled 151 to match the average coverage of 101 and 76, respectively (sample names 76_sub and 101_sub).
Variant calling was limited to the BED file of the manufacturer (no padding, hence the low number of variants). After a first run I created the intersection of the callable.bed of 151, 101, 76, used this as validate_regions and performed alignment and variant calling with GATK4 and Freebayes.

Just a side note: Is there any way to better control in which regions the validation is performed? In this experiment it would have been very useful to keep these regions constant no matter what the coverage.

I also tried more aggressive filtering of 151 for adapter content and low-quality bases at the tail. This only decreased sensitivity.

These are the results for GATK4:

Strikingly, the number of TP Indels decreases with longer reads. The absolute numbers are small, so I really need to repeat this with WGS data.

Sample name	151	101_sub	101	76_sub	76
Cycles	2x151	2x151	2x101	2x151	2x76
Avg coverage	139	105	103	81	80
Variant Caller	GATK4	GATK4	GATK4	GATK4	GATK4
SNV TP	23041	23034	23025	23022	23006
SNV FP	83	84	94	90	123
SNV FN	43	50	59	62	78
Indel TP	967	965	978	966	979
Indel FP	190	189	193	180	200
Indel FN	105	107	94	106	93

The fn.vcf.gz from vcf-eval shows 21 soft-filtered variants (9 soft-filtered SNV). I was mistaken in my earlier statement that 50% of Indel FN were soft-filtered.

These are the results for Freebayes:

Freebayes shows a nice gradient towards longer reads and higher coverage.

Sample name	151	101_sub	101	76_sub	76
Cycles	2x151	2x151	2x101	2x151	2x76
Avg coverage	139	105	103	81	80
Variant Caller	Freebayes	Freebayes	Freebayes	Freebayes	Freebayes
SNV TP	23066	23062	23062	23050	23046
SNV FP	251	250	222	251	208
SNV FN	18	22	22	34	38
Indel TP	950	942	919	937	871
Indel FP	67	72	61	80	61
Indel FN	122	130	153	135	201

It's clearer in a picture:

fdr-fnr

This is the bcbio.yaml:

---
details:
  - analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      align_split_size: false
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: [gatk-haplotype,freebayes]
      tools_off: [gemini]
      effects: false

Software versions are:

bcbio-nextgen,1.1.3a0-848e8de8
freebayes,1.1.0.46
gatk4,4.0.11.0

There is a public data set on basespace (NovaSeq S2: TruSeq DNA Enrichment (IDT xGen Exome Research Panel)) which shows the same behaviour but which has a quite short insert size of 156 bases.

Do you know of any high-coverage 2x151 WGS data sets of GiaB samples other than NA12878 and having a large insert size at the same time? I do not feel comfortable always using NA12878.

chapmanb commented 5 years ago

Ben; This is really great work, thank you for sharing. In my mind the next step would be to look in depth at the 11 and 12 indels you capture with trimmed reads and miss in the longer original reads. It's small enough that you might be able to identify patterns and determine if this is due to coverage, detection or filtering. That could give us an idea of how to tweak to improve sensitivity.

For your questions:

We don't have a way to not use callable regions in the automated validation. If you want to only look at regions with good coverage in everything, you could take the intersection of all the callable regions and variant regions BEDs, then use this as variant_regions for the run.
For other datasets, I'm not sure about insert size but we have a bunch of different germline validation input datasets from the other GiaB samples collected here: https://github.com/bcbio/bcbio_validation_workflows

Thank you again for this discussion.

gabeng commented 5 years ago

Brad, I am attaching the VCF file that contains the FN variants that are FN in 2x151, but not FN in 2x101. It is those where a) the genotype is wrong or b) the soft-filter is set.

Many of them are in repetitive regions. My understanding of the GATK annotations is limited. The GATKCutoffIndel all seem to be filtered because of low AF. Two of the SNVs seem to be filtered by MQRankSum. Otherwise I do not see a pattern there. Any other ideas? 151_fn_annotated_not_101.vcf.gz

Thanks for the link to the data descriptions. I should have read this more carefully. I will try a portion of the NA24631 300x genome 2x250 next. I could then do an additional step in trimming (2x250, 2x150,...).

naumenko-sa commented 5 years ago

Hi Ben and Brad!

Ben, thanks for sharing your results! It motivated me to update my validations of WES!

Data:

Sample name	NA12878
Cycles	2x100
Avg coverage for Ensembl protein coding exons	156
Median bp coverage for Ensembl protein coding exons	162
Reads (single)	142,641,214
Insert size	190

Methods: I intersect callable regions from bcbio x giab bed file x capture kit bed file = 50,436,223 bp in my case. It includes some non-coding sites, coding only should be 30M. Then I use RTG to report baseline-TP, FP, FN. My bcbio config:

details:
- algorithm:
    aligner: bwa
    effects: false
    mark_duplicates: true
    realign: false
    recalibrate: false
    remove_lcr: false
    save_diskspace: true
    tools_off:
    - vqsr
    tools_on:
    - noalt_calling
    variantcaller:
    - gatk-haplotype
    - freebayes
    validate: giab-NA12878/truth_small_variants.vcf.gz
    validate_regions: giab-NA12878/truth_regions.bed

Results (I combined with your results for better comparison):

SNPs:

Dataset	NA12878_100	NA12878_100	101_sub	101	NA12878_100	101_sub	101
variant caller	gatk 4.0.12.0	gatk 4.0.1.2	gatk 4.0.11.0	gatk 4.0.11.0	freebayes 1.1.0.46	freebayes 1.1.0.46	freebayes 1.1.0.46
tp	38,614	38,627	23,034	23,025	38,649	23062	23062
fp	203	211	84	94	285	250	222
fn	129	116	50	59	94	22	22
FDR	0.52%	0.54%	0.36%	0.41%	0.73%	1.07%	0.95%
FNR	0.33%	0.30%	0.22%	0.26%	0.24%	0.10%	0.10%

Indels

Dataset	NA12878_100	NA12878_100	101_sub	101	NA12878_100	101_sub	101
variant caller	gatk 4.0.12.0	gatk 4.0.1.2	gatk 4.0.11.0	gatk 4.0.11.0	freebayes 1.1.0.46	freebayes 1.1.0.46	freebayes 1.1.0.46
tp	2874	2883	965	966	2776	942	919
fp	364	384	189	180	238	72	61
fn	170	161	107	106	268	130	153
FDR	11.24%	11.75%	16.38%	15.71%	7.90%	7.10%	6.22%
FNR	5.58%	5.29%	9.98%	9.89%	8.80%	12.13%	14.27%
Target	38743	38743	23084	23084	38743	23084	23084

Conclusions (IMHO):

30X WGS with larger insert size (400) gives a very good improvement for indel FDR and FNR ~1% compared to >100X WES with IS 200 where FDR/FNR ~ 5-15% (see the comparison with WGS validation https://github.com/bcbio/bcbio-nextgen/issues/2489)
our results for SNPs are concordant, given that I had a larger target, it resulted in slightly higher FNR/FDR
our results on Indels (GATK FDR and FNR, Freebayes FNR) are discordant: your FDR/FNR are higher even with the smaller target in your experiment.
- What is your QC on base quality? Is it dropping fast towards the end of the read?
- I might suggest to investigate the dataset effect on Indels FNR/FDR. Probably it is a dataset effect rather than GATK effect, or at least GATK effect would be smaller on magnitude after dataset effect correction?
- If the dataset effect is not the cause of the higher FDR/FNR in your experiment for indels, another factor that might influence low GATK performance in indels is VQSR, see my WGS validation. Would you like to try tools_off: vqsr and compare?
probably you forgot to upgrade bcbio tools, as I'm on bcbio 1.1.2 and have gatk 4.0.12.0, and you are on bcbio 1.1.3 and gatk 4.0.11.0. But I think it does not influence the results - see my comparison with gatk 4.0.1.2.
sensitivity and specificity are tuned differently in GATK and Freebayes. We can benefit from combining calls with ensemble approach in bcbio.

Sergey

gabeng commented 5 years ago

Sergey,

the difference for 2x100 Indel calling is most peculiar. However, I have seen stranger things when using subsets of NA12878. I don't have a good feeling staring at the same 100 Indels all the time. I am getting ready with some chr20 WGS data from other giab samples.

Basecall quality was ok in my view .
I had thought that in runs with few samples, tools_off: vqsr was the default. Is that not the case? Sure I will try that if it's not off!

These are my datasets that I can make available for download for a few days: https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_S1_R1_001.fastq.gz https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_S1_R2_001.fastq.gz

This is the manufacturer's bed file: https://s3-eu-west-1.amazonaws.com/gatk-validation/enrichment_kit_3.bed

This is the bed file used as validate_regions which was created from the intersection of callable_regions of the 2x151, 2x101 and 2x76 data sets. https://s3-eu-west-1.amazonaws.com/gatk-validation/ExomeNA12878_intersection.bed

Are you using the 1.1.2 release version? I will check that out, too.

gabeng commented 5 years ago

Sergey,

I repeated the experiment using NA12878 300x WGS data from GIAB cut down to chr20:

Variant Caller	gatk-haplotype	gatk-haplotype	freebayes	freebayes
Cycles	2x148	2x101	2x148	2x101
Panel	WGS chr20	WGS chr20	WGS chr20	WGS chr20
Sample	NA12878	NA12878	NA12878	NA12878
SNV TP	66115	66171	66256	66224
SNV FP	42	50	610	568
SNV FN	182	173	41	120
SNV FDR	0.06%	0.08%	0.91%	0.85%
SNV FNR	0.27%	0.26%	0.06%	0.18%
Indel TP	9989	9972	9663	9355
Indel FP	48	71	316	321
Indel FN	25	53	351	670
Indel FDR	0.48%	0.71%	3.17%	3.32%
Indel FNR	0.25%	0.53%	3.51%	6.68%

It is amazing what the homogeneous, unbiased WGS coverage is worth (compared to targeted sequencing). Still, @naumenko-sa, you achieve better performance on your exome data. Would you be able to provide me with the raw data from your exome? I would like to make sure that this really is an issue of target selection.

@chapmanb: Was your Exome comparison derived from real WES data? Could you provide some details about the data sets (coverage, insert size etc.)?

-- Ben

gabeng commented 5 years ago

Sergey,

tools_off: vqsr did not change a single digit in my results. I therefore assume that it was off all along.

naumenko-sa commented 5 years ago

Hi Ben!

Thanks for sharing WGS results. Yes, it is encouraging that WGS improves indel calling that much.

I'm using bcbio1.1.2.

Unfortunately, I'm not allowed to share raw WES data of NA12878. We ruled out vqsr, and you have good sequencing quality. The only explanation of lower FNR/FDR in your experiment, that I might think of, is coverage: you have 100X and in my experiment it was 150X.

Sergey

gabeng commented 5 years ago

Sergey,

the avg coverage in my data set was 148x. So I guess it is the selection of target region (the IDT kit is focused on CCDS almost exclusivley). I was wondering how @chapmanb was able to obtain so excellent results with the IDT Exome kit. I guess it must have been much higher coverage. Anyhow, thank you for sharint your results in comparison.

chapmanb commented 5 years ago

Ben and Sergey; Thanks for this great discussion. You're definitely right that evenness of coverage can help with resolution, especially when you drill in on individual variants. Looking at the variants Ben posted above they appear to be in trickier regions and also right around the cutoffs. I often see this kind of small fluctuation dependent on depth, strand bias or other metrics where you're passing right near the cutoff on one sample set and then failing on the other. This comes from having a standard set of filters, either soft cutoffs or general VQSR training approaches, being not as good as customizing per sample set. On the other side, having a good way to customizing and tuning for each sample set is not easy either.

For the IDT exome capture, I'm afraid I don't have all the details for that in terms of coverage. That was an internal dataset, but I do know that generally IDT capture kits do quite well in validations, primarily due to having more even coverage across target regions. We investigated some and this is due to bait distribution and design, but we focused more on what is captured as kits often have variable targets and we wanted to identify specific regions we needed to see good coverage on.

Sorry this is all more general, but there are a lot of rabbit holes here where I've ended up making practical decisions rather than going all the way down. Hope this is helpful and thanks again for the discussion.

bcbio / bcbio-nextgen

GATK4 performance at higher depth/longer reads / Low freq allele filter #2623