bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
991 stars 354 forks source link

Many false positives in RNA-seq variant calling since introduction of gatk4 #2410

Closed naumenko-sa closed 4 years ago

naumenko-sa commented 6 years ago

Hey guys!

Thanks for the great pipeline!

Has anybody checked the precision of RNA-seq variant calling since the introduction of gatk4? Previously, with gatk 3x after rna-editing filtration it was pretty decent (precision=90-95%).

I have not done the NA12878 test for now, but the new output looks like having a crazy amount of low quality long indels.

Would it be possible to fix gatk4 parameters/filters, or at least turn it off in RNA-seq (to use gatk3.8 instead)?

I can run the NA12878 test if you need it.

Sergey

chapmanb commented 6 years ago

Sergey; Thanks much for testing this. We'd love to have formal testing in place for RNA-seq variant calling; currently we don't have any truth sets and establishing a NA12878-based standard would help a lot to test and improve methods. Are you able to share the test you're using? It would be really useful to understand if we're seeing a regression because of GATK4, and then can figure out how to fix/filter better, or if this is something data specific. We often see low quality long indels on samples that have chimeric issues at the ends of reads that get soft-filtered by bwa and used for calling. Thanks for starting this conversation and testing.

naumenko-sa commented 6 years ago

Hi Brad!

I am linking https://github.com/bcbio/bcbio-nextgen/issues/1502. And Piskol2013 article is also relevant.

I'm still using the same SRR307898 RNA-seq of GM12878, and validating with https://github.com/naumenko-sa/crt/blob/master/crt.validate.sh.

In November 2016 with gatk 3.6, I've got

snp tp-baseline 32610
indels tp-baseline 2893
snp fp 5400
indels fp 694
snp fn 3177640
indels fn 478013

so the precision was SNPs = 32610/(32610+5400) = 86% Indels = 2893/(2893+694) = 80%

Which is fine for RNA-seq, because the bed file used for the analysis was genome-wide, and if we use only well covered regions, we would have higher precision. If we restrict the analysis to protein coding exons only, the result would be

snp tp-baseline 6655
indels tp-baseline 81
snp fp 160
indels fp 75
snp fn 3203595
indels fn 480825

precision SNPs = 6655/(160+6655) = 98% Indels = 81/(81+75) = 52%

with GATK4, same procedure, same input data:

snp tp-baseline 5835
indels tp-baseline 83
snp fp 1022
indels fp 12439
snp fn 3204415
indels fn 480823

precision SNPs = 5835/(1022+5835) = 85% Indels = 83/(83+12439) = 0.7%

I'm ready to participate in figuring out, why gatk4 produces such results, but would it be possible in the meanwhile to introduce tools_off: gatk4 option for rna-seq like in variant2 analysis?

Adding also the config file - no bwa!

details:
- algorithm:
    aligner: star
    variantcaller: gatk-haplotype
  analysis: RNA-seq
  description: NA12878_SRR307898
  files:
  - /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_1.fq.gz
  - /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_2.fq.gz
  genome_build: GRCh37
  metadata:
    batch: NA12878
fc_name: NA12878
upload:
  dir: ../final

Sergey

chapmanb commented 6 years ago

Sergey; Thanks for these numbers, the sensitivity with GATK4 is way down. That's unexpected and I'd love to figure out what is wrong. Thanks also for the pointer to the input data. It's getting a bit dated in terms of sequencing technology (Illumina GAII sequencing) but is a great starting point for building a validation dataset for this, thank you.

Practically, tools_off: [gatk4] should work for RNA-seq variant calling right now, so you could run like that to confirm that it's a GATK3 versus GATK4 difference, rather than something else in the bcbio pipeline. You should also be able to reduce the false negatives by only doing the rtg comparison over the intersection of the GiaB truth regions and exome BED.

I'll work on putting together a validation workflow for this in parallel and we can definitely work to iterate and improve. Thank you again for this.

naumenko-sa commented 6 years ago

Thanks Brad!

Variant calling with gatk3.8 went well, and then it failed during the variant filtration step: it tried to use gatk4 for that. I remember there was a similar error in variant2 right after introducing tools_off: gatk4, but I cannot locate this past issue.

[2018-06-07T11:18Z] multiprocessing: run_rnaseq_ann_filter
[2018-06-07T11:18Z] Annotating /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint.vcf.gz with vcfanno, using /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/config/vcfanno/rnaedit.conf
[2018-06-07T11:18Z] =============================================
[2018-06-07T11:18Z] vcfanno version 0.2.8 [built with go1.8]
[2018-06-07T11:18Z] see: https://github.com/brentp/vcfanno
[2018-06-07T11:18Z] =============================================
[2018-06-07T11:18Z] vcfanno.go:115: found 1 sources from 1 files
[2018-06-07T11:18Z] vcfanno.go:241: annotated 178626 variants in 11.41 seconds (15659.5 / second)
[2018-06-07T11:18Z] tabix index NA12878-joint-annotated-rnaedit.vcf.gz
[2018-06-07T11:18Z] Filter RNA-seq variants.
[2018-06-07T11:18Z] Using GATK jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar
[2018-06-07T11:18Z] Running:
[2018-06-07T11:18Z]     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpJUhZ5s -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar VariantFiltration -R /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint-annotated-rnaedit.vcf.gz --cluster-window-size 35 --cluster-size 3 --filter-expression FS > 30.0 --filter-name FS --filter-expression QD < 2.0 --filter-name QD --output /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpJUhZ5s/NA12878-joint-annotated-rnaedit-filter.vcf.gz
[2018-06-07T11:18Z] Unrecognized option: -U
[2018-06-07T11:18Z] Error: Could not create the Java Virtual Machine.
[2018-06-07T11:18Z] Error: A fatal exception has occurred. Program will exit.
[2018-06-07T11:18Z] Uncaught exception occurred

Could you please fix this? Thanks! Sergey

chapmanb commented 6 years ago

Sergey; Sorry about this issue. We should be using GATK4 here since we use GATK4 to replace more standard steps like filtering and other VCF processing. We were inconsistently passing tools_on between calculating the tool to run and the parameters, causing the bug here. I pushed a fix that should resolve it and correctly use GATK4 so if you update to the latest development and test in place it should hopefully wokr correctly now. Thanks again for debugging and letting us know about the issue.

naumenko-sa commented 6 years ago

Hi Brad!

Unfortunately, I've go the same error after the upgrade (also with a fresh run - I cleaned up work dir).

bcbio_nextgen.py --version
1.1.0a0

Error:

[2018-06-10T11:54Z] vcfanno version 0.2.8 [built with go1.8]
[2018-06-10T11:54Z] see: https://github.com/brentp/vcfanno
[2018-06-10T11:54Z] =============================================
[2018-06-10T11:54Z] vcfanno.go:115: found 1 sources from 1 files
[2018-06-10T11:54Z] vcfanno.go:143: using 2 worker threads to decompress query file
[2018-06-10T11:54Z] vcfanno.go:241: annotated 178620 variants in 3.29 seconds (54332.3 / second)
[2018-06-10T11:54Z] tabix index NA12878-joint-annotated-rnaedit.vcf.gz
[2018-06-10T11:54Z] Filter RNA-seq variants.
[2018-06-10T11:54Z] Using GATK jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar
[2018-06-10T11:54Z] Running:
[2018-06-10T11:54Z]     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpAfBYlX -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar VariantFiltration -R /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint-annotated-rnaedit.vcf.gz --cluster-window-size 35 --cluster-size 3 --filter-expression FS > 30.0 --filter-name FS --filter-expression QD < 2.0 --filter-name QD --output /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpAfBYlX/NA12878-joint-annotated-rnaedit-filter.vcf.gz
[2018-06-10T11:54Z] Unrecognized option: -U
[2018-06-10T11:54Z] Error: Could not create the Java Virtual Machine.
[2018-06-10T11:54Z] Error: A fatal exception has occurred. Program will exit.
[2018-06-10T11:54Z] Uncaught exception occurred

SN

chapmanb commented 6 years ago

Sergey; Sorry about the continued problems. That's strange, as I can't reproduce your error with this fix in place (but could previously). I noticed that you had an older version of GATK (4.0.1.2) where we should have 4.0.3.0. Is it possible something went wrong during the upgrade and we don't have the latest 1.1.0a0? What does provenance/programs.txt report for the git hash for bcbio-nextgen?

If it's a hash past the fix, could you share your YAML file and I can work to reproduce? Thanks much for the help debugging this.

naumenko-sa commented 6 years ago

Hi Brad!

Thanks for debugging this!

I've upgraded with -u development one more time, then upgraded tools. Unfortunately, I still have the error. configuration:

details:
- algorithm:
    aligner: star
    variantcaller: gatk-haplotype
    tools_off:
    - gatk4
  analysis: RNA-seq
  description: NA12878_SRR307898
  files:
  - /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_1.fq.gz
  - /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_2.fq.gz
  genome_build: GRCh37
  metadata:
    batch: NA12878
fc_name: NA12878
upload:
  dir: ../final

programs:

bcbio-nextgen,1.1.0a0
bcbio-variation,0.2.6
gatk,3.8
gatk4,4.0.3.0

Error:

[2018-06-11T20:58Z] Filter RNA-seq variants.
[2018-06-11T20:58Z] Using GATK jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
[2018-06-11T20:58Z] Running:
[2018-06-11T20:58Z]     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms500m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmp4AJOIY -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar VariantFiltration -R /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint-annotated-rnaedit.vcf.gz --cluster-window-size 35 --cluster-size 3 --filter-expression FS > 30.0 --filter-name FS --filter-expression QD < 2.0 --filter-name QD --output /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmp4AJOIY/NA12878-joint-annotated-rnaedit-filter.vcf.gz
[2018-06-11T20:58Z] Unrecognized option: -U
[2018-06-11T20:58Z] Error: Could not create the Java Virtual Machine.
[2018-06-11T20:58Z] Error: A fatal exception has occurred. Program will exit.
[2018-06-11T20:58Z] Uncaught exception occurred
chapmanb commented 6 years ago

Sergey; I'm sorry about these continued problems, I have no idea why the fixed code is not working and keep trying to replicate without any luck. So I'm very confused. Could you do me a favor and look at the the bcbio/variation/rnaseq.py reference, which should appear just before the error you posted in your log/bcbio-nextgen-debug.log file? It would be helpful to confirm the update is happening and you have the fixed code:

https://github.com/bcbio/bcbio-nextgen/commit/37b122ae15018772b270909d3f4a743eb6bbceee

Sorry I wish I had some other explanation other than "it should work" but hopefully this helps identify what is going on.

naumenko-sa commented 6 years ago

Hi Brad!

Thanks for you time - the problem is resolved. Maybe the old script was cached somewhere in our system. I'm not sure why it did not work after the upgrade. Sorry about your time spend on that. After a couple of upgrades it works.

The validation results with gatk3.8 (same data):

Metrics gatk3.6-Nov2016 gatk4 gatk3.8
snp tp-baseline 6655 5835 6658
snp fp 186 160 1022 89
SNP FDR 2% 15% 1.3%
indels tp-baseline 81 83 89
indels fp 75 12439 168
Indels FDR 48% 99% 65%

Now we are doing pretty well in SNP calling with RNA-seq (given the limited coverage of RNAseq). I would not recommend to call indels with RNA-seq, but SNP calling precision (where RNA-seq has coverage) is comparable to WES.

I will keep this issue open and will try to investigate why gatk4 is doing so badly in RNA-seq SNP calling.

Sergey

naumenko-sa commented 5 years ago

Hey @chapmanb and @roryk !

I'm still puzzled by the evolution of RNA-seq small variant calling precision.

Below I reviewed all validations I had for RNA-seq, making sure I am using the same bed file, same filters, same validation procedure.

https://github.com/naumenko-sa/crt/tree/master/validation

Please let me know if you have any ideas where to dig further to find the cause of this decline of precision/sensitivity: the best result is from 2016 with bcbio 1.0.0. I have all the data - vcfs for FP, FN, TP.

One point might be the old data (76bp Illumina GAII), i.e. tools got (implicitly?) optimized for ~100 bp reads widely used during last 5 years. I used that to be compatible with the only article in the field (Piskol 2013). Let me know if you aware of better RNA-seq dataset for NA12878.

Searching GATK website did not bring much result: they acknowledge they don't have validation of gatk4. Also, they switched to WDL workflow. https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels/blob/master/rna-germline-variant-calling.wdl.

There is bcbio-related effect: with the same gatk3.8 in bcbio 1.1.3 for SNPs is FDR is 8% while in 1.0.9 it is 4%.

Sergey

roryk commented 5 years ago

Thanks Sergey,

Thanks for the detective work. I don't have any great suggestions beyond what you've looked at. For the last point, between 1.0.9 and 1.1.14 there was a bunch of work on the variant calling which may have inadvertently made it worse. It got hooked into the the parallelization framework to call on regions, which I don't get why that would matter. I also added filtering out variants close to splice junctions but that should lower the FDR not raise it. Is the STAR version the same? STAR has undergone a bunch of changes, but I'm not sure if that falls into the same time window. I have been thinking that implementing STAR 2-pass might be helpful when calling variants on RNA-seq data, but I haven't done it yet.

TManning7 commented 5 years ago

Hi there,

I'm a budding young Master's student who's supervisor is impressed by the work the BC Bio team has been doing. As far as I know, there are still no truth sets for RNA-sea data, so I'm wondering how I would go about validating the precision of my own pipeline, as seems to be happening here for BC Bio.

You'll have to forgive my ignorance if this seems a silly question, I haven't been working with bioinformatics for very long and it's sometimes hard for me to parse all of the information that's out there. Could I use some combination of existing truth sets in the platinum genomes and exomes to help me assess my own pipeline?

I really appreciate your time and understand that it might not be a priority to answer my question here, in-case it's not the best place to ask it. Keep up the good work!

roryk commented 5 years ago

Hi @TManning7!

You are right that there aren't great truth sets for RNA-seq data-- for a while folks were comparing to qPCR sets and what not, but then those are assuming the qPCR is correct. For RNA-seq data, looking at expression, we use simulated data since you can generate an actual truth set. Here is a simple example dataset we and other groups often test against:

https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4119

That would be a good place to start if you are looking to see how well your pipeline is doing on RNA-seq data. There have been a few publications that have used it as well.

Hope that helps some!

roryk commented 5 years ago

Oh, I just realized this is in the RNA-seq variant calling thread. If you're looking to test RNA-seq variant calls, Sergey has a really nice writeup on an old NA12878 dataset: https://github.com/naumenko-sa/crt/tree/master/validation

TManning7 commented 5 years ago

Thank you so much!

roryk commented 5 years ago

Happy to help! We have a lot of nice short course materials here https://github.com/hbctraining/Training-modules that might be useful, if you are working on doing RNA-seq expression type experiments. I think they are an awesome introduction to doing analyses. We also have longer course materials here: https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course that is more in depth than the small workshops. Good luck!

komalsrathi commented 4 years ago

Hello everyone,

Can someone advise me if it is still recommended to use GATK3.8 for RNA-seq variant calling instead of GATK4?

Thanks Komal

naumenko-sa commented 4 years ago

Hi!

Yes, I'm currently refreshing validations of it and even can't get gatk4 to call variants. Follow the user story here https://bcbio-nextgen.readthedocs.io/en/latest/contents/rnaseq_variants.html

S.

naumenko-sa commented 4 years ago

closing, following up in #3161