Closed naumenko-sa closed 4 years ago
Sergey; Thanks much for testing this. We'd love to have formal testing in place for RNA-seq variant calling; currently we don't have any truth sets and establishing a NA12878-based standard would help a lot to test and improve methods. Are you able to share the test you're using? It would be really useful to understand if we're seeing a regression because of GATK4, and then can figure out how to fix/filter better, or if this is something data specific. We often see low quality long indels on samples that have chimeric issues at the ends of reads that get soft-filtered by bwa and used for calling. Thanks for starting this conversation and testing.
Hi Brad!
I am linking https://github.com/bcbio/bcbio-nextgen/issues/1502. And Piskol2013 article is also relevant.
I'm still using the same SRR307898 RNA-seq of GM12878, and validating with https://github.com/naumenko-sa/crt/blob/master/crt.validate.sh.
In November 2016 with gatk 3.6, I've got
snp tp-baseline 32610
indels tp-baseline 2893
snp fp 5400
indels fp 694
snp fn 3177640
indels fn 478013
so the precision was SNPs = 32610/(32610+5400) = 86% Indels = 2893/(2893+694) = 80%
Which is fine for RNA-seq, because the bed file used for the analysis was genome-wide, and if we use only well covered regions, we would have higher precision. If we restrict the analysis to protein coding exons only, the result would be
snp tp-baseline 6655
indels tp-baseline 81
snp fp 160
indels fp 75
snp fn 3203595
indels fn 480825
precision SNPs = 6655/(160+6655) = 98% Indels = 81/(81+75) = 52%
with GATK4, same procedure, same input data:
snp tp-baseline 5835
indels tp-baseline 83
snp fp 1022
indels fp 12439
snp fn 3204415
indels fn 480823
precision SNPs = 5835/(1022+5835) = 85% Indels = 83/(83+12439) = 0.7%
I'm ready to participate in figuring out, why gatk4 produces such results, but would it be possible in the meanwhile to introduce tools_off: gatk4 option for rna-seq like in variant2 analysis?
Adding also the config file - no bwa!
details:
- algorithm:
aligner: star
variantcaller: gatk-haplotype
analysis: RNA-seq
description: NA12878_SRR307898
files:
- /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_1.fq.gz
- /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_2.fq.gz
genome_build: GRCh37
metadata:
batch: NA12878
fc_name: NA12878
upload:
dir: ../final
Sergey
Sergey; Thanks for these numbers, the sensitivity with GATK4 is way down. That's unexpected and I'd love to figure out what is wrong. Thanks also for the pointer to the input data. It's getting a bit dated in terms of sequencing technology (Illumina GAII sequencing) but is a great starting point for building a validation dataset for this, thank you.
Practically, tools_off: [gatk4]
should work for RNA-seq variant calling right now, so you could run like that to confirm that it's a GATK3 versus GATK4 difference, rather than something else in the bcbio pipeline. You should also be able to reduce the false negatives by only doing the rtg comparison over the intersection of the GiaB truth regions and exome BED.
I'll work on putting together a validation workflow for this in parallel and we can definitely work to iterate and improve. Thank you again for this.
Thanks Brad!
Variant calling with gatk3.8 went well, and then it failed during the variant filtration step: it tried to use gatk4 for that. I remember there was a similar error in variant2 right after introducing tools_off: gatk4, but I cannot locate this past issue.
[2018-06-07T11:18Z] multiprocessing: run_rnaseq_ann_filter
[2018-06-07T11:18Z] Annotating /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint.vcf.gz with vcfanno, using /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/config/vcfanno/rnaedit.conf
[2018-06-07T11:18Z] =============================================
[2018-06-07T11:18Z] vcfanno version 0.2.8 [built with go1.8]
[2018-06-07T11:18Z] see: https://github.com/brentp/vcfanno
[2018-06-07T11:18Z] =============================================
[2018-06-07T11:18Z] vcfanno.go:115: found 1 sources from 1 files
[2018-06-07T11:18Z] vcfanno.go:241: annotated 178626 variants in 11.41 seconds (15659.5 / second)
[2018-06-07T11:18Z] tabix index NA12878-joint-annotated-rnaedit.vcf.gz
[2018-06-07T11:18Z] Filter RNA-seq variants.
[2018-06-07T11:18Z] Using GATK jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar
[2018-06-07T11:18Z] Running:
[2018-06-07T11:18Z] java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpJUhZ5s -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar VariantFiltration -R /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint-annotated-rnaedit.vcf.gz --cluster-window-size 35 --cluster-size 3 --filter-expression FS > 30.0 --filter-name FS --filter-expression QD < 2.0 --filter-name QD --output /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpJUhZ5s/NA12878-joint-annotated-rnaedit-filter.vcf.gz
[2018-06-07T11:18Z] Unrecognized option: -U
[2018-06-07T11:18Z] Error: Could not create the Java Virtual Machine.
[2018-06-07T11:18Z] Error: A fatal exception has occurred. Program will exit.
[2018-06-07T11:18Z] Uncaught exception occurred
Could you please fix this? Thanks! Sergey
Sergey;
Sorry about this issue. We should be using GATK4 here since we use GATK4 to replace more standard steps like filtering and other VCF processing. We were inconsistently passing tools_on
between calculating the tool to run and the parameters, causing the bug here. I pushed a fix that should resolve it and correctly use GATK4 so if you update to the latest development and test in place it should hopefully wokr correctly now. Thanks again for debugging and letting us know about the issue.
Hi Brad!
Unfortunately, I've go the same error after the upgrade (also with a fresh run - I cleaned up work dir).
bcbio_nextgen.py --version
1.1.0a0
Error:
[2018-06-10T11:54Z] vcfanno version 0.2.8 [built with go1.8]
[2018-06-10T11:54Z] see: https://github.com/brentp/vcfanno
[2018-06-10T11:54Z] =============================================
[2018-06-10T11:54Z] vcfanno.go:115: found 1 sources from 1 files
[2018-06-10T11:54Z] vcfanno.go:143: using 2 worker threads to decompress query file
[2018-06-10T11:54Z] vcfanno.go:241: annotated 178620 variants in 3.29 seconds (54332.3 / second)
[2018-06-10T11:54Z] tabix index NA12878-joint-annotated-rnaedit.vcf.gz
[2018-06-10T11:54Z] Filter RNA-seq variants.
[2018-06-10T11:54Z] Using GATK jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar
[2018-06-10T11:54Z] Running:
[2018-06-10T11:54Z] java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpAfBYlX -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.1.2-0/gatk-package-4.0.1.2-local.jar VariantFiltration -R /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint-annotated-rnaedit.vcf.gz --cluster-window-size 35 --cluster-size 3 --filter-expression FS > 30.0 --filter-name FS --filter-expression QD < 2.0 --filter-name QD --output /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmpAfBYlX/NA12878-joint-annotated-rnaedit-filter.vcf.gz
[2018-06-10T11:54Z] Unrecognized option: -U
[2018-06-10T11:54Z] Error: Could not create the Java Virtual Machine.
[2018-06-10T11:54Z] Error: A fatal exception has occurred. Program will exit.
[2018-06-10T11:54Z] Uncaught exception occurred
SN
Sergey;
Sorry about the continued problems. That's strange, as I can't reproduce your error with this fix in place (but could previously). I noticed that you had an older version of GATK (4.0.1.2) where we should have 4.0.3.0. Is it possible something went wrong during the upgrade and we don't have the latest 1.1.0a0? What does provenance/programs.txt
report for the git hash for bcbio-nextgen?
If it's a hash past the fix, could you share your YAML file and I can work to reproduce? Thanks much for the help debugging this.
Hi Brad!
Thanks for debugging this!
I've upgraded with -u development one more time, then upgraded tools. Unfortunately, I still have the error. configuration:
details:
- algorithm:
aligner: star
variantcaller: gatk-haplotype
tools_off:
- gatk4
analysis: RNA-seq
description: NA12878_SRR307898
files:
- /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_1.fq.gz
- /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/input/NA12878_SRR307898_2.fq.gz
genome_build: GRCh37
metadata:
batch: NA12878
fc_name: NA12878
upload:
dir: ../final
programs:
bcbio-nextgen,1.1.0a0
bcbio-variation,0.2.6
gatk,3.8
gatk4,4.0.3.0
Error:
[2018-06-11T20:58Z] Filter RNA-seq variants.
[2018-06-11T20:58Z] Using GATK jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
[2018-06-11T20:58Z] Running:
[2018-06-11T20:58Z] java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xms500m -Xmx3500m -XX:+UseSerialGC -Djava.io.tmpdir=/hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmp4AJOIY -U LENIENT_VCF_PROCESSING --read_filter BadCigar --read_filter NotPrimaryAlignment -jar /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar VariantFiltration -R /hpf/largeprojects/ccmbio/naumenko/tools/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa -V /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/joint/gatk-haplotype-joint/NA12878/NA12878-joint-annotated-rnaedit.vcf.gz --cluster-window-size 35 --cluster-size 3 --filter-expression FS > 30.0 --filter-name FS --filter-expression QD < 2.0 --filter-name QD --output /hpf/largeprojects/ccmbio/naumenko/validation/rnaseq/NA12878/work/bcbiotx/tmp4AJOIY/NA12878-joint-annotated-rnaedit-filter.vcf.gz
[2018-06-11T20:58Z] Unrecognized option: -U
[2018-06-11T20:58Z] Error: Could not create the Java Virtual Machine.
[2018-06-11T20:58Z] Error: A fatal exception has occurred. Program will exit.
[2018-06-11T20:58Z] Uncaught exception occurred
Sergey;
I'm sorry about these continued problems, I have no idea why the fixed code is not working and keep trying to replicate without any luck. So I'm very confused. Could you do me a favor and look at the the bcbio/variation/rnaseq.py
reference, which should appear just before the error you posted in your log/bcbio-nextgen-debug.log
file? It would be helpful to confirm the update is happening and you have the fixed code:
https://github.com/bcbio/bcbio-nextgen/commit/37b122ae15018772b270909d3f4a743eb6bbceee
Sorry I wish I had some other explanation other than "it should work" but hopefully this helps identify what is going on.
Hi Brad!
Thanks for you time - the problem is resolved. Maybe the old script was cached somewhere in our system. I'm not sure why it did not work after the upgrade. Sorry about your time spend on that. After a couple of upgrades it works.
The validation results with gatk3.8 (same data):
Metrics | gatk3.6-Nov2016 | gatk4 | gatk3.8 |
---|---|---|---|
snp tp-baseline | 6655 | 5835 | 6658 |
snp fp 186 | 160 | 1022 | 89 |
SNP FDR | 2% | 15% | 1.3% |
indels tp-baseline | 81 | 83 | 89 |
indels fp | 75 | 12439 | 168 |
Indels FDR | 48% | 99% | 65% |
Now we are doing pretty well in SNP calling with RNA-seq (given the limited coverage of RNAseq). I would not recommend to call indels with RNA-seq, but SNP calling precision (where RNA-seq has coverage) is comparable to WES.
I will keep this issue open and will try to investigate why gatk4 is doing so badly in RNA-seq SNP calling.
Sergey
Hey @chapmanb and @roryk !
I'm still puzzled by the evolution of RNA-seq small variant calling precision.
Below I reviewed all validations I had for RNA-seq, making sure I am using the same bed file, same filters, same validation procedure.
https://github.com/naumenko-sa/crt/tree/master/validation
Please let me know if you have any ideas where to dig further to find the cause of this decline of precision/sensitivity: the best result is from 2016 with bcbio 1.0.0. I have all the data - vcfs for FP, FN, TP.
One point might be the old data (76bp Illumina GAII), i.e. tools got (implicitly?) optimized for ~100 bp reads widely used during last 5 years. I used that to be compatible with the only article in the field (Piskol 2013). Let me know if you aware of better RNA-seq dataset for NA12878.
Searching GATK website did not bring much result: they acknowledge they don't have validation of gatk4. Also, they switched to WDL workflow. https://github.com/gatk-workflows/gatk3-4-rnaseq-germline-snps-indels/blob/master/rna-germline-variant-calling.wdl.
There is bcbio-related effect: with the same gatk3.8 in bcbio 1.1.3 for SNPs is FDR is 8% while in 1.0.9 it is 4%.
Sergey
Thanks Sergey,
Thanks for the detective work. I don't have any great suggestions beyond what you've looked at. For the last point, between 1.0.9 and 1.1.14 there was a bunch of work on the variant calling which may have inadvertently made it worse. It got hooked into the the parallelization framework to call on regions, which I don't get why that would matter. I also added filtering out variants close to splice junctions but that should lower the FDR not raise it. Is the STAR version the same? STAR has undergone a bunch of changes, but I'm not sure if that falls into the same time window. I have been thinking that implementing STAR 2-pass might be helpful when calling variants on RNA-seq data, but I haven't done it yet.
Hi there,
I'm a budding young Master's student who's supervisor is impressed by the work the BC Bio team has been doing. As far as I know, there are still no truth sets for RNA-sea data, so I'm wondering how I would go about validating the precision of my own pipeline, as seems to be happening here for BC Bio.
You'll have to forgive my ignorance if this seems a silly question, I haven't been working with bioinformatics for very long and it's sometimes hard for me to parse all of the information that's out there. Could I use some combination of existing truth sets in the platinum genomes and exomes to help me assess my own pipeline?
I really appreciate your time and understand that it might not be a priority to answer my question here, in-case it's not the best place to ask it. Keep up the good work!
Hi @TManning7!
You are right that there aren't great truth sets for RNA-seq data-- for a while folks were comparing to qPCR sets and what not, but then those are assuming the qPCR is correct. For RNA-seq data, looking at expression, we use simulated data since you can generate an actual truth set. Here is a simple example dataset we and other groups often test against:
https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4119
That would be a good place to start if you are looking to see how well your pipeline is doing on RNA-seq data. There have been a few publications that have used it as well.
Hope that helps some!
Oh, I just realized this is in the RNA-seq variant calling thread. If you're looking to test RNA-seq variant calls, Sergey has a really nice writeup on an old NA12878 dataset: https://github.com/naumenko-sa/crt/tree/master/validation
Thank you so much!
Happy to help! We have a lot of nice short course materials here https://github.com/hbctraining/Training-modules that might be useful, if you are working on doing RNA-seq expression type experiments. I think they are an awesome introduction to doing analyses. We also have longer course materials here: https://github.com/hbctraining/In-depth-NGS-Data-Analysis-Course that is more in depth than the small workshops. Good luck!
Hello everyone,
Can someone advise me if it is still recommended to use GATK3.8 for RNA-seq variant calling instead of GATK4?
Thanks Komal
Hi!
Yes, I'm currently refreshing validations of it and even can't get gatk4 to call variants. Follow the user story here https://bcbio-nextgen.readthedocs.io/en/latest/contents/rnaseq_variants.html
S.
closing, following up in #3161
Hey guys!
Thanks for the great pipeline!
Has anybody checked the precision of RNA-seq variant calling since the introduction of gatk4? Previously, with gatk 3x after rna-editing filtration it was pretty decent (precision=90-95%).
I have not done the NA12878 test for now, but the new output looks like having a crazy amount of low quality long indels.
Would it be possible to fix gatk4 parameters/filters, or at least turn it off in RNA-seq (to use gatk3.8 instead)?
I can run the NA12878 test if you need it.
Sergey