bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
988 stars 354 forks source link

Skipping vcf2db.py? #3091

Closed tathey closed 4 years ago

tathey commented 4 years ago

I'm running the pipeline on a practice AshkenaziFam dataset before trying it on my own data. I'm having trouble getting the AshkenaziFam-ensemble.db file at the end of my run. Previously, I was running an older version of the pipeline where VEP wasn't working properly (version 1.1.6). When I was running this version the AshkenaziFam-ensemble.db file was being created no problem. When I updated to the latest version so that VEP would run, the AshkenaziFam-ensemble.db file was suddenly no longer being created. The other files are still there (AshkenaziFam-freebayes.db, etc), just the ensemble one is missing. Everytime I run the pipeline I run it as a brand new instance in a new folder, so it doesn't have anything to do with the program thinking files are already there.

I tried to go through my new log and command files to compare it to the old ones to see what the differences were. My old command file seems to run the following lines:

[2019-11-14T04:49Z] /home/atheyt@med.umanitoba.ca/local/share/bcbio/anaconda/bin/vcfanno -p 7 -lua /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-nomultiallelic-annotated-cre.vcfanno-combine.lua /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-nomultiallelic-annotated-cre.vcfanno-combine.conf /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-nomultiallelic.vcf.gz | sed -e 's/Number=A/Number=1/g' | bgzip -c > /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/bcbiotx/tmp8ymm4ngf/AshkenaziFam-ensemble-nomultiallelic-annotated-cre.vcfanno.vcf.gz [2019-11-14T05:17Z] /home/atheyt@med.umanitoba.ca/local/share/bcbio/anaconda/bin/tabix -f -p vcf /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/bcbiotx/tmpe1hdi82u/AshkenaziFam-ensemble-nomultiallelic-annotated-cre.vcfanno.vcf.gz [2019-11-14T05:17Z] /home/atheyt@med.umanitoba.ca/local/bin/vcf2db.py /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-nomultiallelic-annotated-cre.vcfanno.vcf.gz /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-nomultiallelic.ped /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test11/AshkenaziFam/work/bcbiotx/tmpi8zpzu2n/AshkenaziFam-ensemble.db

While the newer log file only has: [2020-02-01T07:08Z] /home/atheyt@med.umanitoba.ca/local/share/bcbio/anaconda/bin/vcfanno -p 7 -lua /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test14/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-annotated-cre.vcfanno-combine.lua /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test14/AshkenaziFam/work/gemini/AshkenaziFam-ensemble-annotated-cre.vcfanno-combine.conf /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test14/AshkenaziFam/work/gemini/AshkenaziFam-ensemble.vcf.gz | bgzip -c > /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test14/AshkenaziFam/work/bcbiotx/tmpzjahisfy/AshkenaziFam-ensemble-annotated-cre.vcfanno.vcf.gz [2020-02-01T07:37Z] /home/atheyt@med.umanitoba.ca/local/share/bcbio/anaconda/bin/tabix -f -p vcf /home/atheyt@med.umanitoba.ca/Pipeline_Tests/Test14/AshkenaziFam/work/bcbiotx/tmpeij5339x/AshkenaziFam-ensemble-annotated-cre.vcfanno.vcf.gz

It seems to skip over the vcf2db.py command. (I'm assuming the slight difference in the naming is because VEP is now running.) I tried to run vcf2db on my own, but the .ped file is missing and I can't figure out where this file is created.

My log file seems to indicate that gemini isn't running, but I also can't find the command where gemini is supposed to run and I have no idea why it's only skipping over gemini the once and not for the other instances as well.

[2020-02-01T07:08Z] multiprocessing: prep_gemini_db [2020-02-01T07:08Z] Not running gemini, no samples with variants found: 151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008_posiSrt_markDup, 151002_7001448_0359_AC7F6GANXX_Sample_HG003-EEogPU_v02-KIT-Av5_TCTTCACA_L008_posiSrt_markDup, 151002_7001448_0359_AC7F6GANXX_Sample_HG004-EEogPU_v02-KIT-Av5_CCGAAGTA_L008_posiSrt_markDup

I also tried updating Gemini using "bcbio_nextgen.py upgrade --datatarget gemini" but that didn't seem to help.

Any ideas? I've uploaded the latest log and command files.

bcbio-nextgen.log bcbio-nextgen-commands.log

roryk commented 4 years ago

Thank you, could you upload the ../config/whateveryouarerunningwith.yaml configuration file as well, so we can take a look? Sorry for the slow response, we've been busy/away the past few days.

naumenko-sa commented 4 years ago

Hi @tathey !

Thanks for reporting! So the issue is that gemini.db is not created for ensemble calling. There are no errors in the bcbio log itself.

Could you add your yaml and confirm you bcbio and vep versions, and also your annotation configuration (you are using vcfanno conf from cre, that could have been updated since 1.1.6)?

bcbio_nextgen.py --version
vep

Sergey

tathey commented 4 years ago

Sorry for taking so long to respond. I was away last week.

Here is the yaml file AshkenaziFam.yaml.txt (I had to add .txt to the name in order for github to let me upload it). I'm running bcbio version 1.2.0a, ensembl 99.d3e7d31, ensembl-funcgen 99.0832337, ensembl-io 99.441b05b, ensembl-variation 99.642e1cd, ensembl-vep 99.0.

There don't seem to be any errors in the bcbio log. It just says that it's skipping the gemini calling step because it couldn't find samples with variants. The Ashkenazi-ensemble.vcf.gz does have variants in it though.

I was talking to Dennis at CCM. He said that they use bcbio version 1.1.2 for their scripts, so I'm thinking the vcfanno conf would be for that. He also pointed out that my file names are 151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup but the error is calling them 151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008_posiSrtmarkDup. He was wondering if the change from "." to "" is what's causing the error. I had also run a different file through the pipeline that had a simple name that ended in .fastq.gz. I got the same error and it also changed the name to _fastq_gz

Thanks again for looking at this. I should be quicker at responding now that I'm back from vacation.

Taryn

naumenko-sa commented 4 years ago

Hi Taryn @tathey !

Thanks for the follow up!

I'd also suggest not to use complex file names and sample names, as these strings go into bam files (readgroup) and into the gemini database. Avoiding dots in the sample name and @ (and other special characters) in the filenames might help as well.

currently I'm running it with:

  analysis: variant2
  description: HG004_NA24143_mother
  files:
  - /path/to/ashkenazim_trio/input/HG004_NA24143_mother.bam.chr22.bam

Hopefully, it would help to resolve, let us know otherwise. I am also running your config with chr22 - will see soon what I get.

Some more thoughts:

  1. Ashkehazim trio is a very important use case for bcbio, since it is a user entry point for Clinical and Clinical Research variant calling users. If you'd like to contribute the user story, that would be fantastic. We would include it into the docs (just write up what you are doing as an end user: NIST article, where did you get input files, bcbio config, runnig bcbio, results, validation results).
  2. Would you consider running analyses with hg38? Currently we support gnomad v3 which are native calls of gnomad genomes in hg38. Exomes are still lifted over from grch37 to hg38, but having native hg38 genome calls is a game changer.
  3. What is your opinion on using 4 variant callers and ensembl calling? Several years ago when calling algorithms were more or less on par it made sense, but now you may just use gatk4, and consider all variants, not only PASS filters if you would like to pull 'junk' variants.
  4. I'd like to review https://github.com/naumenko-sa/cre and move vcfanno.conf and some scripts to bcbio to have an excel of prioritized variants among bcbio output files. Any suggestions how it should be done for the benefit of a wider user base is appreciated.

Sergey

naumenko-sa commented 4 years ago

Hi @tathey !

Finally, I was able to reproduce the issue, and found that since Oct 3, 2019 gemini.db is not created for ensemble: https://github.com/bcbio/bcbio-nextgen/commit/8d60e8243ed112f0ac1f558a446cb08c786c0c1c#diff-9d644e8df7caebccd9435edfa4ad2484

@chapmanb could you please clarify this change? Was it because of issues with ensemble.db in somatic calling (vardict+mutect2+strelka2)?

In germline ensemble gemini db worked well with ensemble: gatk/samtools/freebayes/platypus.

Isn't it a redundant switch?

Would it be an option for the original issue not to switch on gemini, or not to use ensemble?

@tathey: on the practical side, if you are establishing a pipeline, would you consider for germline calling just to use gatk and have gemini.db for it? Include all calls, if you want to investigate 'trash' ones. Combining calls from multiple callers is tricky, and the gain is questionable, so we won't be able to prioritize patching ensemble/ensemble+gemini calling if you hit some issues in the future.

Sergey

chapmanb commented 4 years ago

Sergey; Sorry about the confusion here. The issue with ensemble in general is that it pulls together calls from multiple callers. This was creating issues where GEMINI was unhappy due to inconsistency of tags between different caller INFO fields and the headers.

I don't remember the exact motivator and callers that triggered it, but it's a general issue since we don't try to harmonize ensemble calls. So what we tried to do is allow users to turn on gemini so they could get it for individual callers, but avoid the intermittent errors from vcf2db that were due to ensemblization.

I agree with your advice that sticking to a single caller for looking into gemini databases is the best approach. Hope this helps.

tathey commented 4 years ago

Thank you both for all of your help. This explains a lot, since I couldn't figure out why my previous download was creating the ensemble.db file and next download was not.

I was able to run vcf2db.py by creating my own ensemble.ped file (with the help of CCM), but the cre step failed to run because it could not find CADD scores. Previously, I was told to stop running the dbNSFP part of the pipeline because the pipeline kept stalling at this step. I'm pretty sure this is where CADD scores get added, so I'm going to work towards getting this part working.

I've never looked into the stats of what variants get called by each caller, but I'm hesitant to hold up one single variant caller as a golden standard to use above all others. Sergey, you mentioned that these callers are no longer on par, so it sounds like your opinion is that GATK is giving us the most accurate calls out of all of the callers. I know that in the past (when I used to work on bacteria) I was having trouble automatically combining calls between Freebayes and GATK, so I can see how only using one caller would eliminate a huge complication.

naumenko-sa commented 4 years ago

Thanks, Brad!

Once @hackdna unfreezes documentation, we will clarify there how ensemble&gemini interact.

@tathey:

FreeBayes had inferior results in indels in WES (high FDR, FNR) and running FreeBayes for WGS was problematic (gatk is much faster).

~2 years ago I validated gatk4 vs ensemble2of4 and other callers and the conclusion was that ensemble did give some +39 variants of 38K target, but also was much worse in indels (that is the difference of gatk4 vs other callers - they are worse in indel calling and that makes ensemble worse in indel calling).

If you do clinical or clinical research variant calling, you should validate NA12878 and/or Askenazi trio yourself. If you do so with bcbio, we'd be happy to publish validation results in the docs: https://github.com/bcbio/bcbio-nextgen/issues/3038

Sergey

tathey commented 4 years ago

Thanks for your in-depth responses. I really appreciate them.

I tried adding dbNSFP back into the configuration and sure enough it failed again.

I think that this current Gemini issue is resolved, so I will close it and open a new issue about dbNSFP.

Taryn