bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
994 stars 354 forks source link

feature request: start from vcf #2068

Closed matthdsm closed 5 years ago

matthdsm commented 7 years ago

Hi Brad,

We're starting to feel some storage constraints through our usage of bcbio. We'd like to be able to reannotate the resulting vcf files using the bcbio framework, but without having to keep all of the other intermediate files around.

Would it be possible to let the pipeline start from vcf files and perform just the annotations?

Thanks M

chapmanb commented 7 years ago

Matthias; We currently have some support for this, using vrn_file instead of files in the input configuration. Here's an example from the tests of including a pre-built VCF in a validation:

https://github.com/chapmanb/bcbio-nextgen/blob/a6676f45cafbe9f0ca24ce798a61c27ede36a36c/tests/data/automated/run_info-bam.yaml#L11-L21

I don't use this extensively so am not sure if it does everything you need but that would be the best place to get started. I'd like to expand this to support pre-built gVCF inclusion in joint calling workflows, so let us know if you run into issues.

matthdsm commented 7 years ago

Allright, sounds great! Happy to hear there's already some support. We're starting from pre-built GVCF's so it'd be a huge help if this would work. I'll give it a shot and get back to you if I run into any issues.

Cheers and thanks for the foresight. M

matthdsm commented 6 years ago

Hi Brad,

I've finally come around to testing this feature (better late than never I suppose). So here's the case. I've got some gvcf's from previous runs, which I'm now trying to combine into one multisample gemini db. I'd like to get bcbio to

To do this, I'm using the following config with bcbio v1.0.9:

fc_name: family1
upload:   
  dir: ../final
globals:
  analysis_regions: /genomes/Hsapiens/hg38/regions/RefSeqExomeAndPanels_20171003.bed
  coverage_regions: /genomes/Hsapiens/hg38/regions/RefSeqExomeAndPanels_20171003.bed
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description: D1712121
    vrn_file: ../input/D1712121-gatk-haplotype.vcf.gz
    metadata:
      batch: family1
    algorithm:
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [eog,dbscsnv,dbnsfp]
      tools_on:
        - vep_splicesite_annotations
  - analysis: variant2
    genome_build: hg38
    description: D1611710
    vrn_file: ../input/D1611710-gatk-haplotype.vcf.gz
    metadata:
      batch: family1
    algorithm:
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [eog,dbscsnv,dbnsfp]
      tools_on:
        - vep_splicesite_annotations
  - analysis: variant2
    genome_build: hg38
    description: D1712122
    vrn_file: ../input/D1712122-gatk-haplotype.vcf.gz
    metadata:
      batch: family1
    algorithm:
      variant_regions: analysis_regions
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [eog,dbscsnv,dbnsfp]
      tools_on:
        - vep_splicesite_annotations

This does work, but there's no joint calling, no db's are created, and no results ,except some QC, is copied to the final directory.

Do you have any pointers on how to get where I want to be?

Thanks a lot M

chapmanb commented 6 years ago

Matthias; Thanks for following up on this. It looks like you did everything right in terms of configuration, I just haven't had a chance to work on gVCF inputs and joint calling as part of bcbio yet, so this seems like the expected outcome right now. In terms of pointers, are you looking to develop the features or hoping that we already had that in place? I'd like to try to work on this as part of the CWL transition, which is happening right now, but this would come after some other CWL features we need to add like ensemble calling and full RNA-seq support with variant calling. Hope this helps some with letting you know the current status and priorities, and thanks for pushing for this.

matthdsm commented 6 years ago

Hi Brad,

Thanks for the update! I'd be happy to help develop the needed features, as this is becoming something crucial for us, fast. If you can point me in the right direction, I'll try to make time.

Thanks again M

matthdsm commented 6 years ago

Hi Brad!

Any more updates on this? As I said, I'd love to help implement this.

Cheers M

chapmanb commented 6 years ago

Matthias; Sorry to be slow in getting back with you. I've been struggling to make time to run through and identify the necessary changes. I really appreciate your willingness to help and hope to have more time after the BOSC/GCC conference finishes next week. I'll definitely follow up and hope this works okay with your needs. Thanks again for the patience.

matthdsm commented 6 years ago

Hi Brad,

Thanks for the reply. Sadly, I won't be able to make it to BOSC this year, otherwise I would've loved to meet up. A direct colleague of mine is attending however. Would you be interested in having a small birds of a feather meeting to discuss some features and the future of bcbio? I think there'll be a huge benefit to having a talk IRL. @tomiles, your thoughts on this?

Cheers M

chapmanb commented 6 years ago

Matthias and Tom; Thanks for this. We made great progress during the CoFest and I believe having this working cleanly in the latest development version. It should now use the input gVCFs as specified previously and run joint calling on them, producing a combined VCF with annotations. Your configuration from before would remain identical, this just fixes logic in bcbio itself which skipped joint calling for this case and now will correctly enable it.

Please let us know if you run into any issues and happy to work more. Thanks again for the push to make this happen and all the feedback and testing.

matthdsm commented 6 years ago

Hi Brad,

I just ran a test on a trio, starting from GVCF. Joint calling seems to work nicely. However, we're still missing the rest of the steps. There's only VEP annotation, but no vcfanno and no gemini . Also, it seems like bcbio only uploads the first sample from the config, skipping al the others.

content of the final dir are:

2018-07-02_Proband_16.10218:
bcbio-nextgen-commands.log  data_versions.csv  multiqc                                                 Proband_16_10218-gatk-haplotype-joint-annotated.vcf.gz.tbi  project-summary.yaml
bcbio-nextgen.log           metadata.csv       Proband_16_10218-gatk-haplotype-joint-annotated.vcf.gz  programs.txt

D1712121:
qc

While the Trio vcf header does include all three samples:

##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="GenotypeGVCFs  --output /user/scratch/gent/gvo000/gvo00082/vsc41443/bcbio/tmpUPLMpf/Proband_16_10218-chr1_0_248956422.vcf.gz --use-new-qual-calculator true --variant gendb:///kyukon/data/gent/gvo000/gvo00082/vsc41443/GVCF_test/Proband_16.10218/work/joint/gatk-haplotype-joint/Proband_16_10218/chr1/Proband_16_10218-chr1_0_248956422_genomicsdb --intervals chr1:1-248956422 --reference /kyukon/data/gent/gvo000/gvo00082/vsc41443/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa  --annotate-with-num-discovered-alleles false --heterozygosity 0.001 --indel-heterozygosity 1.25E-4 --heterozygosity-stdev 0.01 --standard-min-confidence-threshold-for-calling 10.0 --max-alternate-alleles 6 --max-genotype-count 1024 --sample-ploidy 2 --annotation-group StandardAnnotation --disable-tool-default-annotations false --only-output-calls-starting-in-intervals false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --disable-tool-default-read-filters false",Version=4.0.3.0,Date="July 2, 2018 8:50:57 AM CEST">
##reference=file:///home/galaxy/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
##source=GenomicsDBImport
##source=GenotypeGVCFs
##VEP="v92" time="2018-07-02 17:21:54" cache="/kyukon/data/gent/gvo000/gvo00082/vsc41443/bcbio/genomes/Hsapiens/hg38/vep/homo_sapiens_merged/92_GRCh38" ensembl=92.98e8548 ensembl-io=92.39280bd ensembl-funcgen=92.cd2ca86 ensembl-variation=92.77a06cf 1000genomes="phase3" COSMIC="83" ClinVar="201802" ESP="V2-SSA137" HGMD-PUBLIC="20174" assembly="GRCh38.p12" dbSNP="150" gencode="GENCODE 28" genebuild="2014-07" gnomAD="170228" polyphen="2.2.2" refseq="2018-02-20 13:24:28 - GCF_000001405.37_GRCh38.p11_genomic.gff" regbuild="16" sift="sift5.2.2"
##LoF=Loss-of-function annotation (HC = High Confidence; LC = Low Confidence)
##LoF_filter=Reason for LoF not being HC
##LoF_flags=Possible warning flags for LoF
##LoF_info=Info used for LoF annotation
##MaxEntScan_alt=MaxEntScan alternate sequence score
##MaxEntScan_diff=MaxEntScan score difference
##MaxEntScan_ref=MaxEntScan reference sequence score
##SpliceRegion=SpliceRegion predictions
##bcftools_viewVersion=1.8+htslib-1.8
##bcftools_viewCommand=view -h Proband_16_10218-gatk-haplotype-joint-annotated.vcf.gz; Date=Tue Jul  3 08:11:00 2018
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  D1611710        D1712121        D1712122
matthdsm commented 6 years ago

Also, this doesn't work with CWL yet, but I suppose that was to be expected?

Cheers M

chapmanb commented 6 years ago

Matthias; Thanks much for testing this out. Tackling the issues you mentioned:

Thanks again for the testing and feedback.

matthdsm commented 6 years ago

Hi Brad,

Thanks! I'll run some more tests right away. As for the 'missing' data. Since I'm running a trio analysis, I supposed there ought to be a directory for each sample, in addition to the 'project' dir with the combined vcf. As it stands now, there's a directory with QC for only one out of three samples for some reason.

Thanks for all the work.

Cheers M

chapmanb commented 6 years ago

Matthias; Thanks for following up. Since you don't have input BAM files there isn't much else that gets generated per sample. We can do some minimal QC on the joint called VCF, which is attached to one of the samples and that's the output you're seeing. So that's expected behavior, and please let us know if anything appears to be missing from the outputs. I'll work on CWL support for this and any other issues you identify. Thanks again.

matthdsm commented 6 years ago

Hi Brad,

I'm pleased to report annotation works like a charm! Thanks a lot for this!

On another note, I have a question about the annotation pipeline as a whole. I noticed bcbio runs VEP right after the joint calling, and again during the gemini step. This seems to me as a waste of resources. I'm aware there are differences between the vcf files used for gemini and the output from GATK, but would it be possible to check if this is necessary?

Is it possible to build in a check to see if gemini/vcfanno is required and then skip the VEP annotation right after genotyping in favor of VEP annotation after decomposition? I could see this shave of a significant amount of analysis time for the complete pipeline.

Thanks again M

chapmanb commented 6 years ago

Matthias; Thanks for checking this and reporting back, I'm glad to hear it's working as expected.

For the VEP double runs, this is a tricky problem I'm not sure how to best handle. There are downsides to both the standard VCF representation and the decomposed one used for GEMINI annotation and researchers often prefer one over the other so we prepare both and make them available for downstream usage. As a result we do end up having to annotate twice for each of the representations. Essentially we want users to be able to pick up either file depending on their needs.

Are you happy only using the decomposed GEMINI VCF, or also take advantage of the original VCF?

I've been thinking about turning off GEMINI as the default and preferring just a non-normalized vcfanno output unless users ask for it. Then you'd typically avoid the extra work unless you explicitly ask for a GEMINI database generated. Would this approach work for you?

I'm curious to hear from other users as well who might have a preference about standard versus normalized VCFs.

Thanks for starting this discussion.

matthdsm commented 6 years ago

Hi Brad,

From our side, we heavily depend on the GVCF and the Gemini output. The intermediate vcf isn't used at all. I have no issues with turning off gemini as a default, as long as the option stays available.

Is it possible to detatch the annotation from other steps and make it dependent on config options? In a CWL setting, I could see this work. For example,

IF gemini:
    [post variant calling]
    - decompose/normalize
    - annotate with VEP/SnpEff
    - annotate VCFanno
    - create GEMINI db
   [output]
    - gvcf
    - unannotated vcf
    - annotated/decomposed vcf
    - gemini db
    - ...
ELSE
    [post variant calling]
    - annotate with VEP/SnpEff
   [output]
    - gvcf
    - annotated vcf
    - ...

I'm probably not the only one depending on the GEMINI output. It'd be interesting to hear what other have to say.

Cheers M