Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
445 stars 151 forks source link

Problem with the VEP ANNOTATE snakemake wrapper #837

Closed moldach closed 4 years ago

moldach commented 4 years ago

I'm trying to adapt my regular VEP code to use the snakemake wrapper instead but am running into an issue.

I want to make sure that a) the wrapper works for me and b) it produces the same results as the following:

vep \
        -i {input.sample} \
        --species "caenorhabditis_elegans" \
        --format "vcf" \
        --everything \
        --offline \
        --force_overwrite \
        --fasta {input.ref} \
        --gff {input.annot} \
        --tab \
        --variant_class \
        --regulatory \
        --show_ref_allele \
        --numbers \
        --symbol \
        --protein \
        -o {params.sample}

In order to use VEP with wrappers there are 3 different rules. I have got the following two working:

# VEP Download Plugins
rule download_vep_plugins:
    output:
        directory("resources/vep/plugins")
    params:
        release=100
    wrapper:
        "0.64.0/bio/vep/plugins"

# VEP Cache
rule get_vep_cache:
    output:
        directory("resources/vep/cache")
    params:
        species="caenorhabditis_elegans",
        build="WBcel235",
        release="100"
    log:
        "logs/vep/cache.log"
    wrapper:
        "0.64.0/bio/vep/cache"

The third rule is to actually run VEP for which I've written the following rule:

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
    params:
        plugins=["LoFtool"],
        extra="--everything"
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"

When submitting the job this is the error I recieve:

Building DAG of jobs...
Using shell: /cvmfs/soft.computecanada.ca/nix/var/nix/profiles/16.09/bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       variant_annotation
        1

[Tue Aug 25 11:23:04 2020]
Job 0: --- Annotating Variants.

Activating conda environment: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f
Failed to open VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf: unknown file type
Possible precedence issue with control flow operator at /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
Traceback (most recent call last):
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpm4v6gdij.wrapper.py", line 44, in <module>
    "(bcftools view {snakemake.input.calls} | "
  File "/home/moldach/bin/snakemake/lib/python3.8/site-packages/snakemake/shell.py", line 156, in __new__
    raise sp.CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'set -euo pipefail;  (bcftools view VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cache --cache_version 100 --species caenorhabditis_elegans --assembly WBcel235 --dir_cache resources/vep/cache --dir_plugins resources/vep/plugins --offline --plugin LoFtool --output_file STDOUT --stats_file ANNOTATION/VEP/470.html | bcftools view -Ov > ANNOTATION/VEP/470.annotated.vcf)' returned non-zero exit status 1.
[Tue Aug 25 11:25:02 2020]
Error in rule variant_annotation:
    jobid: 0
    output: ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/470.html
    conda-env: /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f

RuleException:
CalledProcessError in line 393 of /scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile:
Command 'source /home/moldach/miniconda3/bin/activate '/scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/conda/f16fdb5f'; set -euo pipefail;  python /scratch/moldach/MADDOG/VCF-FILES/biostars439754/.snakemake/scripts/tmpm4v6gdij.wrapper.py' returned non-zero exit status 1.
  File "/scratch/moldach/MADDOG/VCF-FILES/biostars439754/Snakefile", line 393, in __rule_variant_annotation
  File "/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/python/3.8.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
Removing output files of failed job variant_annotation since they might be corrupted:
ANNOTATION/VEP/470.annotated.vcf, ANNOTATION/VEP/470.html
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

This issue was originally brought up on StackOverflow last week but hasn't been solved yet so I'm bringing it up here.

ima23 commented 4 years ago

Hi @moldach,

If I understand correctly, the command line VEP runs as expected and you are looking into using snakemake to run VEP. In this case the page below might be of help to you on how to set up the wrappers: https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/vep.html .

Kind regards, Irina

moldach commented 4 years ago

Hi @ima23 I included all three of those links in my message...

To be clear, I already have two of the three wrappers from that link working:

It was only the third wrapper I'm having issues with; please see above the part:

The third rule is to actually run VEP for which I've written the following rule:

Note that input, output and log file paths can be chosen freely when running with snakemake --use-conda so the input for both rules was the same:

Shell (working)

rule variant_annotation:
    input:
        sample = lambda wildcards: getVCFs(wildcards.sample),
        ref = config["REF_GENOME"],
        annot = "config["GENOME_ANNOTATION"]
    output:
        os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.ann.vcf"),
        os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.ann.vcf_summary.html"),
        os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.ann.vcf_warnings.txt")
    log: os.path.join(dirs_dict["LOG_DIR"],config["ANNOT_TOOL"],"{sample}.log")
    params:
        sample = os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.ann.vcf")
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 1
    shell: """
        source ~/bin/activate_my_perl.sh
        module load htslib
        ~/projects/def-mtarailo/common/tools/ensembl-vep/vep \
        -i {input.sample} \
        --species "caenorhabditis_elegans" \
        --format "vcf" \
        --everything \
        --offline \
        --force_overwrite \
        --fasta {input.ref} \
        --gff {input.annot} \
        --tab \
        --variant_class \
        --regulatory \
        --show_ref_allele \
        --numbers \
        --symbol \
        --protein \
        -o {params.sample}
        """

Wrapper (not working)

rule variant_annotation:
    input:
        calls= lambda wildcards: getVCFs(wildcards.sample),
        cache="resources/vep/cache",
        plugins="resources/vep/plugins",
    output:
        calls=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.annotated.vcf"),
        stats=os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}.html")
    params:
        plugins="",
        extra=""
    message: """--- Annotating Variants."""
    resources:
        mem = 30000,
        time = 120
    threads: 4
    wrapper:
        "0.64.0/bio/vep/annotate"

This seems to be the command run via the wrapper:

subprocess.CalledProcessError: Command 'set -euo pipefail;  (bcftools view VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf | vep --everything --fork 4 --format vcf --vcf --cache --cache_version 100 --species caenorhabditis_elegans --assembly WBcel235 --dir_cache resources/vep/cache --dir_plugins resources/vep/plugins --offline --plugin LoFtool --output_file STDOUT --stats_file ANNOTATION/VEP/470.html | bcftools view -Ov > ANNOTATION/VEP/470.annotated.vcf)' returned non-zero exit status 1.

And the pertinent information from that error message (above) was:

Failed to open VARIANT_CALLING/varscan/470_sorted_dedupped_snp_varscan.vcf: unknown file type

This is odd - as I mentioned the same input was used with shell and worked?

Thanks

ima23 commented 4 years ago

Hi @moldach,

I understand your issue and sorry for not following the links to the posts to see it was the same related post.

Have you tried contacting the authors of the snakemake wrapper to check if they can help you with it? Unfortunately it sounds like it is a snakemake setup issue, when the file is not found before it is sent to VEP for analysis, and not a VEP issue; in which case they should be able to help. You could double check the getVCFs call and how VARIANT_CALLING is set/used.

If you have other VEP questions please let us know.

Kind regards, Irina

moldach commented 4 years ago

Hi @ima23 thanks for your feedback. I've followed up with the author @johanneskoester

ima23 commented 4 years ago

Thanks @moldach. I will close this ticket for now, do open it if you have related questions or open a new one. Kind regards, Irina