PMBio / deeprvat

Other
35 stars 5 forks source link

An error (related to rule concat_deepSea) when running https://deeprvat.readthedocs.io/en/latest/annotations.html#running-the-pipeline #117

Closed pichuan closed 3 months ago

pichuan commented 3 months ago

Hi,

When I'm running this step in https://deeprvat.readthedocs.io/en/latest/annotations.html#running-the-pipeline:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

I got this error:

[Sat Aug  3 21:38:28 2024]
Error in rule concat_deepSea:
    jobid: 15
    output: output_dir/annotations/all_variants.deepSea.parquet
    shell:
        deeprvat_annotations concatenate-deepsea  output_dir/annotations/all_variants.deepSea.parquet 8
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

It seems like the usage here is missing one flag. Specifically, when I ran that command, I got:

Usage: deeprvat_annotations concatenate-deepsea [OPTIONS] DEEPSEA_FILES
                                                OUT_FILE NJOBS

I searched a bit, and found that in the ../../pipelines/annotations/deepSEA.snakefile file, there is this rule:

rule concat_deepSea:
    input:
        deepSEAscoreFiles=expand(rules.deepSea.output, file_stem=file_stems),
    params:
        joined=lambda w, input: ",".join(input.deepSEAscoreFiles),
    threads: 8
    resources:
        mem_mb=lambda wildcards, attempt: 50_000 * (attempt + 1),
    output:
        anno_dir / "all_variants.deepSea.parquet",
    shell:
        " ".join(
            [
                "deeprvat_annotations",
                "concatenate-deepsea",
                "{params.joined}",
                "{output}",
                "{threads}",
            ]
        )

For whatever reason, it doesn't seem like I have the deepSEAscoreFiles , which I assume should be generated by a previous step(?).

So far I'm still only running through the documentation steps, so I'm not using my own VCF input files yet. (And I did already go through the preprocessing pipeline.)

If you know what I might be missing here, please let me know. Thank you!

Marcel-Mueck commented 3 months ago

hi @pichuan, it looks like your file_stems variable is an empty list for some reason. Are you running it on example data or your own? how does your example/config/deeprvat_annotation_config.yaml look like, especially the source_variant_file_pattern? Does it match with the names of the input data you're using? Regards, Marcel

pichuan commented 3 months ago

Hi @Marcel-Mueck ,

(1) I'm still running through the example input data. At this point, I've finished running https://deeprvat.readthedocs.io/en/latest/preprocessing.html. And according to the documentation there, I believe these were the output from the preprocessing pipeline:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ls ../../example/preprocess/workdir/preprocesed/
genotypes.h5  genotypes_chr21.h5  genotypes_chr22.h5

(2) Here is what's in my deeprvat_annotation_config.yaml:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ grep source_variant_file_pattern ../config/deeprvat_annotation_config.yaml
source_variant_file_pattern : test_vcf_data_c{chr}_b{block}

Actually, at this point, I haven't modified deeprvat_annotation_config.yaml, so it's exactly the same as the original version. I know I'll need to comment out lines like spliceAI and primateAI etc because I won't be using them. I'll do that soon!

(3) I don't know exactly what the source_variant_file_pattern should be point to, so I searched for every files with that pattern under my directory now:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find ../../ -type f -name *test_vcf_data_c*
../../tests/preprocessing/test_data/process_individual_missingness/one_missing/input/indmiss/sites/test_vcf_data_c22_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/one_missing/input/indmiss/sites/test_vcf_data_c21_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/one_missing/input/indmiss/samples/test_vcf_data_c22_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/one_missing/input/indmiss/samples/test_vcf_data_c21_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/one_missing/input/indmiss/stats/test_vcf_data_c22_b1.stats
../../tests/preprocessing/test_data/process_individual_missingness/one_missing/input/indmiss/stats/test_vcf_data_c21_b1.stats
../../tests/preprocessing/test_data/process_individual_missingness/two_missing/input/indmiss/sites/test_vcf_data_c22_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/two_missing/input/indmiss/sites/test_vcf_data_c21_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/two_missing/input/indmiss/samples/test_vcf_data_c22_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/two_missing/input/indmiss/samples/test_vcf_data_c21_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/two_missing/input/indmiss/stats/test_vcf_data_c22_b1.stats
../../tests/preprocessing/test_data/process_individual_missingness/two_missing/input/indmiss/stats/test_vcf_data_c21_b1.stats
../../tests/preprocessing/test_data/process_individual_missingness/no_missing/input/indmiss/sites/test_vcf_data_c22_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/no_missing/input/indmiss/sites/test_vcf_data_c21_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/no_missing/input/indmiss/samples/test_vcf_data_c22_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/no_missing/input/indmiss/samples/test_vcf_data_c21_b1.tsv
../../tests/preprocessing/test_data/process_individual_missingness/no_missing/input/indmiss/stats/test_vcf_data_c22_b1.stats
../../tests/preprocessing/test_data/process_individual_missingness/no_missing/input/indmiss/stats/test_vcf_data_c21_b1.stats
../../example/annotations/input_dir/vcf/test_vcf_data_c21_b1.vcf.gz
../../example/annotations/input_dir/vcf/test_vcf_data_c22_b1.vcf.gz
../../example/preprocess/workdir/norm/variants/test_vcf_data_c22_b1.tsv.gz
../../example/preprocess/workdir/norm/variants/test_vcf_data_c21_b1.tsv.gz
../../example/preprocess/workdir/norm/bcf/test_vcf_data_c21_b1.bcf
../../example/preprocess/workdir/norm/bcf/test_vcf_data_c22_b1.bcf
../../example/preprocess/workdir/norm/sparse/test_vcf_data_c22_b1.tsv.gz
../../example/preprocess/workdir/norm/sparse/test_vcf_data_c21_b1.tsv.gz
../../example/preprocess/workdir/qc/read_depth/test_vcf_data_c22_b1.tsv.gz
../../example/preprocess/workdir/qc/read_depth/test_vcf_data_c21_b1.tsv.gz
../../example/preprocess/workdir/qc/indmiss/sites/test_vcf_data_c22_b1.tsv
../../example/preprocess/workdir/qc/indmiss/sites/test_vcf_data_c21_b1.tsv
../../example/preprocess/workdir/qc/indmiss/samples/test_vcf_data_c22_b1.tsv
../../example/preprocess/workdir/qc/indmiss/samples/test_vcf_data_c21_b1.tsv
../../example/preprocess/workdir/qc/indmiss/stats/test_vcf_data_c22_b1.stats
../../example/preprocess/workdir/qc/indmiss/stats/test_vcf_data_c21_b1.stats
../../example/preprocess/workdir/qc/allelic_imbalance/test_vcf_data_c22_b1.tsv.gz
../../example/preprocess/workdir/qc/allelic_imbalance/test_vcf_data_c21_b1.tsv.gz
../../example/preprocess/workdir/qc/hwe/test_vcf_data_c22_b1.tsv.gz
../../example/preprocess/workdir/qc/hwe/test_vcf_data_c21_b1.tsv.gz
../../example/preprocess/workdir/qc/varmiss/test_vcf_data_c22_b1.tsv.gz
../../example/preprocess/workdir/qc/varmiss/test_vcf_data_c21_b1.tsv.gz
../../example/preprocess/data/vcf/test_vcf_data_c21_b1.vcf.gz
../../example/preprocess/data/vcf/test_vcf_data_c22_b1.vcf.gz

Looking a bit closely, I wonder if this line in the deeprvat_annotation_config.yaml file doesn't match the file structure above?

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ grep source_variant_dir ../config/deeprvat_annotation_config.yaml
source_variant_dir : preprocessing_workdir/norm/bcf

Did I do something wrong in my steps, or is this something that you should update in the example documentation as well? Thanks!

Marcel-Mueck commented 3 months ago

linking #118

Marcel-Mueck commented 3 months ago

Hey @pichuan I updated the docs and performed some small changes to the config in #118 . We will soon merge this with the master branch

Marcel-Mueck commented 3 months ago

Hello @pichuan, the PR #118 is merged now. I corrected some path of the annotation config file: image

I also updated the docs, so when you want to run the example data, I can recommand checking this part: https://deeprvat.readthedocs.io/en/latest/annotations.html#running-the-annotation-pipeline-on-example-data and when you want to run it on your own data, this part is relevent: https://deeprvat.readthedocs.io/en/latest/annotations.html#running-the-pipeline-on-your-own-data Specifically, when you want to run it without the spliceAI and primateAI data, you can modify the additional_vep_plugin_cmds inside the config and add include_absplice: False to the config. Regards, Marcel Mück

pichuan commented 3 months ago

Thank you!

On the same machine, I get the latest version of the code. I reran the preprocessing step first, and then:

I have this at the end of my yaml file:

include_absplice: False
additional_vep_plugin_cmds:
  cadd : CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz
  #spliceAI : SpliceAI,snv=annotation_data/spliceAI/spliceai_scores.raw.snv.hg38.vcf.gz,indel=annotation_data/spliceAI/spliceai_scores.raw.indel.hg38.vcf.gz
  #primateAI : PrimateAI,annotation_data/primateAI/PrimateAI_scores_v0.2_GRCh38_sorted.tsv.bgz
  condel: Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2
  alphamissense : AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz

Then I ran this command, and got this error:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda
Building DAG of jobs...
MissingInputException in rule vep  in line 246 of /home/pichuan/deeprvat/pipelines/annotations.snakefile:
Missing input files for rule vep:
    output: output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv
    wildcards: file_stem=test_vcf_data_c21_b1
    affected files:
        reference/GRCh38.primary_assembly.genome.fa

real    0m0.670s
user    0m0.561s
sys 0m0.109s

Which seems to say that my vep step is anticipating like output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv.

I did a search in my directory, and didn't find a filename like that:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find ../../ -name test_vcf_data_c21_b1_vep_anno.tsv

I can try to get a new machine and completely from scratch. Do you think that would solve the problem?

Thanks!

Marcel-Mueck commented 3 months ago

Hi @pichuan, looks like the GRCh38.primary_assembly.genome.fa file is missing in your example/annotations/reference directory. You can download it here(don't forget that you have to unzip it (gzip -d)). Alternatively, you can change the fasta_dir : reference line in the config to fasta_dir : ../preprocess/workdir/reference(to use the same fasta dir as in the preprocessing pipeline). Btw, don't forget that you need a GTF file as well in the same directory as the fasta file, which you can download here.

pichuan commented 3 months ago

Thanks @Marcel-Mueck ! I downloaded that in the preprocess step, but didn't quite realize it wasn't pointing to the same one. Thanks for pointing that out. I found it here:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find ../../ -name GRCh38.primary_assembly.genome.fa
../../example/preprocess/workdir/reference/GRCh38.primary_assembly.genome.fa

It seems like I already have the GTF file:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ls ../../example/annotations/reference
gencode.v44.annotation.gtf.gz  hg38.fa

For convenience, I just did:

ln -sf $HOME/deeprvat/example/preprocess/workdir/reference/GRCh38.primary_assembly.genome.fa $HOME/deeprvat/example/annotations/reference/GRCh38.primary_assembly.genome.fa 

Then I ran this again:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

This time it made a bit more progress. This one failed:

[Thu Aug  8 17:16:07 2024]
Error in rule vep:
    jobid: 13
    output: output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv
    shell:
            vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/cache --dir_plugins repo_dir/ensembl-vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

From the command above, I wonder if it requires repo_dir to be under the same directory as my current directory, so I tried:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ln -sf $HOME/repo_dir .
(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

This time it failed with:

[Thu Aug  8 17:19:44 2024]
Error in rule deepSea:
    jobid: 31
    output: output_dir/annotations/test_vcf_data_c22_b1.CLI.deepseapredict.diff.tsv
    conda-env: kipoi-veff2
    shell:
        kipoi_veff2_predict output_dir/annotations/tmp/test_vcf_data_c22_b1_variants_header.vcf.gz reference/GRCh38.primary_assembly.genome.fa output_dir/annotations/test_vcf_data_c22_b1.CLI.deepseapredict.diff.tsv -l 1000 -m 'DeepSEA/predict' -s 'diff'
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Then I tried to directly run that command:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ kipoi_veff2_predict output_dir/annotations/tmp/test_vcf_data_c22_b1_variants_header.vcf.gz reference/GRCh38.primary_assembly.genome.fa output_dir/annotations/test_vcf_data_c22_b1.CLI.deepseapredict.diff.tsv -l 1000 -m 'DeepSEA/predict' -s 'diff'

But it told me: kipoi_veff2_predict: command not found

I'm pretty sure previously I installed that on this machine. I'll go back and check. I'll let you know how it goes. But if it's clear what I'm missing here, please let me know.

pichuan commented 3 months ago

I went back and checked:

  1. If I activate the other one: mamba activate kipoi-veff2, then I have that binary in the environment:
    
    $ kipoi_veff2_predict
    Usage: kipoi_veff2_predict [OPTIONS] INPUT_VCF INPUT_FASTA OUTPUT_TSV
    Try 'kipoi_veff2_predict --help' for help.

Error: Missing argument 'INPUT_VCF'.


2. I checked [the full log](https://github.com/user-attachments/files/16551551/2024-08-08T173217.877294.snakemake.log). It seems to have `Activating conda environment: kipoi-veff2`, so I'd expect it to work. However, I'm still getting:

[Thu Aug 8 17:32:48 2024] Error in rule deepSea: jobid: 29 output: output_dir/annotations/test_vcf_data_c21_b1.CLI.deepseapredict.diff.tsv conda-env: kipoi-veff2 shell: kipoi_veff2_predict output_dir/annotations/tmp/test_vcf_data_c21_b1_variants_header.vcf.gz reference/GRCh38.primary_assembly.genome.fa output_dir/annotations/test_vcf_data_c21_b1.CLI.deepseapredict.diff.tsv -l 1000 -m 'DeepSEA/predict' -s 'diff' (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)


3. Then, I decided to try switching the environment myself and see it would run:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ mamba activate kipoi-veff2 (kipoi-veff2) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ kipoi_veff2_predict output_dir/annotations/tmp/test_vcf_data_c21_b1_variants_header.vcf.gz reference/GRCh38.primary_assembly.genome.fa output_dir/annotations/test_vcf_data_c21_b1.CLI.deepseapredict.diff.tsv -l 1000 -m 'DeepSEA/predict' -s 'diff'


Here is the full log from my attempt in step 3:

Adding diff from kipoi_veff2.scores Already up to date. Using downloaded and verified file: /home/pichuan/.kipoi/models/DeepSEA/predict/downloaded/model_files/weights/89e640bf6bdbe1ff165f484d9796efc7 Traceback (most recent call last): File "/home/pichuan/miniforge3/envs/kipoi-veff2/bin/kipoi_veff2_predict", line 33, in sys.exit(load_entry_point('kipoi-veff2', 'console_scripts', 'kipoi_veff2_predict')()) File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/click/core.py", line 760, in invoke return __callback(args, kwargs) File "/home/pichuan/repo_dir/kipoi-veff2/kipoi_veff2/cli.py", line 154, in score_variants model_config, input_vcf, input_fasta, output_tsv, scoring_function File "/home/pichuan/repo_dir/kipoi-veff2/kipoi_veff2/variant_centered.py", line 236, in score_variants kipoi_model = kipoi.get_model(model_config.model) File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/kipoi/model.py", line 184, in get_model mod = infer_pyt_class(md.args)(md.args) File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/kipoi/model.py", line 904, in init import torch File "/home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/torch/init.py", line 218, in from torch._C import * # noqa: F403 ImportError: /home/pichuan/miniforge3/envs/kipoi-veff2/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent



Any advice for me to run that step correctly? Thank you!
Marcel-Mueck commented 3 months ago

Looks like it could have something to do with this pytorch bug. Could you run mamba list|grep mkl inside thekipoi-veff2 env to check which version of mkl is installed in this env? The bug is in version mkl==2024.1.0 you may have to change the version of mklin the kipoi-veff2 env

pichuan commented 3 months ago

Here is what I have:

(kipoi-veff2) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ mamba list|grep mkl
blas                      1.0                         mkl    conda-forge
libblas                   3.9.0            23_linux64_mkl    conda-forge
libcblas                  3.9.0            23_linux64_mkl    conda-forge
liblapack                 3.9.0            23_linux64_mkl    conda-forge
mkl                       2024.1.0           ha957f24_693    conda-forge

Any suggestions on how I change the mld version in that environment? I searched for the files under the directory, but it seems like it was not explicitly set in any config files there:

(kipoi-veff2) pichuan@pichuan-gpu:~/repo_dir/kipoi-veff2$ find . -type f -exec grep -H mkl {} \;
grep: ./.git/objects/pack/pack-a6cbadf6ef55db9b5ff5c1622e55771b320ad96e.pack: binary file matches
pichuan commented 3 months ago

I tried this manually:

(kipoi-veff2) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ mamba install mkl==2024.0

which now seems to work:

(kipoi-veff2) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ kipoi_veff2_predict output_dir/annotations/tmp/test_vcf_data_c21_b1_variants_header.vcf.gz reference/GRCh38.primary_assembly.genome.fa output_dir/annotations/test_vcf_data_c21_b1.CLI.deepseapredict.diff.tsv -l 1000 -m 'DeepSEA/predict' -s 'diff'
Adding diff from kipoi_veff2.scores
Already up to date.
Using downloaded and verified file: /home/pichuan/.kipoi/models/DeepSEA/predict/downloaded/model_files/weights/89e640bf6bdbe1ff165f484d9796efc7
(kipoi-veff2) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ls -lh output_dir/annotations/test_vcf_data_c21_b1.CLI.deepseapredict.diff.tsv
-rw-rw-r-- 1 pichuan pichuan 124K Aug  9 23:01 output_dir/annotations/test_vcf_data_c21_b1.CLI.deepseapredict.diff.tsv

I'll keep going from here.

It'll be nice to figure out how to have this fix in a config file, or as a step that I write into a script. I guess - worst case I could just write down what I did in a script.

(For context: My hope here is that I can write all the steps in a script, so that eventually I can use this to automatically set up a machine, and eventually so that I can scale this to run on more files automatically (bring up instances, point to an input, set up the environment, etc all through just one script))

pichuan commented 3 months ago

Worklog:

OK, now I'm got around the pytorch error, I ran this command again:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

Here is a next blocker:

[Fri Aug  9 23:05:20 2024]
Error in rule vep:
    jobid: 13
    output: output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv
    shell:
            vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/cache --dir_plugins repo_dir/ensembl-vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

When I ran that command directly, I got this error:

MSG: ERROR: Cache directory repo_dir/ensembl-vep/cache/homo_sapiens not found

I searched for it:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find -L $HOME/deeprvat  -name cache -type d
/home/pichuan/deeprvat/example/annotations/repo_dir/ensembl-vep/t/testdata/cache

Not exactly sure why, but the path is different. So I did:

sed -i -e 's|vep_cache_dir : repo_dir/ensembl-vep/cache/|vep_cache_dir: repo_dir/ensembl-vep/t/testdata/cache|g' ../config/deeprvat_annotation_config.yaml

and try this again:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

Still seems to have a few errors:

[Fri Aug  9 23:14:14 2024]
Error in rule vep:
    jobid: 13
    output: output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv
    shell:
            vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/t/testdata/cache --dir_plugins repo_dir/ensembl-vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Fri Aug  9 23:14:14 2024]
Error in rule vep:
    jobid: 20
    output: output_dir/annotations/test_vcf_data_c22_b1_vep_anno.tsv
    shell:
            vep --input_file output_dir/annotations/tmp/test_vcf_data_c22_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c22_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/t/testdata/cache --dir_plugins repo_dir/ensembl-vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[Fri Aug  9 23:14:18 2024]
Error in rule deepSea_PCA:
    jobid: 27
    output: output_dir/annotations/tmp/deepSEA_PCA/deepsea_pca.parquet
    shell:
        mkdir -p output_dir/annotations/tmp/deepSEA_PCA && deeprvat_annotations deepsea-pca output_dir/annotations/all_variants.deepSea.parquet output_dir/annotations/tmp/deepSEA_PCA/pca.npy output_dir/annotations/tmp/deepSEA_PCA/deepSEA_means_SDs.parquet output_dir/annotations/tmp/deepSEA_PCA --n-components 100
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I ran this directly, and got a few more messages that's useful to follow up on :

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/t/testdata/cache --dir_plugins repo_dir/ensembl-vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz
Smartmatch is experimental at /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 472.
WARNING: Failed to instantiate plugin CADD: ERROR: Data file annotation_data/cadd/whole_genome_SNVs.tsv.gz not found

WARNING: Failed to compile plugin AlphaMissense: Can't locate AlphaMissense.pm in @INC (you may need to install the AlphaMissense module) (@INC contains: repo_dir/ensembl-vep/Plugins /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0 /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/5.32/site_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/site_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/5.32/vendor_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/vendor_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/5.32/core_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/core_perl .) at (eval 46) line 2.
BEGIN failed--compilation aborted at (eval 46) line 2.

-------------------- EXCEPTION --------------------
MSG: ERROR: No cache found for homo_sapiens, version 110

STACK Bio::EnsEMBL::VEP::CacheDir::dir /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/CacheDir.pm:322
STACK Bio::EnsEMBL::VEP::CacheDir::init /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/CacheDir.pm:219
STACK Bio::EnsEMBL::VEP::CacheDir::new /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/CacheDir.pm:111
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:116
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:92
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:170
STACK Bio::EnsEMBL::VEP::Runner::init /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/Runner.pm:128
STACK Bio::EnsEMBL::VEP::Runner::run /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/Runner.pm:200
STACK toplevel /home/pichuan/miniforge3/envs/deeprvat_annotations/bin/vep:46
Date (localtime)    = Fri Aug  9 23:15:30 2024
Ensembl API version = 110
---------------------------------------------------

It seems like I don't have whole_genome_SNVs.tsv.gz, so I looked for it. It's because I didn't have annotation_data in the right place, so I just linked it and reran:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ find $HOME -type f -name whole_genome_SNVs.tsv.gz*
/home/pichuan/annotation_data/cadd/whole_genome_SNVs.tsv.gz
/home/pichuan/annotation_data/cadd/whole_genome_SNVs.tsv.gz.tbi
(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ln -sf $HOME/annotation_data .

Try again:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/t/testdata/cache --dir_plugins repo_dir/ensembl-vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz
Smartmatch is experimental at /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 472.
WARNING: Failed to compile plugin AlphaMissense: Can't locate AlphaMissense.pm in @INC (you may need to install the AlphaMissense module) (@INC contains: repo_dir/ensembl-vep/Plugins /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0 /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/5.32/site_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/site_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/5.32/vendor_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/vendor_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/5.32/core_perl /home/pichuan/miniforge3/envs/deeprvat_annotations/lib/perl5/core_perl .) at (eval 46) line 2.
BEGIN failed--compilation aborted at (eval 46) line 2.

So, I think next I'll need to install AlphaMissense. I'll try that and see if I can proceed.

pichuan commented 3 months ago

Worklog : install AlphaMissense

I went back and did this:

(deeprvat_annotations) pichuan@pichuan-gpu:~/repo_dir/ensembl-vep$ perl INSTALL.pl

I went through the steps, and made sure to install this Plugins this time:

  34: AlphaMissense            - Annotates missense variants with the pre-computed AlphaMissense pathogenicity scores. AlphaMissense is a deep learning model developed by Google DeepMind that predicts the pathogenicity of single nucleotide missense variants.

Log:

 - installing "AlphaMissense"
AlphaMissense already installed; do you want to overwrite (probably OK if updating) (y/n)? y
 - This plugin requires data
 - See /home/pichuan/.vep/Plugins/AlphaMissense.pm for details
 - OK

NB: One or more plugins that you have installed will not work without installation or downloading data; see logs above

All done

Trying again:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

This is when I realize the problem is where the Plugins is located. Instead of vep_plugin_dir : repo_dir/ensembl-vep/Plugins, it seems like I should have vep_plugin_dir : /home/pichuan/.vep/Plugins. Let me try that:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ sed -i -e 's|vep_plugin_dir : repo_dir/ensembl-vep/Plugins|vep_plugin_dir : /home/pichuan/.vep/Plugins|g' ../config/deeprvat_annotation_config.yaml

while doing that, I think this path also looks wrong:

  condel: Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2

So I changed that too:

sed -i -e 's|condel: Condel,repo_dir/ensembl-vep/Plugin/config/Condel/config,s,2|condel: Condel,/home/pichuan/miniforge3/pkgs/ensembl-vep-110.1-pl5321h2a3209d_0/share/ensembl-vep-110.1-0/config/Condel/config,s,2|g' ../config/deeprvat_annotation_config.yaml

Try the vep command again, it seems like I'm missing this file:

WARNING: Failed to instantiate plugin AlphaMissense: ERROR: Data file annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz not found

Hmm, when I install AlphaMissense earlier, it did say that One or more plugins that you have installed will not work without installation or downloading data. So I read /home/pichuan/.vep/Plugins/AlphaMissense.pm -> it mentioned AlphaMissense_hg38.tsv.gz but didn't really mention where to download it... so I just searched, and found https://www.biostars.org/p/9580275/

so I did:

wget https://storage.googleapis.com/dm_alphamissense/AlphaMissense_hg38.tsv.gz
tabix -s 1 -b 2 -e 2 -f -S 1 AlphaMissense_hg38.tsv.gz
mkdir -p annotation_data/AlphaMissense # I ln -sf annotation_data here earlier.
mv AlphaMissense_hg38.tsv.gz* annotation_data/AlphaMissense/

OK, now I tried again:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ time snakemake -j $(nproc) -s ../../pipelines/annotations.snakefile --configfile ../config/deeprvat_annotation_config.yaml --use-conda

Still having some error at the vep step, so I tried directly running that step:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/t/testdata/cache --dir_plugins /home/pichuan/.vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,/home/pichuan/miniforge3/pkgs/ensembl-vep-110.1-pl5321h2a3209d_0/share/ensembl-vep-110.1-0/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz

Here is the error:

Smartmatch is experimental at /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 472.
readline() on closed filehandle SIFT at /home/pichuan/.vep/Plugins/Condel.pm line 160.
readline() on closed filehandle POLYPHEN at /home/pichuan/.vep/Plugins/Condel.pm line 171.

-------------------- EXCEPTION --------------------
MSG: ERROR: No cache found for homo_sapiens, version 110

STACK Bio::EnsEMBL::VEP::CacheDir::dir /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/CacheDir.pm:322
STACK Bio::EnsEMBL::VEP::CacheDir::init /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/CacheDir.pm:219
STACK Bio::EnsEMBL::VEP::CacheDir::new /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/CacheDir.pm:111
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:116
STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:92
STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:170
STACK Bio::EnsEMBL::VEP::Runner::init /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/Runner.pm:128
STACK Bio::EnsEMBL::VEP::Runner::run /home/pichuan/miniforge3/envs/deeprvat_annotations/share/ensembl-vep-110.1-0/modules/Bio/EnsEMBL/VEP/Runner.pm:200
STACK toplevel /home/pichuan/miniforge3/envs/deeprvat_annotations/bin/vep:46
Date (localtime)    = Fri Aug  9 23:46:14 2024
Ensembl API version = 110
---------------------------------------------------

@Marcel-Mueck : question for you: Is it clear to you what this latest error indicates that I'm missing?

MSG: ERROR: No cache found for homo_sapiens, version 110 seems suspicious. I do have a homo_sapiens directory under cache, but maybe the directory was wrong? Here is what I have:

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ ls repo_dir/ensembl-vep/t/testdata/cache/homo_sapiens
84_GRCh38

Thank you!

One follow up on this: After reading https://github.com/Ensembl/ensembl-vep/issues/1716 , I tried adding --cache_version 84 to my vep command. That seems to have worked. However, I'm confused why my cache has version 84 while the default cache was a different version.

@Marcel-Mueck : I'd love to know your thoughts on whether using --cache_version 84 is the right way to go, or if I should somehow update the cache version (not sure how). Thanks!

Marcel-Mueck commented 3 months ago

Dear @pichuan regarding the alpha missense plugin data, the link you got the data from looks correct, it looks like the same link as in the annotation docs: https://deeprvat.readthedocs.io/en/latest/annotations.html#requirements (under AlphaMissense). Regarding the VEP cache, VEP strongly suggests using the same cache version as the VEP software installed (as mentioned here): image So if you use version 110 of vep (as was used for the paper version of deeprvat) you would have to download (and unpack ) this cache inside repo_dir/ensembl-vep/cache: https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz

pichuan commented 3 months ago

Thanks. In my case, somehow my cache directory is repo_dir/ensembl-vep/t/testdata/cache, so I did:

(deeprvat_annotations) pichuan@pichuan-gpu:~/repo_dir/ensembl-vep/t/testdata/cache$ aria2c -c -x10 -s10 -d . https://ftp.ensembl.org/pub/release-110/variation/indexed_vep_cache/homo_sapiens_vep_110_GRCh38.tar.gz
(deeprvat_annotations) pichuan@pichuan-gpu:~/repo_dir/ensembl-vep/t/testdata/cache$ tar xvfz homo_sapiens_vep_110_GRCh38.tar.gz

Now I have:

(deeprvat_annotations) pichuan@pichuan-gpu:~/repo_dir/ensembl-vep/t/testdata/cache$ ls homo_sapiens
110_GRCh38  84_GRCh38

After that, I was able to run the following command without --cache_version 84!

(deeprvat_annotations) pichuan@pichuan-gpu:~/deeprvat/example/annotations$ vep --input_file output_dir/annotations/tmp/test_vcf_data_c21_b1_stripped.vcf.gz --output_file output_dir/annotations/test_vcf_data_c21_b1_vep_anno.tsv --species homo_sapiens --assembly GRCh38 --format vcf  --offline --cache --dir_cache repo_dir/ensembl-vep/t/testdata/cache --dir_plugins /home/pichuan/.vep/Plugins --fork 5 --fasta reference/GRCh38.primary_assembly.genome.fa --tab --total_length --no_escape --polyphen s --sift s --canonical --protein --biotype --force_overwrite --no_stats --per_gene --pick_order biotype,mane_select,mane_plus_clinical,canonical,appris,tsl,ccds,rank,length,ensembl,refseq --plugin CADD,annotation_data/cadd/whole_genome_SNVs.tsv.gz,annotation_data/cadd/gnomad.genomes.r3.0.indel.tsv.gz --plugin Condel,/home/pichuan/miniforge3/pkgs/ensembl-vep-110.1-pl5321h2a3209d_0/share/ensembl-vep-110.1-0/config/Condel/config,s,2 --plugin AlphaMissense,file=annotation_data/AlphaMissense/AlphaMissense_hg38.tsv.gz

Thanks. Now I'm having a different errors in the deeprvat_annotations step. I'll continue that error report in https://github.com/PMBio/deeprvat/issues/121, and I'll close this one for now.