WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

DRAMv via Snakemake #107

Closed shaman-narayanasamy closed 2 years ago

shaman-narayanasamy commented 2 years ago

Dear authors,

I am attempting to run the entire viral identification SOP on the test dataset.

I am implementing the workflow with Snakemake. To ensure reproducibility, I am running the protocol using Snakemake's --use-conda parameter (with the relevant conda environment .yml files, of course). When run interactively, DRAM-v works with no issues:

$ cd myworkdirectory
$ DRAM-v.py annotate -i VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa -v VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab -o VS2/Virsorter2/dramv-annotate --skip_trnascan --threads 6 --min_contig_size 1000
2021-08-09 16:16:20.025052: Viral annotation started
0:00:00.160339: Retrieved database locations and descriptions
0:00:00.160439: Annotating 1__viral_gt_0__full-cat_2
0:00:00.332787: Turning genes from prodigal to mmseqs2 db
0:00:02.490048: Getting hits from kofam
... <skipped> ...
0:11:11.905275: Merging ORF annotations
/home/users/snarayanasamy/miniconda3/envs/vs2/lib/python3.8/site-packages/mag_annotator/annotate_bins.py:578: UserWarning: No rRNAs were detected, no rrnas.tsv file will be created.
  warnings.warn('No rRNAs were detected, no rrnas.tsv file will be created.')
0:11:12.515578: Annotations complete, processing annotations
0:11:12.585182: Annotations complete, assigning auxiliary scores and flags
/home/users/snarayanasamy/miniconda3/envs/vs2/lib/python3.8/site-packages/mag_annotator/annotate_vgfs.py:135: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  virsorter_genes['start_position'] = virsorter_genes['start_position'].astype(int)
/home/users/snarayanasamy/miniconda3/envs/vs2/lib/python3.8/site-packages/mag_annotator/annotate_vgfs.py:136: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  virsorter_genes['end_position'] = virsorter_genes['end_position'].astype(int)
0:11:12.952644: Completed annotations

However, when I run it within a Snakemake workflow, it fails:

$ cd mysnakemakedirectory
$ TS_DIR="/scratch/users/snarayanasamy/test_data/Virsorter2/Assemblies" TS_SAMPLES="VS2" MGE_OUTDIR="/scratch/users/snarayanasamy/test_data/Virsorter2/results" snakemake -j 6 --use-conda -ps workflows/MgePrediction virsorter2_prediction_workflow.done
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       VS2_PREDICTION
        1       run_dramv
        2

[Mon Aug  9 16:33:48 2021]
rule run_dramv:
    input: VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa, VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab
    output: VS2/Virsorter2/dramv-annotate/annotations.tsv, VS2/Virsorter2/dramv-annotate, VS2/Virsorter2/dramv-distill
    log: logs/VS2/dramv.log
    jobid: 2
    benchmark: benchmarks/VS2/dramv.txt
    wildcards: ts_sample=VS2
    threads: 6

        DRAM-v.py annotate -i VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa -v VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab -o VS2/Virsorter2/dramv-annotate --skip_trnascan --threads 6 --min_contig_size 1000

        #DRAM-v.py distill -i VS2/Virsorter2/dramv-annotate/annotations.tsv -o VS2/Virsorter2/dramv-distill

Activating conda environment: /mnt/lscratch/users/snarayanasamy/test_data/Virsorter2/results/.snakemake/conda/79e3c284e88181bbf3ed7bb38e53cda9
2021-08-09 16:34:40.717025: Viral annotation started
Traceback (most recent call last):
  File "/mnt/lscratch/users/snarayanasamy/test_data/Virsorter2/results/.snakemake/conda/79e3c284e88181bbf3ed7bb38e53cda9/bin/DRAM-v.py", line 140, in <module>
    args.func(**args_dict)
  File "/mnt/lscratch/users/snarayanasamy/test_data/Virsorter2/results/.snakemake/conda/79e3c284e88181bbf3ed7bb38e53cda9/lib/python3.9/site-packages/mag_annotator/annotate_vgfs.py", line 383, in annotate_vgfs
    mkdir(output_dir)
FileExistsError: [Errno 17] File exists: 'VS2/Virsorter2/dramv-annotate'
[Mon Aug  9 16:34:41 2021]
Error in rule run_dramv:
    jobid: 2
    output: VS2/Virsorter2/dramv-annotate/annotations.tsv, VS2/Virsorter2/dramv-annotate, VS2/Virsorter2/dramv-distill
    log: logs/VS2/dramv.log (check log file(s) for error message)
    conda-env: /mnt/lscratch/users/snarayanasamy/test_data/Virsorter2/results/.snakemake/conda/79e3c284e88181bbf3ed7bb38e53cda9
    shell:

        DRAM-v.py annotate -i VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa -v VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab -o VS2/Virsorter2/dramv-annotate --skip_trnascan --threads 6 --min_contig_size 1000

        #DRAM-v.py distill -i VS2/Virsorter2/dramv-annotate/annotations.tsv -o VS2/Virsorter2/dramv-distill

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job run_dramv since they might be corrupted:
VS2/Virsorter2/dramv-annotate
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/irisgpfs/users/snarayanasamy/repositories/gitlab/LAO_multiomics_CRISPR_iMGEs/.snakemake/log/2021-08-09T163348.739113.snakemake.log

I also attempted loading the conda environment and launching without the --use-conda parameter.

Do you have any idea why this happens, and whether there is a way to circumvent it?

Best regards, Shaman

rmFlynn commented 2 years ago

It looks like DRAM-v wants to create the folder 'VS2/Virsorter2/dramv-annotate' itself; it does not understand that this work has already been done for it. If you declare the output as something like directory('VS2/Virsorter2/dramv-annotate'), other_output, the folder should be removed by Snakemake's cleanup process, thus preventing the file-exists error. Let me know if that helps.

shaman-narayanasamy commented 2 years ago

Hi @rmFlynn,

Thanks for the quick response!

Indeed, I used the directory as a Snakemake output. I have now removed those folders from the output directive, and here is the error I currently receive:

$ TS_DIR="/scratch/users/snarayanasamy/test_data/Virsorter2/Assemblies" TS_SAMPLES="VS2" MGE_OUTDIR="/scratch/users/snarayanasamy/test_data/Virsorter2/results" snakemake -j 6 -ps workflows/MgePrediction virsorter2_prediction_workflow.done
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 6
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
        count   jobs
        1       VS2_PREDICTION
        1       run_dramv
        2

[Mon Aug  9 20:06:44 2021]
rule run_dramv:
    input: VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa, VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab
    output: VS2/Virsorter2/dramv-annotate/annotations.tsv
    log: logs/VS2/dramv.log
    jobid: 2
    benchmark: benchmarks/VS2/dramv.txt
    wildcards: ts_sample=VS2
    threads: 6

        DRAM-v.py annotate -i VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa -v VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab -o VS2/Virsorter2/dramv-annotate --skip_trnascan --threads 6 --min_contig_size 1000

        DRAM-v.py distill -i VS2/Virsorter2/dramv-annotate/annotations.tsv -o VS2/Virsorter2/dramv-distill

2021-08-09 20:06:45.583115: Viral annotation started
Traceback (most recent call last):
  File "/home/users/snarayanasamy/miniconda3/envs/vs2/bin/DRAM-v.py", line 140, in <module>
    args.func(**args_dict)
  File "/home/users/snarayanasamy/miniconda3/envs/vs2/lib/python3.8/site-packages/mag_annotator/annotate_vgfs.py", line 383, in annotate_vgfs
    mkdir(output_dir)
FileExistsError: [Errno 17] File exists: 'VS2/Virsorter2/dramv-annotate'
[Mon Aug  9 20:06:45 2021]
Error in rule run_dramv:
    jobid: 2
    output: VS2/Virsorter2/dramv-annotate/annotations.tsv
    log: logs/VS2/dramv.log (check log file(s) for error message)
    shell:

        DRAM-v.py annotate -i VS2/Virsorter2/pass_2/for-dramv/final-viral-combined-for-dramv.fa -v VS2/Virsorter2/pass_2/for-dramv/viral-affi-contigs-for-dramv.tab -o VS2/Virsorter2/dramv-annotate --skip_trnascan --threads 6 --min_contig_size 1000

        DRAM-v.py distill -i VS2/Virsorter2/dramv-annotate/annotations.tsv -o VS2/Virsorter2/dramv-distill

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /mnt/irisgpfs/users/snarayanasamy/repositories/gitlab/LAO_multiomics_CRISPR_iMGEs/.snakemake/log/2021-08-09T200644.023453.snakemake.log

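It looks like Snakemake creates folders based on the paths of all declared output files, not only for outputs declared with directory("path/to/folder"). Because annotations.tsv is declared as an output, its parent folder dramv-annotate already exists when DRAM-v tries to create it.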
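The failure can be reproduced outside of Snakemake and DRAM. The following is a minimal sketch, assuming (as the traceback above suggests) that the parent directory of a declared output file is created before the job runs and that the tool then calls a plain mkdir on the same path. The function names are hypothetical stand-ins and are not part of either code base.

```python
import os
import tempfile


def prepare_parent_dirs(output_file):
    """Hypothetical stand-in for the workflow manager:
    create the parent directory of a declared output file."""
    os.makedirs(os.path.dirname(output_file), exist_ok=True)


def tool_creates_output_dir(output_dir):
    """Hypothetical stand-in for the annotation tool:
    insist on creating the output directory itself."""
    # os.mkdir raises FileExistsError if the path already exists.
    os.mkdir(output_dir)


def reproduce():
    """Return the name of the outcome: 'FileExistsError' or 'ok'."""
    with tempfile.TemporaryDirectory() as workdir:
        output_dir = os.path.join(workdir, "dramv-annotate")
        output_file = os.path.join(output_dir, "annotations.tsv")
        # Step 1: the parent folder of the output file is created up front.
        prepare_parent_dirs(output_file)
        # Step 2: the tool tries to create the same folder again.
        try:
            tool_creates_output_dir(output_dir)
        except FileExistsError:
            return "FileExistsError"
        return "ok"


if __name__ == "__main__":
    print(reproduce())
```

The point of the sketch is only that the order of operations matters: whichever side creates the folder first, a later unconditional mkdir on the same path will fail.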

Therefore, I added rm -rf {wildcards.ts_sample}/Virsorter2/dramv-annotate {wildcards.ts_sample}/Virsorter2/dramv-distill to my rule, just before the DRAM-v.py command, which fixes the issue. I don't really like this solution, though according to this post on Stack Overflow, this is the way one would handle it.
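For illustration, the same workaround can be written in Python. This is only a sketch of the idea (delete any pre-created output folders, then let the tool create them itself); the helper name remove_stale_dirs and the temporary paths are made up for the example, and the actual fix in my rule is the rm -rf shell line described above.

```python
import os
import shutil
import tempfile


def remove_stale_dirs(*dirs):
    """Hypothetical helper: delete each directory if it exists,
    ignoring paths that are missing (comparable to rm -rf)."""
    for path in dirs:
        shutil.rmtree(path, ignore_errors=True)


def demo():
    """Return True if the tool-style mkdir succeeds after the cleanup."""
    with tempfile.TemporaryDirectory() as workdir:
        annotate_dir = os.path.join(workdir, "dramv-annotate")
        distill_dir = os.path.join(workdir, "dramv-distill")
        # Simulate the folder having been pre-created before the job runs.
        os.makedirs(annotate_dir)
        # Workaround: remove both output folders (distill_dir does not
        # exist here, which is silently ignored).
        remove_stale_dirs(annotate_dir, distill_dir)
        # The tool's own unconditional mkdir now succeeds.
        os.mkdir(annotate_dir)
        return os.path.isdir(annotate_dir)


if __name__ == "__main__":
    print(demo())
```

Note that this deletes whatever is in those folders, so it is only reasonable when their contents are regenerated by the rule anyway.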

I can also confirm that it works with the --use-conda parameter.

Let me know if you have any further questions :)