blobtoolkit / pipeline

[Archived] SnakeMake pipeline to run BlobTools on public assemblies
https://blobtoolkit.genomehubs.org
MIT License

KeyError: run_busco_v5 #15

Closed by jgnunes 3 years ago

jgnunes commented 3 years ago

While running the pipeline (v2.6.1) on a cluster, I got the following error at the BUSCO step:

Creating specified working directory /lustre/scratch116/vr/projects/vgp/user/jf18/blobtoolkit/data/limnoperna_fortunei_LF6/busco.
Building DAG of jobs...
Traceback (most recent call last):
    File "/software/team311/jf18/miniconda3/envs/btk_env_2/lib/python3.8/site-packages/snakemake/__init__.py", line 701, in snakemake
        success = workflow.execute(
    File "/software/team311/jf18/miniconda3/envs/btk_env_2/lib/python3.8/site-packages/snakemake/workflow.py", line 730, in execute
        dag.init()
    File "/software/team311/jf18/miniconda3/envs/btk_env_2/lib/python3.8/site-packages/snakemake/dag.py", line 190, in init
        self.cleanup()
    File "/software/team311/jf18/miniconda3/envs/btk_env_2/lib/python3.8/site-packages/snakemake/dag.py", line 258, in cleanup
        del self.depending[dep][job]
KeyError: run_busco_v5

I started the pipeline with the following command:

echo "snakemake -p -j $THREADS --directory $DATA_DIR/$ACCESSION/$TOOL --configfile $DATA_DIR/$ACCESSION/config.yaml --latency-wait 60 --stats $DATA_DIR/$ACCESSION/$TOOL.stats -s $SNAKE_DIR/$TOOL.smk" | bsub -n 32 -q basement -R"span[hosts=1] select[mem>70000] rusage[mem=70000]" -M70000 -P team311 -o btk_golmus.%J -e er.btk_golmus.%J

And this is my current tree of files after the failed run:

.
├── assembly
│   └── lf6.discovar.fasta.gz
├── blobtoolkit
│   └── logs
│       ├── busco
│       │   └── run_sub_pipeline.log
│       ├── chunk_stats
│       │   ├── run_sub_pipeline.benchmark.txt
│       │   └── run_sub_pipeline.log
│       ├── minimap
│       │   ├── run_sub_pipeline.benchmark.txt
│       │   └── run_sub_pipeline.log
│       └── windowmasker
│           ├── run_sub_pipeline.benchmark.txt
│           └── run_sub_pipeline.log
├── btk.lfor.LF6.946043
├── busco
├── chunk_stats
│   ├── limnoperna_fortunei_LF6.chunk_stats.mask.bed
│   ├── limnoperna_fortunei_LF6.chunk_stats.tsv
│   └── logs
│       └── limnoperna_fortunei_LF6
│           ├── get_seq_stats.benchmark.txt
│           └── get_seq_stats.log
├── chunk_stats.stats
├── config.yaml
├── er.btk.lfor.LF6.946043
├── minimap
│   ├── limnoperna_fortunei_LF6.LF6-A_GTGAAA_L001.bam
│   ├── limnoperna_fortunei_LF6.LF6-A_GTGAAA_L001.bam.csi
│   ├── limnoperna_fortunei_LF6.sr.mmi
│   └── logs
│       └── limnoperna_fortunei_LF6
│           ├── run_minimap2_align
│           │   ├── LF6-A_GTGAAA_L001.benchmark.txt
│           │   └── LF6-A_GTGAAA_L001.log
│           └── run_minimap2_index
│               ├── sr.benchmark.txt
│               └── sr.log
├── minimap.stats
├── reads
│   ├── LF6-A_GTGAAA_L001_R1_001.fastq.gz
│   └── LF6-A_GTGAAA_L001_R2_001.fastq.gz
├── windowmasker
│   ├── limnoperna_fortunei_LF6.windowmasker.counts
│   ├── limnoperna_fortunei_LF6.windowmasker.fasta
│   └── logs
│       └── limnoperna_fortunei_LF6
│           ├── run_windowmasker.benchmark.txt
│           ├── run_windowmasker.log
│           ├── unzip_assembly_fasta.benchmark.txt
│           └── unzip_assembly_fasta.log
└── windowmasker.stats

Any idea what may be happening here? Let me know if you need any further information.

rjchallis commented 3 years ago

I haven't seen this error before, but it may be happening because the pipeline now requires a full BUSCO directory setup in order to run in offline mode (previously only the lineages were needed). If you haven't already, use something like this to set it up:

BUSCO=/volumes/databases/busco_2021_06
cd $BUSCO
wget -r https://busco-data.ezlab.org/v5/data
find busco-data.ezlab.org -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
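Once the download has been unpacked, a quick check that the configured lineages are actually on disk can save a failed run. The following is only a minimal sketch, not part of the pipeline: the helper name `missing_lineages` is hypothetical, and it assumes each lineage tarball unpacks into a directory named after the lineage (for example `mollusca_odb10`) somewhere below the download directory. It does not verify the rest of the BUSCO directory layout.

```python
from pathlib import Path


def missing_lineages(download_dir, lineages):
    """Return the lineage names that have no matching directory below download_dir.

    Assumption: every unpacked lineage lives in a directory whose name equals
    the lineage name. The search is recursive, so the exact nesting created by
    `wget -r` does not matter.
    """
    root = Path(download_dir)
    if not root.is_dir():
        # Nothing can be found under a directory that does not exist.
        return list(lineages)
    missing = []
    for name in lineages:
        if not any(path.is_dir() for path in root.rglob(name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Example values; substitute the download_dir and lineages from your own config.yaml.
    busco_dir = "/volumes/databases/busco_2021_06"
    wanted = ["mollusca_odb10", "eukaryota_odb10"]
    print("missing lineages:", missing_lineages(busco_dir, wanted))
```

An empty list means every requested lineage directory was found. A recursive search over a large database directory can take a while on a network filesystem.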

The actual error is a little confusing, as Snakemake should be catching KeyErrors in the cleanup function. I've been developing this with Snakemake v6.0.5, so if you have an older version, that could also be contributing to this issue.
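For readers unfamiliar with this kind of failure, the last line of the traceback (`del self.depending[dep][job]`) is a delete on a nested dictionary, which raises `KeyError` when the key is not present. The snippet below is an illustration of that general Python pattern and of what "catching" the error looks like. It is not Snakemake's code; `drop_edge` and the dictionary contents are hypothetical.

```python
def drop_edge(depending, dep, job):
    """Delete depending[dep][job]; return False if the entry was already absent.

    Both lookups can raise KeyError: the outer one if `dep` is missing and
    the inner delete if `job` is missing. One except clause covers both.
    """
    try:
        del depending[dep][job]
    except KeyError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical dependency table with string keys standing in for job objects.
    depending = {"some_dependency": {"run_busco_v5": True}}
    print(drop_edge(depending, "some_dependency", "run_busco_v5"))  # entry present
    print(drop_edge(depending, "some_dependency", "run_busco_v5"))  # entry already removed
```

Without the `try`/`except`, the second call would stop with a `KeyError` naming the missing key, which is the shape of the error reported above.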

jgnunes commented 3 years ago

In fact, I was using an incomplete BUSCO directory. I then downloaded the complete version and tried to restart the pipeline in the same working directory (so that I didn't need to rerun the previous steps), but I'm still getting the same error. I'm using Snakemake v6.4.1.

rjchallis commented 3 years ago

Could you post your config.yaml in case there is anything about the busco section that could be causing this? I'm still a little confused, as it doesn't look like the usual errors caused by config problems, and the rule isn't even starting, so that seems to rule out problems with running BUSCO. Could you try running this with Snakemake 6.0.5 to test if that makes a difference?

jgnunes commented 3 years ago

This is my config.yaml:

assembly:
  file: /lustre/scratch116/vr/projects/vgp/user/jf18/blobtoolkit/data/limnoperna_fortunei_LF6/assembly/lf6.discovar.fasta.gz 
  prefix: limnoperna_fortunei_LF6
busco:
  download_dir: /lustre/scratch116/vr/projects/vgp/user/jf18/blobtoolkit/databases/busco_2021_06
  lineages:
    - mollusca_odb10
    - eukaryota_odb10
  basal_lineages:
    - eukaryota_odb10
reads:
  paired:
    - prefix: LF6-A_GTGAAA_L001 
      platform: ILLUMINA
      file: /lustre/scratch116/vr/projects/vgp/user/jf18/blobtoolkit/data/limnoperna_fortunei_LF6/reads/LF6-A_GTGAAA_L001_R1_001.fastq.gz;/lustre/scratch116/vr/projects/vgp/user/jf18/blobtoolkit/data/limnoperna_fortunei_LF6/reads/LF6-A_GTGAAA_L001_R2_001.fastq.gz 
revision: 0
settings:
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 0
  blast_min_length: 1000
  taxdump: /software/grit/projects/btk/blobplot_db/taxonomy 
  tmp: /tmp
similarity:
  defaults:
    evalue: 1.0e-10
    import_evalue: 1.0e-25
    max_target_seqs: 10
    taxrule: bestdistorder
  diamond_blastx:
    name: reference_proteomes
    path: /software/grit/projects/btk/blobplot_db/uniprot_2019_02 
  diamond_blastp:
    name: reference_proteomes
    path: /software/grit/projects/btk/blobplot_db/uniprot_2019_02
    import_max_target_seqs: 100000
  blastn:
    name: nt
    path: /software/grit/projects/btk/blobplot_db/ncbi_2019_08
taxon:
  name: Limnoperna fortunei
  taxid: '356393'
version: 1

Sure, I will try re-running the pipeline with Snakemake 6.0.5 and let you know how it goes.

jgnunes commented 3 years ago

I've created a new conda environment with Snakemake v6.0.5 and the error is gone (the BUSCO step is now running properly). However, I don't think this is an incompatibility with v6.4.1, because I just realized I had already run blobtools on my local machine with v6.4.1 without any problems with BUSCO. My guess is that the issue was caused by a problem during the conda environment setup, which the fresh environment installation resolved.
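When comparing environments like this, it can help to confirm which Snakemake version the active environment really provides, rather than relying on what was requested at install time. A minimal sketch using only the standard library follows; the helper names are hypothetical, and 6.0.5 is used as the reference only because it is the version the pipeline was developed with, according to this thread.

```python
from importlib import metadata
from itertools import takewhile


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn a dotted version such as '6.4.1' into (6, 4, 1) for comparison.

    Only the leading digits of each dot-separated piece are used, and parsing
    stops at the first piece that does not start with a digit.
    """
    parts = []
    for piece in text.strip().split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


if __name__ == "__main__":
    tested = "6.0.5"  # version mentioned in this thread as the development version
    found = installed_version("snakemake")
    if found is None:
        print("snakemake is not installed in this environment")
    elif version_tuple(found) == version_tuple(tested):
        print("snakemake", found, "matches the tested version")
    else:
        print("snakemake", found, "differs from the tested version", tested)
```

Run it with the Python interpreter of the environment in question. A matching version does not rule out a broken environment, which is what the thread ultimately suspects, but it removes one variable.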

Anyway, thanks for the help! I'm closing this issue.