blobtoolkit / pipeline

[Archived] SnakeMake pipeline to run BlobTools on public assemblies
https://blobtoolkit.genomehubs.org
MIT License

[v2.6.0] Windowmasker KeyError in line 6 of unzip_assembly_fasta.smk #13

Closed by kubu4 3 years ago

kubu4 commented 3 years ago

I'm attempting to run the pipeline (on an HPC cluster) and the windowmasker step is throwing the following error message:

Creating specified working directory /gscratch/scrubbed/samwhite/outputs/20210623_pgen_blobtools_Panopea-generosa-v1.0/windowmasker.
The flag 'temp' used in rule run_windowmasker is only valid for outputs, not inputs.
KeyError in line 6 of /gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline/rules/unzip_assembly_fasta.smk:
'file'
  File "/gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline/windowmasker.smk", line 42, in <module>
  File "/gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline/rules/unzip_assembly_fasta.smk", line 6, in <module>

Here's what my script looks like:

#!/bin/bash
## Job Name
#SBATCH --job-name=20210623_pgen_blobtools_Panopea-generosa-v1.0
## Allocation Definition
#SBATCH --account=coenv
#SBATCH --partition=coenv
## Resources
## Nodes
#SBATCH --nodes=1
## Walltime (days-hours:minutes:seconds format)
#SBATCH --time=10-00:00:00
## Memory per node
#SBATCH --mem=200G
##turn on e-mail notification
#SBATCH --mail-type=ALL
#SBATCH --mail-user=samwhite@uw.edu
## Specify the working directory for this job
#SBATCH --chdir=/gscratch/scrubbed/samwhite/outputs/20210623_pgen_blobtools_Panopea-generosa-v1.0

### Script to run the Blobtools2 Pipeline
### on trimmed 10x Genomics/HiC FastQs from 20210401.
### Using to identify sequencing contaminants in Panopea-generosa-v1.0 genome assembly
### Generates a Snakemake config file
### Outputs Blobtools2 JSON files for use in the Blobtools2 viewer

### Utilizes NCBI taxonomy dump and customized UniProt database for DIAMOND BLASTx

### Requires Anaconda to be in system $PATH!

### Follows instructions for release v2.6.0 (https://github.com/blobtoolkit/pipeline/tree/release/v2.6.0)

###################################################################################
# These variables need to be set by user

# Set working directory
wd=$(pwd)

# Set base directory for blobtools structure
base_dir=${wd}/blobtoolkit

# Set number of CPUs to use
threads=40

assembly_name=Panopea_generosa_v1

## New genome name for BTK filename requirements (no periods)
genome_fasta=${wd}/Panopea_generosa_v1.fasta

# Programs
## Blobtools2 directory
blobtools2=/gscratch/srlab/programs/blobtoolkit-v2.6.0/blobtools2

## BTK pipeline directory
btk_pipeline=/gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline

## Name of conda snakemake environment
snakemake_env_name=snakemake_env

## Conda environment directory
conda_dir=/gscratch/srlab/programs/anaconda3/envs/snakemake_env

###################################################################################

# Run snakemake, btk pipeline
conda activate ${snakemake_env_name}

snakemake -p \
--use-conda \
--conda-prefix ${conda_dir} \
--directory ${base_dir} \
--configfile ${wd}/config.yaml \
--stats ${assembly_name}.blobtoolkit.stats \
-j ${threads} \
--rerun-incomplete \
-s ${btk_pipeline}/blobtoolkit.smk \
--resources btk=1

And, here's what the directory structure looks like after the run fails:

.
├── 20210623_pgen_blobtools_Panopea-generosa-v1.0
├── blobtoolkit
│   └── logs
│       ├── minimap
│       │   └── run_sub_pipeline.log
│       └── windowmasker
│           └── run_sub_pipeline.log
├── config.yaml
├── fastq_checksums.md5
├── genome_fasta.md5
├── minimap
├── Panopea_generosa_v1.fasta
├── reads_1.fastq.gz
├── reads_2.fastq.gz
├── slurm-2028128.out
└── windowmasker

And, finally, here's my config file:

[samwhite@mox1 20210623_pgen_blobtools_Panopea-generosa-v1.0]$ cat config.yaml 
assembly:
  accession: draft
  level: scaffold
  scaffold-count: 18
  span: 942353201
  prefix: Panopea_generosa_v1
busco:
  lineage_dir: /gscratch/srlab/sam/data/databases/BUSCO
  lineages:
    - archaea_odb10
    - arthropoda_odb10
    - bacteria_odb10
    - eukaryota_odb10
    - metazoa_odb10
  basal_lineages:
    - archaea_odb10
    - bacteria_odb10
    - eukaryota_odb10
reads:
  paired:
    -
      - reads
      - ILLUMINA
settings:
  blobtools2_path: /gscratch/srlab/programs/blobtoolkit-v2.6.0/blobtools2
  taxdump: /gscratch/srlab/blastdbs/20210401_ncbi_taxonomy
  tmp: /tmp
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 0
  blast_min_length: 1000
similarity:
  defaults:
    evalue: 1.0e-10
    max_target_seqs: 10
    import_evalue: 1.0e-25
    taxrule: buscogenes
  diamond_blastx:
    name: reference_proteomes
    path: /gscratch/srlab/blastdbs/20210401_uniprot_btk
  diamond_blastp:
    name: reference_proteomes
    path: /gscratch/srlab/blastdbs/20210401_uniprot_btk
    import_max_target_seqs: 100000
  blastn:
    name: nt
    path: /gscratch/srlab/blastdbs/20210401_ncbi_nt
taxon:
  name: Panopea generosa
  taxid: '1049056'
keep_intermediates: true

Any idea on what's happening? I'm unable to really glean any info from that error message.

Possibly un/related, I'm also seeing an error from minimap, which I also don't know what to do with:

IndexError in line 19 of /gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline/scripts/functions.py:
list index out of range
  File "/gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline/minimap.smk", line 38, in <module>
  File "/gscratch/srlab/programs/blobtoolkit-v2.6.0/pipeline/scripts/functions.py", line 19, in reads_by_prefix

EDITED: Removed statement about the hidden snakemake directory regarding the minimap error.

rjchallis commented 3 years ago

Sorry about the bug. The windowmasker rule contained some invalid Snakemake syntax that my version of Snakemake had been quietly accepting. I've fixed this in the v2.6.1 release, so it should run as expected with the new version.

The minimap error occurs because the pipeline now needs file paths for the read files (and for the assembly FASTA; see the README for more). These can be specified in the YAML either with object keys or as a list in the order prefix, platform, base_count, filename. For paired reads, the forward and reverse file paths should be separated by a semicolon.
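
As a rough illustration of the list form described above, the sketch below checks that a paired-read entry carries all four fields and splits the filename field on the semicolon. The helper name `parse_paired_entry` and the base count value are hypothetical, and this is not the pipeline's own code. The read filenames are the ones shown in the directory listing earlier in the thread. The two-field entry from the posted config is included to show why an index lookup past the platform field would fail, on the assumption that this is what triggers the IndexError.

```python
# Hypothetical helper for illustration; not part of the BlobToolKit pipeline.
FIELDS = ("prefix", "platform", "base_count", "filename")


def parse_paired_entry(entry):
    """Turn a list-form paired-read entry into a dict, or raise ValueError."""
    if len(entry) < len(FIELDS):
        missing = ", ".join(FIELDS[len(entry):])
        raise ValueError(f"read entry {entry!r} is missing: {missing}")
    record = dict(zip(FIELDS, entry))
    # Paired reads: forward and reverse paths separated by a semicolon.
    files = record["filename"].split(";")
    if len(files) != 2:
        raise ValueError(f"expected 2 files for paired reads, got {len(files)}")
    record["files"] = files
    return record


# Entry as written in the posted config: only prefix and platform.
old_entry = ["reads", "ILLUMINA"]
# Entry in the four-field order; the base count is a made-up placeholder.
new_entry = ["reads", "ILLUMINA", 1000000, "reads_1.fastq.gz;reads_2.fastq.gz"]

try:
    parse_paired_entry(old_entry)
except ValueError as err:
    print(err)

print(parse_paired_entry(new_entry)["files"])
```

Running a check like this on the config before launching Snakemake would surface an incomplete read entry with a readable message instead of a bare IndexError.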

Looking at your config: you no longer need tmp in the settings, and blobtools2 should be in your PATH rather than having its path specified in the YAML. keep_intermediates also no longer does anything. All intermediate files are left in the various sub-pipeline subdirectories for cleanup later, and there is a pipeline script to help with this (https://github.com/blobtoolkit/pipeline/blob/master/scripts/transfer_completed.py). If you use this script, be sure to specify an output directory inside the input directory or the files to keep will be deleted as part of the cleanup.
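
The config cleanup suggested in the reply above could be sketched as follows, working on the config as an already-parsed dictionary. The function name `drop_obsolete` is hypothetical, and the list of obsolete keys is taken only from the reply, so it may not be exhaustive. The example dictionary is an abridged copy of the posted config.

```python
import copy

# Keys the reply says are no longer needed; hypothetical helper, not pipeline code.
OBSOLETE = {
    ("settings", "tmp"),
    ("settings", "blobtools2_path"),
    ("keep_intermediates",),
}


def drop_obsolete(config):
    """Return a cleaned copy of config and the dotted names of removed keys."""
    cleaned = copy.deepcopy(config)
    removed = []
    for path in sorted(OBSOLETE):
        parent = cleaned
        # Walk down to the dictionary that would hold the final key.
        for key in path[:-1]:
            parent = parent.get(key, {})
        if path[-1] in parent:
            del parent[path[-1]]
            removed.append(".".join(path))
    return cleaned, removed


# Abridged copy of the posted config, as a parsed dictionary.
config = {
    "settings": {
        "blobtools2_path": "/gscratch/srlab/programs/blobtoolkit-v2.6.0/blobtools2",
        "taxdump": "/gscratch/srlab/blastdbs/20210401_ncbi_taxonomy",
        "tmp": "/tmp",
    },
    "keep_intermediates": True,
}

cleaned, removed = drop_obsolete(config)
print(removed)
print(cleaned)
```

The original dictionary is left untouched, so the removed keys can be reviewed before the cleaned version is written back out.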

kubu4 commented 3 years ago

Thanks! I'll give v2.6.1 a try!