blobtoolkit / pipeline

[Archived] SnakeMake pipeline to run BlobTools on public assemblies
https://blobtoolkit.genomehubs.org
MIT License

Error: Missing input files for rule run_diamond_blastx_chunks #9

Closed kubu4 closed 3 years ago

kubu4 commented 3 years ago

I'm getting the following error when running the pipeline:

INFO: using default value for 'assembly.alias'
Panopea_generosa_v1
INFO: using default value for 'reads.single'
[]
Building DAG of jobs...
WorkflowError:
MissingInputException: Missing input files for rule run_diamond_blastx_chunks:
nullfile1
nullfile2

Clearly, it is looking for files and not finding them. However, it's not clear which files it's looking for, nor which location it's looking in. I've specified the local databases in the config YAML, so not sure what's happening here. Any suggestions? Thanks!

Here's my config YAML:

$ cat Panopea_generosa_v1.fasta_btk.yaml 
assembly:
  accession: draft
  level: scaffold
  scaffold-count: /gscratch/scrubbed/samwhite/outputs/20210406_pgen_blobtools_Panopea-generosa-v1.0/Panopea_generosa_v1.fasta
  span: 942353201
  prefix: Panopea_generosa_v1
busco:
  lineages:
    - eukaryota_odb9
    - metazoa_odb9
  lineage_dir: /gscratch/srlab/sam/data/databases/BUSCO
reads:
  paired:
    -
      - reads
      - ILLUMINA
settings:
  blobtools2_path: /gscratch/srlab/programs/blobtoolkit/blobtools2
  taxonomy: /gscratch/srlab/blastdbs/20210401_ncbi_taxonomy
  tmp: /tmp
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 500
  chunk: 1000000
similarity:
  defaults:
    evalue: 1e-25
    max_target_seqs: 10
    root: 1
    mask_ids: []
  databases:
    -
      local: /gscratch/srlab/blastdbs/20210401_ncbi_nt
      name: nt
      source: ncbi
      tools: blast
      type: nucl
    -
      local: /gscratch/srlab/blastdbs/20210401_uniprot_btk
      max_target_seqs: 1
      name: reference_proteomes
      source: uniprot
      tools: diamond
      type: prot
  taxrule: bestsumorder
taxon:
  taxid: 1049056
  name: Panopea generosa
keep_intermediates: true

rjchallis commented 3 years ago

Thanks for sharing your config file. I think this error is happening because of a change I made to the way the tax rules are handled. If you change taxrule to eachdistorder, the run_diamond_blastx step should be able to work out the filenames that it needs. This tax rule uses the distribution of hits along long scaffolds and the files are parsed twice, once with the uniprot results having priority and once with the nt results having priority.

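Looking at the rest of the config file I note that:

- you haven't listed any read files
- you have scaffold-count: set to a file path, rather than the number of scaffolds.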

Hope this is enough to get it running. I've been working on making the pipeline more modular, and significantly reducing the time taken by the analysis steps (not quite finished but the new code is in the v2 directory and is mostly working). If this version continues to prove difficult to run I'll put together some notes on how to get the v2 pipeline running.

kubu4 commented 3 years ago

Thanks again for helping. It is very much appreciated!

> you haven't listed any read files so

The FastQ files are reads_1.fastq.gz and reads_2.fastq.gz. Sorry for the confusion there.

> you have scaffold-count: set to a file path, rather than the number of scaffolds.

Gah! Sorry and thanks for noticing that! Parsing error when generating the YAML file...

Will report back after changing taxrule.
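For anyone generating the config by script, the scaffold-count mix-up above can be avoided by computing the values directly from the assembly FASTA. The following is only a minimal sketch: `fasta_stats` is a hypothetical helper, not part of the pipeline, and it assumes that `scaffold-count` means the number of sequence records and `span` means the total number of sequence characters.

```python
import gzip
from pathlib import Path


def fasta_stats(path):
    """Return (scaffold_count, span) for a FASTA file, optionally gzipped.

    Hypothetical helper: assumes scaffold-count is the number of '>' header
    lines and span is the summed length of all sequence lines.
    """
    path = Path(path)
    # Use gzip for .gz files, plain text otherwise.
    opener = gzip.open if path.suffix == ".gz" else open
    count = 0
    span = 0
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                count += 1  # each header line starts a new scaffold
            else:
                span += len(line)  # sequence characters, newline excluded
    return count, span
```

Calling `fasta_stats("Panopea_generosa_v1.fasta")` would return the two integers to write into `scaffold-count` and `span` in the assembly section of the YAML.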

kubu4 commented 3 years ago

Alrighty, I've made the changes and have re-run. I still get the same error, but that error is now preceded by a different error:

INFO: using default value for 'assembly.alias'
Panopea_generosa_v1
INFO: using default value for 'reads.single'
[]
Building DAG of jobs...
WorkflowError:
MissingInputException: Missing input files for rule run_diamond_blastx_chunks:
reference_proteomes.root.1.minus..dmnd
MissingInputException: Missing input files for rule run_diamond_blastx:
nullfile
nullfile2

Here's the config YAML:

assembly:
  accession: draft
  level: scaffold
  scaffold-count: 18
  span: 942353201
  prefix: Panopea_generosa_v1
busco:
  lineages:
    - eukaryota_odb9
    - metazoa_odb9
  lineage_dir: /gscratch/srlab/sam/data/databases/BUSCO
reads:
  paired:
    -
      - reads
      - ILLUMINA
settings:
  blobtools2_path: /gscratch/srlab/programs/blobtoolkit/blobtools2
  taxonomy: /gscratch/srlab/blastdbs/20210401_ncbi_taxonomy
  tmp: /tmp
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 500
  chunk: 1000000
similarity:
  defaults:
    evalue: 1e-25
    max_target_seqs: 10
    root: 1
    mask_ids: []
  databases:
    -
      local: /gscratch/srlab/blastdbs/20210401_ncbi_nt
      name: nt
      source: ncbi
      tools: blast
      type: nucl
    -
      local: /gscratch/srlab/blastdbs/20210401_uniprot_btk
      max_target_seqs: 1
      name: reference_proteomes
      source: uniprot
      tools: diamond
      type: prot
  taxrule: eachdistorder
taxon:
  taxid: 1049056
  name: Panopea generosa
keep_intermediates: true

Since the new error references the customized DIAMOND BLAST database, here are the contents of that directory (which is specified in the YAML above):

$ ls -ltrh /gscratch/srlab/blastdbs/20210401_uniprot_btk
total 154G
-rw-r--r--    1 samwhite hyak-srlab 1.5M Feb  8 07:52 README
drwxr-sr-x  327 samwhite hyak-srlab  32K Feb  8 09:10 Archaea
drwxr-sr-x 7945 samwhite hyak-srlab 256K Feb  8 09:56 Bacteria
drwxr-sr-x 1556 samwhite hyak-srlab  64K Feb  8 10:01 Eukaryota
drwxr-sr-x 9865 samwhite hyak-srlab 512K Feb  8 11:29 Viruses
-rw-r--r--    1 samwhite hyak-srlab 113G Feb 10 14:31 reference_proteomes.tar.gz
-rw-r--r--    1 samwhite hyak-srlab  13G Apr  5 10:32 reference_proteomes.fasta.gz
-rw-r--r--    1 samwhite hyak-srlab 1.6G Apr  5 10:42 reference_proteomes.taxid_map
-rw-r--r--    1 samwhite hyak-srlab  27G Apr  5 11:42 reference_proteomes.dmnd

The database was created manually according to the blobtools installation directions.
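A quick way to confirm that the base database files from the listing above are where the config says they are is sketched below. `check_diamond_db` is a hypothetical helper, and the suffix list is taken only from the directory listing in this thread. It does not cover the masked filename in the error (`reference_proteomes.root.1.minus..dmnd`), which the pipeline appears to derive itself.

```python
from pathlib import Path


def check_diamond_db(local_dir, name, suffixes=(".dmnd", ".taxid_map")):
    """Map each expected file name to whether it exists in local_dir.

    Hypothetical helper: the suffixes are an assumption based on the
    files shown in the directory listing, not on the pipeline's code.
    """
    base = Path(local_dir)
    return {
        f"{name}{suffix}": (base / f"{name}{suffix}").is_file()
        for suffix in suffixes
    }
```

With the config in this thread, the call would be `check_diamond_db("/gscratch/srlab/blastdbs/20210401_uniprot_btk", "reference_proteomes")`; any `False` value in the result points at a file that is missing from the `local` directory.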

Thank you again for your time (and patience!) helping with this.

rjchallis commented 3 years ago

Hi

Sorry for the delay getting back to you. I've been implementing the v2 pipeline and now have a working version in the release/v2.5.0 branch. The new version should be much faster and hopefully easier to configure, so rather than trying to debug the older pipeline, it may be best to try this version. The config file looks slightly different, but should mostly carry across from the version you have.

This is very new, so there may be some rough edges, but @sujaikumar (also working on the project) is starting to test and document the new pipeline, so hopefully we can help you get this running soon.

kubu4 commented 3 years ago

Interesting! Thanks for the heads up. I'll give it a shot and see how it goes.

Should I just close this issue, as v1 of the pipeline will now be "deprecated"?

rjchallis commented 3 years ago

Thanks for trying it out. Yes, I'll close this one. Will be interesting to know how far you can get with v2 - feel free to create a new issue once you reach the limits of the docs in the README.