Closed kubu4 closed 3 years ago
Thanks for sharing your config file. I think this error is happening because of a change I made to the way the tax rules are handled. If you change taxrule to eachdistorder, the run_diamond_blastx step should be able to work out the filenames that it needs. This tax rule uses the distribution of hits along long scaffolds and the files are parsed twice, once with the uniprot results having priority and once with the nt results having priority.
Looking at the rest of the config file I note that:
you have scaffold-count: set to a file path, rather than the number of scaffolds.
Hope this is enough to get it running. I've been working on making the pipeline more modular and significantly reducing the time taken by the analysis steps (not quite finished, but the new code is in the v2 directory and is mostly working). If this version continues to prove difficult to run I'll put together some notes on how to get the v2 pipeline running.
Thanks again for helping. It is very much appreciated!
you haven't listed any read files so
The FastQ files are reads_1.fastq.gz and reads_2.fastq.gz. Sorry for the confusion there.
you have scaffold-count: set to a file path, rather than the number of scaffolds.
Gah! Sorry and thanks for noticing that! Parsing error when generating the YAML file...
Will report back after changing taxrule.
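For anyone hitting the same two config problems (scaffold-count holding a file path, and no read files listed), a small pre-flight check along these lines might catch them before Snakemake starts. This is only a sketch and not part of the pipeline: check_config is a hypothetical helper, the config is assumed to be already loaded into a plain dict, and the file path in the bad example is made up for illustration.

```python
def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []

    # scaffold-count should be the number of scaffolds, not a path.
    count = config.get("assembly", {}).get("scaffold-count")
    # bool is a subclass of int in Python, so rule it out explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        problems.append(
            f"assembly.scaffold-count should be a positive integer, got {count!r}"
        )

    # At least one read set should be listed under reads.paired or reads.single.
    reads = config.get("reads") or {}
    if not any(reads.get(kind) for kind in ("paired", "single")):
        problems.append("no read files listed under reads.paired or reads.single")

    return problems


# Hypothetical "before" config: the path below is invented for illustration.
bad = {"assembly": {"scaffold-count": "/path/to/assembly.fasta"}, "reads": {}}
# "After" config, using the values posted later in this thread.
good = {
    "assembly": {"scaffold-count": 18},
    "reads": {"paired": [["reads", "ILLUMINA"]]},
}

for label, cfg in (("bad", bad), ("good", good)):
    print(label, check_config(cfg))
```

The bad example reports both problems and the good one returns an empty list. The pipeline itself may well validate these fields differently.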
Alrighty, I've made the changes and have re-run. I still get the same error, but that error is now preceded by a different error:
INFO: using default value for 'assembly.alias'
Panopea_generosa_v1
INFO: using default value for 'reads.single'
[]
Building DAG of jobs...
WorkflowError:
MissingInputException: Missing input files for rule run_diamond_blastx_chunks:
reference_proteomes.root.1.minus..dmnd
MissingInputException: Missing input files for rule run_diamond_blastx:
nullfile
nullfile2
Here's the config YAML:
assembly:
  accession: draft
  level: scaffold
  scaffold-count: 18
  span: 942353201
  prefix: Panopea_generosa_v1
busco:
  lineages:
    - eukaryota_odb9
    - metazoa_odb9
  lineage_dir: /gscratch/srlab/sam/data/databases/BUSCO
reads:
  paired:
    -
      - reads
      - ILLUMINA
settings:
  blobtools2_path: /gscratch/srlab/programs/blobtoolkit/blobtools2
  taxonomy: /gscratch/srlab/blastdbs/20210401_ncbi_taxonomy
  tmp: /tmp
  blast_chunk: 100000
  blast_max_chunks: 10
  blast_overlap: 500
  chunk: 1000000
similarity:
  defaults:
    evalue: 1e-25
    max_target_seqs: 10
    root: 1
    mask_ids: []
  databases:
    -
      local: /gscratch/srlab/blastdbs/20210401_ncbi_nt
      name: nt
      source: ncbi
      tools: blast
      type: nucl
    -
      local: /gscratch/srlab/blastdbs/20210401_uniprot_btk
      max_target_seqs: 1
      name: reference_proteomes
      source: uniprot
      tools: diamond
      type: prot
  taxrule: eachdistorder
taxon:
  taxid: 1049056
  name: Panopea generosa
keep_intermediates: true
Since the new error is referencing the customized DIAMOND BLAST database, here're the contents of that directory (which is specified in the YAML above):
$ ls -ltrh /gscratch/srlab/blastdbs/20210401_uniprot_btk
total 154G
-rw-r--r-- 1 samwhite hyak-srlab 1.5M Feb 8 07:52 README
drwxr-sr-x 327 samwhite hyak-srlab 32K Feb 8 09:10 Archaea
drwxr-sr-x 7945 samwhite hyak-srlab 256K Feb 8 09:56 Bacteria
drwxr-sr-x 1556 samwhite hyak-srlab 64K Feb 8 10:01 Eukaryota
drwxr-sr-x 9865 samwhite hyak-srlab 512K Feb 8 11:29 Viruses
-rw-r--r-- 1 samwhite hyak-srlab 113G Feb 10 14:31 reference_proteomes.tar.gz
-rw-r--r-- 1 samwhite hyak-srlab 13G Apr 5 10:32 reference_proteomes.fasta.gz
-rw-r--r-- 1 samwhite hyak-srlab 1.6G Apr 5 10:42 reference_proteomes.taxid_map
-rw-r--r-- 1 samwhite hyak-srlab 27G Apr 5 11:42 reference_proteomes.dmnd
The database was created manually according to the blobtools installation directions.
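To make the mismatch concrete, here is a rough sketch comparing the file named in the MissingInputException with the names in the directory listing. Both are copied by hand from this post rather than read from disk. The helper names are hypothetical, and the "empty field" check is only a guess at why the requested name contains a double dot (for example, an empty value being substituted into a filename template); nothing in the thread confirms that.

```python
from pathlib import PurePosixPath

# Names copied from the `ls` listing above.
present = [
    "README", "Archaea", "Bacteria", "Eukaryota", "Viruses",
    "reference_proteomes.tar.gz",
    "reference_proteomes.fasta.gz",
    "reference_proteomes.taxid_map",
    "reference_proteomes.dmnd",
]

# Name copied from the MissingInputException for run_diamond_blastx_chunks.
wanted = "reference_proteomes.root.1.minus..dmnd"


def diamond_dbs(names):
    """Return the names that end in the .dmnd extension."""
    return sorted(n for n in names if PurePosixPath(n).suffix == ".dmnd")


def has_empty_field(name):
    """True if two dots are adjacent, i.e. a dot-separated field is empty."""
    return "" in name.split(".")


print(wanted in present)        # False
print(diamond_dbs(present))     # ['reference_proteomes.dmnd']
print(has_empty_field(wanted))  # True
```

So the only DIAMOND database in that directory is reference_proteomes.dmnd, while the rule is asking for a differently named file that is not in the listing.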
Thank you again for your time (and patience!) helping with this.
Hi
Sorry for the delay getting back to you. I've been implementing the v2 pipeline and now have a working version in the release/v2.5.0 branch. The new version should be much faster and hopefully easier to configure, so rather than trying to debug the older pipeline it may be best to try this version. The config file looks slightly different, but most settings should carry across from the version you have.
This is very new so there may be some rough edges but @sujaikumar (also working on the project) is starting to test and document the new pipeline so hopefully we can help you get this running soon.
Interesting! Thanks for the heads up. I'll give it a shot and see how it goes.
Should I just close this issue, as v1 of the pipeline will now be "deprecated"?
Thanks for trying it out. Yes, I'll close this one. Will be interesting to know how far you can get with v2 - feel free to create a new issue once you reach the limits of the docs in the README.
I'm getting the following error when running the pipeline:
Clearly, it is looking for files and not finding them. However, it's not clear which files it's looking for, nor which location it's looking in. I've specified the local databases in the config YAML, so not sure what's happening here. Any suggestions? Thanks!
Here's my config YAML: