blobtoolkit / blobtoolkit

Interactive quality assessment of genome assemblies
http://blobtoolkit.genomehubs.org
MIT License

Pipeline Fails blastn Error #66

Open biomobot opened 2 years ago

biomobot commented 2 years ago

I'm trying to run the blobtoolkit snakemake pipeline, and I get an error every time it reaches the blastn step. I followed these instructions to download the databases: https://github.com/blobtoolkit/pipeline#databases Thank you in advance for your time and help.

This is the terminal output when running the blobtoolkit pipeline with snakemake:

snakemake -p -j $THREADS --directory $DATA_DIR/$ACCESSION/$TOOL --configfile $DATA_DIR/$ACCESSION/config.yaml --latency-wait 60 --stats $DATA_DIR/$ACCESSION/$TOOL.stats -s $SNAKE_DIR/$TOOL.smk
Building DAG of jobs...
The code used to generate one or several output files has changed:
    To inspect which output files have changes, run 'snakemake --list-code-changes'.
    To trigger a re-run, use 'snakemake -R $(snakemake --list-code-changes)'.
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                 count    min threads    max threads
all                     1              1              1
run_sub_pipeline        4              1             20
total                   5              1             20

Select jobs to execute...

[Wed Apr 20 11:50:06 2022]
rule run_sub_pipeline:
    input: /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/diamond.stats, /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/windowmasker.stats
    output: /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn.stats
    log: logs/blastn/run_sub_pipeline.log
    jobid: 1
    benchmark: logs/blastn/run_sub_pipeline.benchmark.txt
    wildcards: tool=blastn
    threads: 20
    resources: tmpdir=/tmp

snakemake -p -j 20 --directory /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn --configfile /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/config.yaml --latency-wait 60 --stats /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn.stats --restart-times 0 -s /home/mbio/blobtoolkit/insdc-pipeline/blastn.smk 2> logs/blastn/run_sub_pipeline.log
[Wed Apr 20 11:56:43 2022]
Error in rule run_sub_pipeline:
    jobid: 1
    output: /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn.stats
    log: logs/blastn/run_sub_pipeline.log (check log file(s) for error message)
    shell:
        snakemake -p -j 20 --directory /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn --configfile /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/config.yaml --latency-wait 60 --stats /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn.stats --restart-times 0 -s /home/mbio/blobtoolkit/insdc-pipeline/blastn.smk 2> logs/blastn/run_sub_pipeline.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The code used to generate one or several output files has changed:
    To inspect which output files have changes, run 'snakemake --list-code-changes'.
    To trigger a re-run, use 'snakemake -R $(snakemake --list-code-changes)'.

This is the snakemake.log file:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 20
Rules claiming more threads will be scaled down.
Job stats:
job                    count    min threads    max threads
all                        1              1              1
chunk_nohit_fasta          1              1              1
extract_nohit_fasta        1              4              4
run_blastn                 1             20             20
unchunk_blastn             1              1              1
total                      5              1             20

Select jobs to execute...

[Wed Apr 20 11:50:07 2022]
rule extract_nohit_fasta:
    input: ../diamond/run1_ral.diamond.reference_proteomes.out, ../windowmasker/run1_ral.windowmasker.fasta
    output: run1_ral.nohit.fasta
    log: logs/run1_ral/extract_nohit_fasta.log
    jobid: 4
    benchmark: logs/run1_ral/extract_nohit_fasta.benchmark.txt
    wildcards: assembly=run1_ral
    threads: 4
    resources: tmpdir=/tmp

seqtk subseq ../windowmasker/run1_ral.windowmasker.fasta <(grep '>' ../windowmasker/run1_ral.windowmasker.fasta | grep -v -w -f <(awk '{if($14<1e-25){print $1}}' ../diamond/run1_ral.diamond.reference_proteomes.out | sort | uniq) | cut -f1 | sed 's/>//') > run1_ral.nohit.fasta
[Wed Apr 20 11:50:07 2022]
Finished job 4.
1 of 5 steps (20%) done
Select jobs to execute...

[Wed Apr 20 11:50:07 2022]
rule chunk_nohit_fasta:
    input: run1_ral.nohit.fasta
    output: run1_ral.nohit.fasta.chunks
    log: logs/run1_ral/chunk_fasta.log
    jobid: 3
    benchmark: logs/run1_ral/chunk_fasta.benchmark.txt
    wildcards: assembly=run1_ral
    resources: tmpdir=/tmp

/home/mbio/miniconda3/envs/btk_env/bin/python3.8 /media/mbio/SATA_SSD/AP/Ralstonia_run_1/Ralstonia_run_1_analysis/blob_run1_ral_data_dir/blastn/.snakemake/scripts/tmplqv_k0q2.chunk_fasta.py
[Wed Apr 20 11:50:07 2022]
Finished job 3.
2 of 5 steps (40%) done
Select jobs to execute...

[Wed Apr 20 11:50:07 2022]
rule run_blastn:
    input: run1_ral.nohit.fasta.chunks, /home/mbio/databases/nt_2021_06/nt.nal
    output: run1_ral.blastn.nt.out.raw
    log: logs/run1_ral/run_blastn.log
    jobid: 2
    benchmark: logs/run1_ral/run_blastn.benchmark.txt
    wildcards: assembly=run1_ral
    threads: 20
    resources: tmpdir=/tmp

if [ -s run1_ral.nohit.fasta.chunks ]; then blastn -task megablast -query run1_ral.nohit.fasta.chunks -db /home/mbio/databases/nt_2021_06/nt -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-10 -num_threads 20 -negative_taxids 190721 -lcase_masking -dust "20 64 1" > run1_ral.blastn.nt.out.raw 2> logs/run1_ral/run_blastn.log || sleep 30; if [ -s logs/run1_ral/run_blastn.log ]; then echo "Restarting blastn without taxid filter" >> logs/run1_ral/run_blastn.log; > run1_ral.blastn.nt.out.raw; blastn -task megablast -query run1_ral.nohit.fasta.chunks -db /home/mbio/databases/nt_2021_06/nt -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-10 -num_threads 20 -lcase_masking -dust "20 64 1" > run1_ral.blastn.nt.out.raw 2>> logs/run1_ral/run_blastn.log; fi else > run1_ral.blastn.nt.out.raw; fi
[Wed Apr 20 11:56:42 2022]
Error in rule run_blastn:
    jobid: 2
    output: run1_ral.blastn.nt.out.raw
    log: logs/run1_ral/run_blastn.log (check log file(s) for error message)
    shell:
        if [ -s run1_ral.nohit.fasta.chunks ]; then blastn -task megablast -query run1_ral.nohit.fasta.chunks -db /home/mbio/databases/nt_2021_06/nt -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-10 -num_threads 20 -negative_taxids 190721 -lcase_masking -dust "20 64 1" > run1_ral.blastn.nt.out.raw 2> logs/run1_ral/run_blastn.log || sleep 30; if [ -s logs/run1_ral/run_blastn.log ]; then echo "Restarting blastn without taxid filter" >> logs/run1_ral/run_blastn.log; > run1_ral.blastn.nt.out.raw; blastn -task megablast -query run1_ral.nohit.fasta.chunks -db /home/mbio/databases/nt_2021_06/nt -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-10 -num_threads 20 -lcase_masking -dust "20 64 1" > run1_ral.blastn.nt.out.raw 2>> logs/run1_ral/run_blastn.log; fi else > run1_ral.blastn.nt.out.raw; fi
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job run_blastn since they might be corrupted:
run1_ral.blastn.nt.out.raw
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-04-20T115007.034832.snakemake.log

This is the run.blastn.log:

Error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1615544030046/work/blast/c++/src/serial/objistrasnb.cpp", line 499: Error: (CSerialException::eOverflow) byte 66: overflow error ( at [].[].gi)
    T0 "/opt/conda/conda-bld/blast_1615544030046/work/blast/c++/src/serial/member.cpp", line 768: Error: (CSerialException::eOverflow) ncbi::CMemberInfoFunctions::ReadWithSetFlagMember() - error while reading seqid ( at Blast-def-line-set.[].[].seqid.[].[].gi)

Restarting blastn without taxid filter
Error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1615544030046/work/blast/c++/src/serial/objistrasnb.cpp", line 499: Error: (CSerialException::eOverflow) byte 88: overflow error ( at [].[].gi)
    T0 "/opt/conda/conda-bld/blast_1615544030046/work/blast/c++/src/serial/member.cpp", line 768: Error: (CSerialException::eOverflow) ncbi::CMemberInfoFunctions::ReadWithSetFlagMember() - error while reading seqid ( at Blast-def-line-set.[].[].seqid.[].[].gi)

rjchallis commented 2 years ago

From the run.blastn.log, this looks like it is a sequence ID parsing error in blastn. I've seen this when sequence IDs had pipes but didn't use the standard NCBI conventions. Do you have pipes (or any other special characters) in your sequence headers?
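
One way to answer the question about sequence headers is to scan the FASTA file directly. The following is a minimal Python sketch, not part of the blobtoolkit pipeline: the function names `suspicious_headers` and `scan_fasta` are hypothetical, and the set of "safe" characters (letters, digits, underscore, dot, hyphen) is an assumption chosen for illustration rather than a rule taken from blastn.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumption: an ID is "safe" if it uses only letters, digits, '_', '.' and '-'.
# This is an illustrative choice, not a rule defined by blastn.
SAFE_ID = re.compile(r"^[A-Za-z0-9_.\-]+$")


def suspicious_headers(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, reason) for FASTA headers worth a closer look."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        # The sequence ID is the first whitespace-delimited token after '>'.
        seq_id = fields[0] if fields else ""
        if not seq_id:
            yield seq_id, "empty ID"
        elif "|" in seq_id:
            yield seq_id, "contains pipe"
        elif not SAFE_ID.match(seq_id):
            yield seq_id, "contains special characters"


def scan_fasta(path: str) -> None:
    """Print every suspicious header found in the FASTA file at `path`."""
    with open(path) as handle:
        for seq_id, reason in suspicious_headers(handle):
            print(f"{seq_id}\t{reason}")
```

Calling `scan_fasta("run1_ral.nohit.fasta")` from the blastn working directory would list any IDs that contain pipes or other unusual characters; an empty result suggests the headers are not the cause and the problem lies elsewhere.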