mmseqs2 prefilter failed: No k-mer could be extracted for the database genomad_db/genomad_db

haleyhallowell commented 1 year ago

Hello! I am currently trying to utilize the genomad annotate module to annotate a .fna file of Megahit assembled contigs. After downloading the database and attaching a unique identifier to each .fna headline (because they all started with k127), i ran the following command:

genomad annotate final_vOTUs_numbered.fna ./genomad_output ./genomad_db

I get this error directly from genomad:

Traceback (most recent call last): File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 137, in run_mmseqs2 subprocess.run(command, stdout=fout, stderr=fout, check=True) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/subprocess.py", line 526, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['mmseqs', 'search', PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db'), PosixPath('genomad_db/genomad_db'), PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db'), PosixPath('genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp'), '--threads', '48', '-s', '4.2', '--cov-mode', '1', '-c', '0.2', '-e', '0.001', '--split', '0', '--split-mode', '0', '--max-seqs', '1000000', '--min-ungapped-score', '20', '--max-rejected', '225']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/hhallow1/.conda/envs/genomad/bin/genomad", line 8, in sys.exit(cli()) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1130, in call return self.main(*args, kwargs) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/rich_click/rich_group.py", line 21, in main rv = super().main(args, standalone_mode=False, kwargs) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1657, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/click/core.py", line 760, in invoke return __callback(args, kwargs) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/cli.py", line 425, in annotate genomad.annotate.main( File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/modules/annotate.py", line 202, in main mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, splits) File "/home/hhallow1/.conda/envs/genomad/lib/python3.10/site-packages/genomad/mmseqs2.py", line 140, in run_mmseqs2 raise Exception(f"'{command_str}' failed.") from e Exception: 'mmseqs search genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/search_db/search_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp --threads 48 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 0 --split-mode 0 --max-seqs 1000000 --min-ungapped-score 20 --max-rejected 225' failed.

Here is the output from mmseqs2:

Converting sequences [===== Time for merging to query_db_h: 0h 0m 0s 79ms Time for merging to query_db: 0h 0m 0s 31ms Database type: Aminoacid Time for processing: 0h 0m 1s 380ms search genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_m mseqs2/search_db/search_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp --threads 48 -s 4.2 --cov-mode 1 -c 0.2 -e 0.001 --split 0 --split-mode 0 --max-seq s 1000000 --min-ungapped-score 20 --max-rejected 225

MMseqs Version: 13.45111 Substitution matrix nucl:nucleotide.out,aa:blosum62.out Add backtrace false Alignment mode 2 Alignment mode 0 Allow wrapped scoring false E-value threshold 0.001 Seq. id. threshold 0 Min alignment length 0 Seq. id. mode 0 Alternative alignments 0 Coverage threshold 0.2 Coverage mode 1 Max sequence length 65535 Compositional bias 1 Max reject 225 Max accept 2147483647 Include identical seq. id. false Preload mode 0 Pseudo count a 1 Pseudo count b 1.5 Score bias 0 Realign hits false Realign score bias -0.2 Realign max seqs 2147483647 Gap open cost nucl:5,aa:11 Gap extension cost nucl:2,aa:1 Zdrop 40 Threads 48 Compressed 0 Verbosity 3 Seed substitution matrix nucl:nucleotide.out,aa:VTML80.out Sensitivity 4.2 k-mer length 5 k-score 2147483647 Alphabet size nucl:5,aa:21 Max results per query 1000000 Split database 0 Split mode 0 Split memory limit 0 Diagonal scoring true Exact k-mer matching 0 Mask residues 1 Mask lower case residues 0 Minimum diagonal score 20 Spaced k-mers 1 Spaced k-mer pattern
Local temporary path
Rescore mode 0 Remove hits by seq. id. and coverage false Sort results 0 Mask profile 1 Profile E-value threshold 0.1 Global sequence weighting false Allow deletions false Filter MSA 1 Maximum seq. id. threshold 0.9 Minimum seq. id. 0 Minimum score per column -20 Minimum coverage 0 Select N most diverse seqs 1000 Min codons in orf 30 Max codons in length 32734 Max orf gaps 2147483647 Contig start mode 2 Contig end mode 2 Orf start mode 1 Forward frames 1,2,3 Reverse frames 1,2,3 Translation table 1 Translate orf 0 Use all table starts false Offset of numeric ids 0 Create lookup 0 Add orf stop false Overlap between sequences 0 Sequence split mode 1 Header split mode 0 Chain overlapping alignments 0 Merge query 1 Search type 0 Search iterations 1 Start sensitivity 4 Search steps 1 Exhaustive search mode false Filter results during exhaustive search 0 Strand selection 1 LCA search mode false Disk space limit 0 MPI runner
Force restart with latest tmp false Remove temporary files false

prefilter genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/query_db/query_db genomad_db/genomad_db genomad_output/final_vOTUs_numbered_annotate/final_vOTUs_numbered_mmseqs2/tmp/4444936417411739143/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 4.2 -k 5 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1000000 --split 0 --split-mode 0 --split-memory-limit 0 -c 0.2 --cov-mode 1 --comp-bias-corr 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 20 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 48 --compressed 0 -v 3

Query database size: 56046 type: Aminoacid Estimated memory consumption: 1G Target database size: 227897 type: Profile Process prefiltering step 1 of 1

Index table k-mer threshold: 104 at k-mer size 5 Index table: counting k-mers [=================================================================] 227.90K 10s 479ms Index table: Masked residues: 0 No k-mer could be extracted for the database genomad_db/genomad_db. Maybe the sequences length is less than 14 residues. Error: Prefilter died

the .fna file, the genomad_output directory and the genomad_db directory are all in the same directory, and i am running the command from that directory as well. Any ideas how to fix this? Thanks!!

apcamargo commented 1 year ago

Hi @haleyhallowell. I don't think I've seen this error before. Do you still get it if you remove short sequences? Could you share the input with me?

haleyhallowell commented 1 year ago

Hey! Sure, I can share with you; it is attached! I'm currently waiting on another run to go through the HPC queue. I noticed that when i used pip install genomad that i had to manually install prodigal-gv and mmseqs2, so i made a new environment using a conda install. ill update you once that goes through! final_vOTUs_numbered.txt.zip

Not sure on the short sequences, nothing should be that short as i filter out <2000bp

haleyhallowell commented 1 year ago

Hi! Just wanted to let you know i got it working -- seems like it was the weird install with pip i mentioned above that made it throw this error. Conda install the second time got it working. Thanks!

apcamargo commented 1 year ago

Thanks for letting me know! I'll close the issue.

Maybe it was a problem with the MMseqs2 version? The latest versions of geNomad are only compatible with version 14-7e284.

apcamargo / genomad

mmseqs2 prefilter failed: No k-mer could be extracted for the database genomad_db/genomad_db #25