mmseqs2 prefilter requires too much memory

Bgi-zsy commented 1 year ago

Dear Developer,
I am using genomad to annotate viruses in metagenomic sequencing data. I ran into a problem with the mmseqs2.py prefilter step. I have read the FAQ and used --splits 8, but it still reports that there is not enough memory.

Environment:

Linux x86_64
1000G memory
8 threads
genomad: 1.5.1
mmseqs2 version: 14.7e284

Input file: FASTA.fa (5G size)

My annotate command is:

$MY_PATH/genomad annotate --splits 8 --threads 8 --cleanup $MY_PATH/FASTA.fa $MY_PATH/demo $MY_PATH/genomad_db_v1.1

$MY_PATH stands for the actual working directory path.

The error is as follows:

prefilter $MY_PATH/FASTA_annotate/FASTA_mmseqs2/query_db/query_db $MY_PATH/genomad_db_v1.1/genomad_db $MY_PATH/FASTA_annotate/FASTA_mmseqs2/tmp/11571856592932011841/pref --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -s 4.2 -k 5 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 8 --split-mode 0 --split-memory-limit 0 -c 0.2 --cov-mode 1 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 20 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 8 --compressed 0 -v 3  
Query database size: 33938637 type: Aminoacid
Target split mode. Searching through 8 splits
Estimated memory consumption: 577M
Target database size: 227897 type: Profile
Process prefiltering step 1 of 8
Index table k-mer threshold: 89 at k-mer size 5
Index table: counting k-mers
[=================================================================] 28.46K 85h 53m 21s 171ms
Index table: Masked residues: 0
Can not allocate entries memory in IndexTable::initMemory
Error: Prefilter died

Looking forward to your reply. You can contact me by e-mail at zengshengyin@genomics.cn
Thank you!

apcamargo commented 1 year ago

Hi @Bgi-zsy

8 splits is not enough for just 1GB of memory. Did you try to use more splits? You can try something really high, like --splits 40 (or more).

Alternatively, you can combine --splits with a lower search sensitivity (e.g. -s 3.0, down from the default value of 4.2), which reduces memory usage. I don't recommend it though, as it would decrease both the rate of gene annotation and the classification accuracy.
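As a rough sketch of the two options above, the snippet below only prints the command lines rather than running them. `build_annotate_cmd` is a hypothetical helper written for illustration (it is not part of genomad), the file and directory names are placeholders, and it assumes `--splits` and `-s` are passed to `genomad annotate` as described in this comment.

```shell
# Hypothetical helper: prints a genomad annotate command line instead of running it.
# Arguments: input FASTA, output dir, database dir, number of splits,
# and an optional sensitivity (falls back to the default of 4.2).
build_annotate_cmd() {
    local input="$1"
    local output="$2"
    local db="$3"
    local splits="$4"
    local sensitivity="${5:-4.2}"
    printf 'genomad annotate --cleanup --splits %s -s %s %s %s %s\n' \
        "$splits" "$sensitivity" "$input" "$output" "$db"
}

# Option 1: more splits at the default sensitivity
build_annotate_cmd FASTA.fa demo genomad_db 40

# Option 2: more splits combined with a reduced sensitivity
# (lower memory use, at the cost of annotation rate and accuracy)
build_annotate_cmd FASTA.fa demo genomad_db 40 3.0
```

Printing the command first makes it easy to check the flags before starting a search that can run for many hours.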

EDIT: I just noticed you have 1000G available, not 1G. I've never seen this problem before. Can you try downloading the database again? Which version are you using? (You can find it in the genomad_db/version.txt file.)
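A minimal way to check the database version from the shell might look like this. `db_version` is a hypothetical helper name, and the example path is a placeholder; the only thing taken from this thread is that the version is stored in a `version.txt` file inside the database directory.

```shell
# Hypothetical helper: print the contents of version.txt from a database directory.
# Returns a non-zero status if the file is missing.
db_version() {
    local db_dir="$1"
    if [ -f "$db_dir/version.txt" ]; then
        cat "$db_dir/version.txt"
    else
        echo "version.txt not found in $db_dir" >&2
        return 1
    fi
}

# Example call (placeholder path, adjust to your setup):
# db_version "$MY_PATH/genomad_db"

# To fetch a fresh copy of the database into the current directory:
# genomad download-database .
```

A missing `version.txt` would itself be a hint that the download is incomplete.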

Also, try to run without --splits as you don't need to split the database with this much memory available.
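Before dropping --splits, it may be worth confirming how much memory the job can actually see. The sketch below is only a generic Linux check and rests on an assumption that is not established in this thread, namely that a per-process limit or a scheduler allocation could be smaller than the machine's total RAM. `mem_total_gib` is a hypothetical helper name.

```shell
# Hypothetical helper: total physical memory in GiB, read from a meminfo-style file.
# MemTotal is reported in kB in /proc/meminfo (Linux only).
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Total RAM as seen by the kernel
echo "MemTotal (GiB): $(mem_total_gib)"

# Virtual memory limit of the current shell: "unlimited" or a numeric cap
echo "ulimit -v: $(ulimit -v)"
```

If the limit is far below the total, the allocation can fail no matter how much RAM is installed.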

Bgi-zsy commented 1 year ago

Dear Developer, thank you for your help. After receiving your answer, I updated my genomad database from 1.1 to 1.3 and my command ran successfully. My command is:

genomad end-to-end --cleanup -t 4 --splits 8 $MY_PATH/FASTA.fasta.gz $MY_PATH/Output $MY_PATH/genomad_db1.3/genomad_db

apcamargo commented 1 year ago

That's great to hear! Let me know if you face this issue again.

apcamargo / genomad, issue #21