BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

errors running single_easy_bin #96

Closed Luponsky closed 2 years ago

Luponsky commented 2 years ago

Hello dev team, thanks for your efforts developing SemiBin and helping the community. I wanted to use the `single_easy_bin` option to bin my co-assembled contigs.

At the beginning I had the same problem as #93, so I tried the option with a single thread (`-t 1`), but then I hit another problem:

```
Time for merging to orfs_aa_h: 0h 0m 0s 3ms
Time for merging to orfs_aa: 0h 0m 0s 4ms
Time for processing: 0h 0m 0s 120ms
prefilter /tmp/tmpa1zt1tqy/8805611602350589060/orfs_aa /home/userbio/.cache/SemiBin/mmseqs2-GTDB/GTDB /tmp/tmpa1zt1tqy/8805611602350589060/orfs_pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -s 2 -k 0 --k-score 2147483647 --alph-size nucl:5,aa:21 --max-seq-len 65535 --max-seqs 1 --split 0 --split-mode 2 --split-memory-limit 0 -c 0 --cov-mode 0 --comp-bias-corr 1 --diag-score 0 --exact-kmer-matching 0 --mask 1 --mask-lower-case 0 --min-ungapped-score 3 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca 1 --pcb 1.5 --threads 1 --compressed 0 -v 3

Query database size: 22570 type: Aminoacid
Target split mode. Searching through 7 splits
Estimated memory consumption: 46G
Target database size: 106052079 type: Aminoacid
Process prefiltering step 1 of 7

Index table k-mer threshold: 163 at k-mer size 7
Index table: counting k-mers [=================================================================] 100.00% 15.16M 24m 54s 176ms
Index table: Masked residues: 38524483
Index table: fill
Killed========================================================> ] 94.00% 14.25M eta 2m 52s
Error: orf filter prefilter died
Error: Running mmseqs taxonomy fail
```

Do you know what it could be related to?

This was the command:

```
SemiBin single_easy_bin -i ATT.fa -b ATT*.bam -o ATT_coassembly -t 1
```

SemiBin version: 1.0.1, installed with conda.

All the best, L

luispedro commented 2 years ago

The issue in #93 should be fixed in SemiBin v1.0.1

The message I see is `Killed`, so my guess is that the process exceeded the memory limits.

How are you running this? Normally mmseqs can run even without much memory (although it takes much longer), so maybe it is making a mistake when it tries to determine how much memory it can use.
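A `Killed` message on Linux usually means the out-of-memory killer terminated the process; the mmseqs log above estimated roughly 46 GB for the prefilter. As a quick sanity check (a minimal sketch, not part of SemiBin; the 46 GB constant is taken from the log line "Estimated memory consumption: 46G"), you can compare that estimate against the machine's physical RAM:

```python
import os

# Taken from the mmseqs log: "Estimated memory consumption: 46G"
ESTIMATED_NEED_GB = 46

# Physical RAM = page size (bytes) * number of physical pages (Linux/POSIX)
page_size = os.sysconf("SC_PAGE_SIZE")
phys_pages = os.sysconf("SC_PHYS_PAGES")
total_gb = page_size * phys_pages / 1024**3

print(f"Total RAM: {total_gb:.1f} GB")
if total_gb < ESTIMATED_NEED_GB:
    print("The prefilter may be OOM-killed: less RAM than mmseqs' estimate.")
```

If the machine has less RAM than the estimate, forcing mmseqs to use more target splits (at the cost of runtime) is the usual workaround.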

Luponsky commented 2 years ago

Hello, I was running it as mentioned above:

```
SemiBin single_easy_bin -i ATT.fa -b ATT*.bam -o ATT_easy_bin -t 1
```

I have just tried to run it without the `-t 1`, thus:

```
SemiBin single_easy_bin -i ATT.fa -b ATT*.bam -o ATT_easy_bin
```

and this is the new error:

```
[...]
2022-05-13 12:43:17,091 - Generate training data of 0:
2022-05-13 12:43:17,165 - Number of must link pair:119
2022-05-13 12:43:17,165 - Number of can not link pair:471
100%|█████████████████████████████████████████████████████████████████████████████████| 20/20 [00:06<00:00, 3.28it/s]
2022-05-13 12:43:23,097 - Training finished.
2022-05-13 12:43:23,164 - Start binning.
2022-05-13 12:43:30,193 - Edges:28693
2022-05-13 12:43:31,302 - Reclustering.
Traceback (most recent call last):
  File "/home/userbio/miniconda3/envs/SemiBin/bin/SemiBin", line 10, in <module>
    sys.exit(main())
  File "/home/userbio/miniconda3/envs/SemiBin/lib/python3.9/site-packages/SemiBin/main.py", line 1040, in main
    single_easy_binning(
  File "/home/userbio/miniconda3/envs/SemiBin/lib/python3.9/site-packages/SemiBin/main.py", line 865, in single_easy_binning
    binning(logger, args.num_process, data_path,
  File "/home/userbio/miniconda3/envs/SemiBin/lib/python3.9/site-packages/SemiBin/main.py", line 810, in binning
    cluster(
  File "/home/userbio/miniconda3/envs/SemiBin/lib/python3.9/site-packages/SemiBin/cluster.py", line 218, in cluster
    seeds = cal_num_bins(
  File "/home/userbio/miniconda3/envs/SemiBin/lib/python3.9/site-packages/SemiBin/utils.py", line 382, in cal_num_bins
    contig_output = run_orffinder(fasta_path, num_process, tdir)
  File "/home/userbio/miniconda3/envs/SemiBin/lib/python3.9/site-packages/SemiBin/utils.py", line 335, in run_prodigal
    f.write(open(os.path.join(output, 'contig_{}.faa'.format(index)), 'r').read())
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpp209kxq9/contig_31.faa'
```

Thanks for the support, L

psj1997 commented 2 years ago

There is another bug in this situation. You can try our latest version on GitHub and install it from source.

Or maybe try something like `-t 6`.
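For context, the traceback fails where SemiBin concatenates per-chunk prodigal outputs and one chunk's file (`contig_31.faa`) was never written. The sketch below illustrates that merge step with a defensive existence check; the function name and signature are hypothetical and this is not the actual upstream fix:

```python
import os
import tempfile
from pathlib import Path

def merge_orf_outputs(output_dir, n_chunks, merged_path):
    """Concatenate per-chunk prodigal outputs (contig_{i}.faa) into one file,
    skipping chunks whose output was never produced -- the situation behind
    the FileNotFoundError above. Hypothetical sketch, not SemiBin's code."""
    with open(merged_path, "w") as out:
        for i in range(n_chunks):
            chunk = os.path.join(output_dir, "contig_{}.faa".format(i))
            if not os.path.exists(chunk):  # e.g. contig_31.faa missing
                continue
            with open(chunk) as f:
                out.write(f.read())

# Demo with one chunk (contig_1.faa) deliberately missing
with tempfile.TemporaryDirectory() as d:
    Path(d, "contig_0.faa").write_text(">a\nMK\n")
    Path(d, "contig_2.faa").write_text(">b\nML\n")
    merged = Path(d, "merged.faa")
    merge_orf_outputs(d, 3, merged)
    print(merged.read_text())  # both records; the missing chunk is skipped
```

Skipping a missing chunk hides the real question (why prodigal produced no output for that chunk), so in practice one would also log a warning there.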

Sincerely Shaojun

Luponsky commented 2 years ago

Thanks! Using neither the maximum nor the minimum number of threads, binning worked fine. Just one last question :) SemiBin uses GTDB as a reference for binning; does it keep track of which genomes it uses as references? L

psj1997 commented 2 years ago

Thanks!

No, SemiBin just uses the annotation results to generate cannot-link constraints between contigs (contigs that belong to different bins) for training the model.
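The cannot-link idea can be illustrated with a toy sketch (hypothetical labels; this is not SemiBin's actual implementation): two contigs annotated with different taxa should not end up in the same bin, so every such pair becomes a constraint:

```python
from itertools import combinations

def cannot_link_pairs(contig_taxonomy):
    """Illustrative sketch: derive cannot-link constraints from per-contig
    taxonomy annotations. Contigs with different taxon labels must not be
    placed in the same bin."""
    pairs = []
    for (c1, t1), (c2, t2) in combinations(contig_taxonomy.items(), 2):
        if t1 != t2:  # different taxa -> cannot share a bin
            pairs.append((c1, c2))
    return pairs

# Hypothetical genus-level annotations
labels = {"contig_1": "Escherichia", "contig_2": "Bacillus", "contig_3": "Escherichia"}
print(cannot_link_pairs(labels))
# -> [('contig_1', 'contig_2'), ('contig_2', 'contig_3')]
```

The actual reference genomes only influence training through these pairs, which is why SemiBin does not report which genomes were matched.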

Sincerely Shaojun