anuradhawick / MetaBCC-LR

Reference-free Binning of Metagenomics Long Reads using Coverage and Composition
https://doi.org/10.1093/bioinformatics/btaa441
MIT License

Only one bin found in final.txt #13

Open Bioinformations opened 1 year ago

Bioinformations commented 1 year ago

Hello. When I use this command: "python mbcclr --resume -r nanoflit.fasta -o test_output -e umap -c 613187 -k 5 -t 100", final.txt contains only one bin, Bin-1. I am not sure whether the parameters are set correctly.

The log output of the software is as follows:

2023-02-22 09:16:59,945 - INFO - Command mbcclr --resume -r PGA-nanoflit.fasta -o test_output2 -e umap -c 613187 -k 5 -t 100
2023-02-22 09:16:59,945 - INFO - Resuming the program from previous checkpoints
2023-02-22 09:16:59,945 - INFO - Counting K-mers
INPUT FILE PGA-nanoflit.fasta
OUTPUT FILE test_output2/profiles/3mers
K_SIZE 5
THREADS 100
Profile Size 512
Total 5-mers 1024
Loaded Reads 6131871
2023-02-22 09:40:24,180 - INFO - Counting K-mers complete
2023-02-22 09:40:24,181 - INFO - Counting 15-mers
INPUT FILE PGA-nanoflit.fasta
OUTPUT FILE test_output2/profiles/15mers-counts
THREADS 100
Loaded Reads 6131871
WRITING TO FILE
COMPLETED : Output at - test_output2/profiles/15mers-counts
2023-02-22 09:48:27,783 - INFO - Counting 15-mers complete
2023-02-22 09:48:27,784 - INFO - Generating 15-mer profiles
K-Mer file test_output2/profiles/15mers-counts
LOADING KMERS TO RAM
FINISHED LOADING KMERS TO RAM
INPUT FILE PGA-nanoflit.fasta
OUTPUT FILE test_output2/profiles/15mers
THREADS 100
BIN WIDTH 10
BINS IN HIST 32
Loaded Reads 6131871
COMPLETED : Output at - test_output2/profiles/15mers
2023-02-22 09:54:35,125 - INFO - Generating 15-mer profiles complete
2023-02-22 09:54:35,126 - INFO - Sampling Reads
2023-02-22 10:07:02,935 - INFO - Sampling reads complete
2023-02-22 10:07:02,936 - INFO - Binning sampled reads
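For readers puzzling over the "Profile Size 512" and "Total 5-mers 1024" lines in the log: the numbers are consistent with counting each k-mer together with its reverse complement. There are 4^5 = 1024 possible 5-mers, and merging each one with its reverse complement leaves 512 distinct entries. The sketch below reproduces that arithmetic by brute force. It is an illustration only, not MetaBCC-LR's own code, and the function names are hypothetical.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical_kmer_count(k: int) -> int:
    """Count k-mers after merging each k-mer with its reverse complement."""
    canonical = set()
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        # Represent the pair by whichever of the two strings sorts first.
        canonical.add(min(kmer, reverse_complement(kmer)))
    return len(canonical)


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(f"k={k}: total={4 ** k}, canonical={canonical_kmer_count(k)}")
```

For k = 5 this gives 1024 total and 512 canonical k-mers, matching the log. Smaller k values give much shorter profiles (32 entries for k = 3).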

anuradhawick commented 1 year ago

Hi,

You need to try different parameters.

Originally I used:

python mbcclr --resume \
        --reads-path nanoflit.fasta \
        --bin-size 32 \
        --bin-count 32 \
        --output test_output \
        --embedding tsne \
        --k-size \
        --threads 8

Note: tsne works okay. UMAP sometimes behaves differently. Since this was developed for raw reads, I'd start with a smaller k-mer size like 3 or 4 to see how it performs.

Change -c to something like 10000, 20000 or 50000.
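One way to try these suggestions systematically is to generate a command line for each combination of k-mer size and sample count. The sketch below only builds and prints the commands; it does not run them. It uses only the short flags that appear in the command from the original report (-r, -o, -e, -c, -k, -t). The helper name and the output directory naming scheme are hypothetical.

```python
import shlex
from itertools import product


def build_commands(reads, out_prefix, k_sizes=(3, 4),
                   sample_counts=(10000, 20000, 50000), threads=8):
    """Build one mbcclr argument list per (k-size, sample-count) pair."""
    commands = []
    for k, c in product(k_sizes, sample_counts):
        # Separate output directory per run so results do not overwrite each other.
        out_dir = f"{out_prefix}_k{k}_c{c}"
        commands.append([
            "python", "mbcclr",
            "-r", reads,
            "-o", out_dir,
            "-e", "tsne",
            "-c", str(c),
            "-k", str(k),
            "-t", str(threads),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("nanoflit.fasta", "test_output"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or passed to `subprocess.run` once the settings look right.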

Let me know if this worked.

Also check out my other long-read binning tools, such as LRBinner and OBLR.

anuradhawick commented 1 year ago

FYI

https://github.com/anuradhawick/LRBinner

https://github.com/anuradhawick/oblr

Bioinformations commented 1 year ago

Thanks for your response! I've tried several combinations of parameters, but I still get only a few bins, sometimes 3 and sometimes 2. I also find that every separated bin is too big, almost 5 GB. I also tried LRBinner with the following parameters:

python lrbinner.py reads -r ../out_dir/0.2sample.fasta -bc 10 -bs 32 -o lrb_output --resume -mbs 5000 --ae-dims 4 --ae-epochs 200 -bit 0 -t 100

When the program finished, I found only 3 bins, and those bins are big too. Do I need to try more values for the bin count (-bc) and bin size (-bs)?
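To compare runs, it can help to look at how many reads land in each bin rather than only at file sizes. The sketch below tallies bin labels, assuming the result file holds one bin label per line (for example "Bin-1"). That layout is an assumption here, not something confirmed in this thread, and the function name is hypothetical.

```python
from collections import Counter


def summarize_bins(lines):
    """Return (label, read_count, fraction) tuples, largest bin first.

    Assumes one bin label per line; blank lines are ignored.
    """
    counts = Counter(line.strip() for line in lines if line.strip())
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


if __name__ == "__main__":
    # Small in-memory demo standing in for a real result file.
    demo = ["Bin-1\n", "Bin-1\n", "Bin-2\n", "Bin-1\n"]
    for label, n, fraction in summarize_bins(demo):
        print(f"{label}\t{n}\t{fraction:.1%}")
```

To run it on real output, pass an open file handle instead of the demo list, e.g. `summarize_bins(open(path))`, where `path` points at the final.txt of a given run.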

anuradhawick commented 1 year ago

It seems like a complex dataset.

Try increasing -bit to 50, and also increase -bc and -bs, and see whether that helps.

How many bins do you expect to find? Did you try OBLR?

Bioinformations commented 1 year ago

Yeah, I think it's really a complex dataset. I want to get more than 40 bins from it, because I have tried Illumina sequencing data for the same sample and it could be binned into almost 100 bins. I'm trying larger parameter values and will try OBLR too.