anuradhawick / MetaBCC-LR

Reference-free Binning of Metagenomics Long Reads using Coverage and Composition
https://doi.org/10.1093/bioinformatics/btaa441
MIT License

Fasta support? #1

Closed alxsimon closed 3 years ago

alxsimon commented 3 years ago

Hi, I would like to try your pipeline for the classification of fasta sequences (this is in fact a genome assembly where I want to remove contamination).

Are the reads quality scores used for something in the pipeline?

If not, would it be possible to implement fasta support?

As a first approach I may try to create a dummy fastq (but in the end this would be a waste of time and resources if quality scores are not used). Thanks
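For reference, the dummy FASTQ workaround mentioned above could look roughly like the following. This is a minimal sketch using only the Python standard library; the function names `read_fasta` and `fasta_to_dummy_fastq` and the placeholder quality character `I` are assumptions made for illustration and are not part of MetaBCC-LR.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_dummy_fastq(fasta_handle, fastq_handle, quality_char="I"):
    """Write every FASTA record as FASTQ with one constant, made-up quality.

    The quality string carries no real information; it only satisfies
    parsers that insist on the four-line FASTQ layout.
    Returns the number of records written.
    """
    written = 0
    for name, seq in read_fasta(fasta_handle):
        fastq_handle.write(f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n")
        written += 1
    return written


if __name__ == "__main__":
    # Tiny in-memory demonstration; real use would pass open file handles.
    demo_in = io.StringIO(">contig_1\nACGTAC\nGGT\n")
    demo_out = io.StringIO()
    fasta_to_dummy_fastq(demo_in, demo_out)
    print(demo_out.getvalue(), end="")
```

The conversion doubles the size of the sequence data on disk, which is the wasted effort the question refers to if the quality scores are never read.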

anuradhawick commented 3 years ago

Hello, Thanks for reaching out.

FASTA is supported. I will update the README to reflect this. The FASTQ restriction no longer applies, as we have implemented read filtering using Biopython. I hope this helps.

Let us know how it goes. We are working on an improved version of this tool and your input will be highly appreciated.

anuradhawick commented 3 years ago

Implemented and README was updated.

alxsimon commented 3 years ago

Thank you!

alxsimon commented 3 years ago

I tried it and the tool only found 1 bin, while I am sure there is extensive bacterial contamination (~20% of the assembly) in this molluscan genome. Which parameters most influence the binning?

Here is what I tried:

python MetaBCC-LR/MetaBCC-LR --reads-path edu_v5.split.fa --threads 32 --max-memory 20000 --output ./edu_bins --sample-count 100 --sensitivity 10

Additionally, the folder images/ is empty, but no error message was displayed. Is this expected?

anuradhawick commented 3 years ago

The tool is currently developed to bin PacBio and ONT reads. We have only tested it on datasets of 100,000 to 1,000,000 reads and did not check smaller datasets. I don't think our approach will perform well with assemblies. Usually, the sample count needs to be at least 5000 reads to detect the bins.
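The numbers in this reply suggest a quick sanity check before running the tool. The sketch below is only an illustration: it counts FASTA records and compares a chosen sample count against the 5000-read lower bound quoted above. The helper names and the constant `MIN_SAMPLE_COUNT` are hypothetical and are not read from MetaBCC-LR itself.

```python
import io

# Lower bound quoted in this thread; an assumption here, not a value
# taken from the tool's source code.
MIN_SAMPLE_COUNT = 5000


def count_fasta_records(handle):
    """Count records in a FASTA text handle by counting header lines."""
    return sum(1 for line in handle if line.startswith(">"))


def sampling_warnings(n_records, sample_count, min_sample_count=MIN_SAMPLE_COUNT):
    """Return human-readable warnings about a planned sample count."""
    warnings = []
    if sample_count < min_sample_count:
        warnings.append(
            f"sample count {sample_count} is below the suggested minimum {min_sample_count}"
        )
    if n_records < sample_count:
        warnings.append(
            f"only {n_records} records available for a sample of {sample_count}"
        )
    return warnings


if __name__ == "__main__":
    n = count_fasta_records(io.StringIO(">a\nACGT\n>b\nGGCC\n"))
    for message in sampling_warnings(n, sample_count=100):
        print("warning:", message)
```

With the `--sample-count 100` used in the command above, the first warning would fire regardless of how many sequences the input contains.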

We cannot use contigs because we count the k-mers of all reads to estimate the coverage that supports binning. I hope this clears up the doubt. Could you try using the raw set of reads if they are longer than 1000 bp each?
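To make the coverage argument concrete, here is a simplified sketch of estimating per-read coverage from k-mer counts. It is not the MetaBCC-LR implementation: the function names, the use of the median, and the omission of reverse-complement (canonical) k-mers are all assumptions made to keep the example short.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def count_kmers(reads, k):
    """Count k-mer occurrences across the whole read set."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def read_coverage(read, counts, k):
    """Use the median dataset-wide count of a read's k-mers as a coverage proxy."""
    values = [counts[kmer] for kmer in kmers(read, k)]
    return median(values) if values else 0


if __name__ == "__main__":
    # Toy data: one sequence sampled three times, another sampled once.
    reads = ["ACGTAC", "ACGTAC", "ACGTAC", "TTGGCC"]
    counts = count_kmers(reads, k=3)
    for read in sorted(set(reads)):
        print(read, read_coverage(read, counts, k=3))
```

In raw reads, k-mers from an abundant organism recur many times, so reads from that organism receive a high value. An assembly has already collapsed those repeated observations into consensus contigs, so counts computed on contigs would largely stop reflecting sequencing depth, which is presumably why the approach is not expected to work on assemblies.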

alxsimon commented 3 years ago

Unfortunately, this is 10X chromium data, so short reads.

Thanks anyway for your answers. Best regards

anuradhawick commented 3 years ago

Please have a look at GraphBin and GraphBin2 from our group. It might be helpful if you are using short-read assemblies.

Best regards, Anuradha