anuradhawick / LRBinner

LRBinner is a long-read binning tool published in WABI 2021 proceedings and AMB.
https://doi.org/10.4230/LIPIcs.WABI.2021.11
GNU General Public License v2.0

Could we use LRBinner after scaffolding in WGA? #23

Open franztastic opened 3 weeks ago

franztastic commented 3 weeks ago

Hello everyone,

I've used your tool in several metagenomics projects and now, as suggested by my supervisor, I've used it in a whole genome assembly (WGA) as a decontamination tool after scaffolding.

Afterwards, I tried manual curation with Pretext, and it seems that there are no contacts between my scaffolds, and I end up with 33 different chromosomes.

However, when I tested running LRBinner and, later, YAHS, my results were completely different: I now have 17 chromosomes with a lot of contacts between my scaffolds, but with very low coverage.

I understand that the tool is not designed for decontamination after scaffolding, but I'm wondering why the results are so different.

Thank you very much for your answer!

anuradhawick commented 3 weeks ago

Hi,

Were you planning to use it on assembled scaffolds?

Are you able to share a bit more information about the data? Is it long-read metagenomics data?

franztastic commented 3 weeks ago

Hi, yes, on assembled scaffolds, but with a different aim: whole genome assembly, not metagenomics. They are long HiFi reads from 20 individuals of a small arthropod species, and I'm trying to build the WGA with Hi-C reads as well.

Thanks!

anuradhawick commented 2 weeks ago

Hi-C reads may not be usable, but there is a chance that binning can be used to separate these species.

Firstly, you need to find some evidence to support the statement: "similar to metagenomics, intra-species oligonucleotide frequencies are similar while inter-species frequencies are different".

You might like to use this tool to do that screening first and confirm the hypothesis:

https://github.com/anuradhawick/kmertools

You can see an example in its wiki, where a diagram shows the difference: https://github.com/anuradhawick/kmertools/wiki/Oligo-nucleotide-computations#example-application
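To make the hypothesis concrete, here is a rough sketch of the kind of check involved. It is not how kmertools is implemented. It only computes normalised tetranucleotide frequencies per sequence and compares them with a Euclidean distance. The function names `kmer_profile` and `euclidean` and the toy sequences are made up for illustration, and reverse-complement merging is deliberately left out.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Return normalised k-mer frequencies (a list of length 4**k)."""
    # Map every possible k-mer over ACGT to a fixed position in the vector.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows containing N or other symbols
            counts[pos] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy sequences (assumptions, not real data): two AT-rich, one GC-rich.
seq_a = "ATATTTAAATATTAAT" * 20
seq_b = "TTAAATATATTTAATA" * 20
seq_c = "GCGGCCGCGCCGGCGC" * 20

within = euclidean(kmer_profile(seq_a), kmer_profile(seq_b))
between = euclidean(kmer_profile(seq_a), kmer_profile(seq_c))
print(f"within-group distance:  {within:.3f}")
print(f"between-group distance: {between:.3f}")
```

If the hypothesis holds for your data, distances between sequences from the same organism should be clearly smaller than distances to the suspected contaminants.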

A few remarks: LRBinner may not be applicable right away, because it assumes approximately a million long reads and works best with coverages from 10X to 100X. But if you have these estimates, I am very happy to help.
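For the coverage estimate, a back-of-the-envelope calculation is total read bases divided by genome size. The sketch below assumes you know the mean read length and have a genome size estimate; the helper names and the numbers in the example are placeholders, not values from this thread. Only the 10X to 100X range comes from the remark above.

```python
def estimate_coverage(n_reads, mean_read_length, genome_size):
    """Approximate coverage as total sequenced bases / genome size."""
    return n_reads * mean_read_length / genome_size


def in_suggested_range(coverage, low=10, high=100):
    """Check whether coverage falls inside the 10X to 100X range."""
    return low <= coverage <= high


# Placeholder values (assumptions): 1 million reads of 10 kb, 500 Mb genome.
cov = estimate_coverage(1_000_000, 10_000, 500_000_000)
print(f"estimated coverage: {cov:.1f}X, in range: {in_suggested_range(cov)}")
```

Note that in a mixed sample this is an average over everything that was sequenced, so the coverage of each organism can differ a lot from it.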

Let me know

franztastic commented 2 weeks ago

Oh god, my previous messages were not clear at all. Our data comes from 20 different individuals of the same species, and we want to produce a whole genome assembly. As the individuals are really tiny, we are sure that there are a lot of contaminants, so we thought of using LRBinner to remove those contaminants and work only with our species bin. Our long reads are PacBio HiFi reads, about 4 million of them, and we assume a coverage of about 50x. We've tried two different approaches:

  1. Using LRBinner prior to scaffolding (we have Hi-C reads for scaffolding). When we checked our assembly using Pretext, we found that we had lost almost all contacts between chromosomes, so the contact map shows 34 chromosomes.
  2. Using LRBinner after scaffolding. In Pretext, the coverage is really low (between 1 and 5), but I see 17 chromosomes.

We assume that LRBinner is not made for decontamination; however, I'm not sure I understand why I see such differences.

Sorry for the inconsistency of my earlier messages, and thank you very much for providing these tools. I'll also check the other one you mention and study my species a bit further.

anuradhawick commented 1 week ago

Ok I got it now. Sorry.

It's hard to give a straight answer, because in binning we expect good clusters, whereas contamination may or may not form a distinct cluster.

But I guess you might have some luck, because contamination naturally tends to have very low coverage.
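As a very rough illustration of that idea, the sketch below flags sequences whose mean depth is far below the median depth of the whole set. This is only a heuristic under stated assumptions, not something LRBinner does: the function name `flag_low_coverage`, the 0.2 fraction, and the depth values are all hypothetical, and you would need to compute per-contig depth yourself from your read alignments.

```python
from statistics import median


def flag_low_coverage(depth_by_contig, fraction=0.2):
    """Return names of contigs whose depth is below fraction * median depth."""
    if not depth_by_contig:
        return []
    cutoff = fraction * median(depth_by_contig.values())
    return sorted(name for name, depth in depth_by_contig.items()
                  if depth < cutoff)


# Hypothetical mean depths per contig (not real data).
depths = {"contig_1": 50.0, "contig_2": 48.0, "contig_3": 52.0, "contig_4": 3.0}
print(flag_low_coverage(depths))
```

Anything flagged this way is only a candidate; low depth can also come from repeats, collapsed regions, or mapping problems, so it is worth checking the flagged sequences by another route before discarding them.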