cerebis / bin3C

Extract metagenome-assembled genomes (MAGs) from metagenomic data using Hi-C.
GNU Affero General Public License v3.0

Binning with references split into fragments #37

Closed xfengnefx closed 2 years ago

xfengnefx commented 3 years ago

Hi,

The Typical Workflow section of README.md suggests that the reference (contigs) may be split before mapping Hi-C reads. I assume that later, during mkmap, the fragmented reference is used instead of the original one, and that the output bins (bin3C_clust/fasta/*.fna) contain contig fragments. Please correct me if I've misunderstood anything.

What post-processing would you recommend for these bins? In a regular workflow, each (raw) contig appears in one and only one bin. So I tried assigning each (raw) contig to the bin that holds most of its fragments, which did not turn out well. I wonder whether I'm doing it incorrectly, or whether splitting is just not the way to go for this dataset.
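For concreteness, the majority-vote reassignment described above could look roughly like the sketch below. This is only an illustration, not part of bin3C: the input format (tuples of parent contig, bin, fragment length), the function name `assign_contigs`, and the `min_fraction` threshold are all assumptions. How fragment names map back to their parent contig depends on how the reference was split and is left to the caller.

```python
from collections import defaultdict


def assign_contigs(fragments, min_fraction=0.5):
    """Assign each parent contig to the bin holding most of its fragment length.

    fragments: iterable of (parent_contig, bin_id, fragment_length);
               bin_id is None for fragments that ended up in no bin.
    min_fraction: hypothetical cut-off; the winning bin must cover at least
                  this share of the contig's total fragment length.
    Returns (assigned, ambiguous):
      assigned  maps contig -> (bin_id, support fraction)
      ambiguous maps contig -> {bin_id: summed length} for manual review
    """
    # Sum fragment length per contig and per bin (length-weighted vote).
    totals = defaultdict(lambda: defaultdict(int))
    for contig, bin_id, length in fragments:
        totals[contig][bin_id] += length

    assigned, ambiguous = {}, {}
    for contig, per_bin in totals.items():
        total = sum(per_bin.values())
        # Unbinned fragments count toward the total but cannot win the vote.
        binned = {b: n for b, n in per_bin.items() if b is not None}
        if binned and total > 0:
            best_bin, best_len = max(binned.items(), key=lambda kv: kv[1])
            support = best_len / total
            if support >= min_fraction:
                assigned[contig] = (best_bin, support)
                continue
        ambiguous[contig] = dict(per_bin)
    return assigned, ambiguous
```

Contigs that land in `ambiguous` have fragments spread across several bins (or mostly unbinned), which is the case where a simple majority vote is least trustworthy.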

Thanks!

cerebis commented 2 years ago

Hello @xfengnefx.

You'll have to point me to the section of README.md that suggests fragmenting the assembly contigs. I did explore the effects of splitting contigs during development, but I could not measure an advantage to doing so. In addition, you then have the potential problem of reassociating the many pieces.

The standard workflow for bin3C is to take the assembly contigs as they are given. I do not have an ancillary workflow for handling split contigs.

The problem of soft-clustering in bin3C (multiple bin assignments per contig) is an open one. The clustering algorithms I have tested to date have not performed well enough to supplant infomap. The most likely approach will be a second stage of reconciliation using a measure of significant association; however, my experiments along this development path have not been completed.

xfengnefx commented 2 years ago

Thanks for the reply! Good to know. I think I was referring to the "Optional step. Split references into fragments" part (i.e. to use split_ref.py) to try to bin some rather long contigs from long read assemblies.

cerebis commented 2 years ago

Since you mention it, over the last year I have also been working on metagenomic assemblies based on long reads. I have been noticing that for some MAGs within some assemblies, the contact matrices seem to have contigs with pockets of strong association followed by large regions with none. These can also appear to interact with other MAGs.

In one instance, I had been pooling multiple long-read assemblies (assembled with Flye) and these suspicious looking interactions ended up being caused by chimeric contigs (between kingdoms no less).

I suspect that a large portion of this is not due to spurious assignment but the nature of long-read (ONT for me) assemblies with Flye. I have tried to counter this by increasing the stringency of an acceptable pair during the mkmap stage of bin3C. This work is found on the py3 development branch. I currently have some outstanding commits to push back to this branch -- but testing is not complete.
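As a rough illustration of what a stricter pair-acceptance rule could look like, here is a minimal sketch. It is not the mkmap logic from bin3C or its py3 branch: the `Mate` record, the `accept_pair` function, and both thresholds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Mate:
    """Hypothetical summary of one aligned read of a Hi-C pair."""
    contig: str
    mapq: int          # mapping quality reported by the aligner
    aligned_len: int   # number of read bases covered by the alignment
    read_len: int      # full read length


def accept_pair(m1, m2, min_mapq=60, min_aligned_fraction=0.9):
    """Keep a pair only if BOTH mates pass every threshold (assumed values)."""
    for mate in (m1, m2):
        if mate.mapq < min_mapq:
            return False
        if mate.read_len <= 0:
            return False
        if mate.aligned_len / mate.read_len < min_aligned_fraction:
            return False
    return True
```

Raising either threshold discards more pairs, trading contact signal for fewer ambiguous or partial alignments.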

xfengnefx commented 2 years ago

Interesting, thanks for the note on the dev branch and binning of ONT assemblies.

I've only tried to bin PacBio HiFi-based assemblies. My impression is that the main pitfall was that, because HiFi assemblies can separate some closely related strains or genomic regions, both bin3C and the other binning tools I tried tend to generate over-complete bins (flagged with a high contamination rate) that contain multiple haplotypes.

> caused by chimeric contigs

Out of curiosity, is this evaluated by taxonomy based on 16S rRNA / protein coding genes?