gbouras13 / hybracter

Automated long-read first bacterial genome assembly tool implemented in Snakemake using Snaketool.
MIT License

How contamination-sensitive is hybracter? #64

Open stas-malavin opened 8 months ago

stas-malavin commented 8 months ago

Hello George, Thanks for this nice piece of software. I have to assemble the genome of a single symbiotic bacterium from bivalve gills. Symbiont DNA prevails, but there's also some minor share of other bacteria that were on the gills (the symbiont is inside). How would you recommend cleaning the reads (both Nanopore and Illumina)? How sensitive is hybracter to contamination?

What I tried: assembled the short reads with spades --meta, binned the contigs, identified the symbiont bin, mapped the long reads to that bin, assembled the mapped reads with Flye/Raven, and polished with Medaka. CheckM2 still detects some contamination, and some BUSCOs are missing from the assembly (maybe due to the absence of short-read polishing?).

The question is, should I do my best to carefully identify the symbiont reads (annotate genes? map to the genomes of other detected organisms?), or can I just feed hybracter what I already have?

gbouras13 commented 8 months ago

Hi @stas-malavin ,

This is a really good question. I hadn't intended Hybracter to be used like this, but it should work! Hybracter will assemble complete 'chromosome(s)' for anything that is circular and above a certain size. So if you have enough long reads to assemble a complete symbiont genome with Flye, then that will be recovered.

Any contamination will end up in the non-circular 'plasmid' contigs in hybracter's output.
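As a rough illustration of the "circular and above a certain size" rule, here is a minimal sketch that reads a Flye-style `assembly_info.txt` table and separates contigs that are both circular and at least `min_chrom_len` long from everything else. This is not hybracter's actual implementation: the function name, the `min_chrom_len` parameter, the example table, and the assumption that circularity is marked `Y`/`N` in the fourth column are all illustrative.

```python
import csv
import io


def split_contigs(assembly_info_text, min_chrom_len):
    """Split contigs into (chromosome_like, other) lists of names.

    Assumes a tab-separated table whose columns start with
    seq_name, length, cov., circ. (Y/N), as in Flye's assembly_info.txt.
    """
    chromosome_like, other = [], []
    reader = csv.reader(io.StringIO(assembly_info_text), delimiter="\t")
    for row in reader:
        # Skip blank lines and the '#'-prefixed header line.
        if not row or row[0].startswith("#"):
            continue
        name, length, circular = row[0], int(row[1]), row[3]
        if circular == "Y" and length >= min_chrom_len:
            chromosome_like.append(name)
        else:
            # Linear contigs, and circular ones below the cutoff, land here.
            other.append(name)
    return chromosome_like, other


# Hypothetical table, made up for illustration only.
example = (
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t4200000\t35\tY\tN\t1\t*\t1\n"
    "contig_2\t686000\t14\tN\tN\t1\t*\t2\n"
    "contig_3\t52000\t80\tY\tN\t2\t*\t3\n"
)
chrom, rest = split_contigs(example, min_chrom_len=1_000_000)
print("chromosome-like:", chrom)
print("other:", rest)
```

With a fragmented, low-coverage assembly, nothing would pass the circularity test, so every contig (symbiont or contaminant) would fall into the second list.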

Regardless, I'd give it a go and let me know how you go. Another good option would be to use Unicycler and see if that recovers any circular contigs.

George

stas-malavin commented 7 months ago

Hi @gbouras13 , Thanks for the answer!

Just as a disclaimer, there are two problems with this dataset: the long reads are short (mean ~1200 nt) and the coverage is low (12–15×, estimated by Flye). So nothing circular: I managed to get 61 contigs (686K longest, 309K N50) with hybracter and 57 contigs (599K longest, 309K N50) with dragonflye, with all other characteristics quite similar too. (For this, I used the reads mapped to the short-read bin.)

And now for the very nice results I got with hybracter from the whole "metagenome". For this, I took all long reads longer than 1000 nt and the short reads subsampled to a fraction of 0.1 (there are a lot of them). No cleaning/filtering. It assembled 138 contigs. Then I used BUSCO with the Chromatiales dataset (the closest) to keep only those contigs that contained BUSCO genes. There were 15 of these, with the longest at 1.08M and N50 = 403K. The coverage of the longest contig was 30×. All the BUSCO and CheckM2 values were superior to what I got before, with the same number of tRNAs and rRNAs.
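For anyone wanting to reproduce the steps above, here is a minimal sketch of the three operations: the length filter on long reads, the random subsampling of short reads, and picking the contigs that carry at least one BUSCO hit. It is an illustration and not the exact commands used. The function names are hypothetical, reads are represented as plain `(name, sequence)` tuples, and the BUSCO part assumes a tab-separated `full_table.tsv` with the BUSCO id, status, and contig name in the first three columns.

```python
import random


def filter_long_reads(records, min_len=1000):
    """Keep (name, seq) records whose sequence is longer than min_len."""
    return [(name, seq) for name, seq in records if len(seq) > min_len]


def subsample(records, fraction=0.1, seed=0):
    """Keep each record with probability `fraction`.

    For paired-end data, pass one record per read pair so that
    mates are kept or dropped together.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [rec for rec in records if rng.random() < fraction]


def contigs_with_busco_hits(full_table_text,
                            keep=("Complete", "Duplicated", "Fragmented")):
    """Return the set of contig names that carry at least one BUSCO hit.

    Assumed columns: BUSCO id, status, sequence (contig) name, ...
    Rows with status 'Missing' have no contig column and are skipped.
    """
    contigs = set()
    for line in full_table_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) >= 3 and fields[1] in keep:
            contigs.add(fields[2])
    return contigs


# Hypothetical toy inputs, made up for illustration only.
reads = [("r1", "A" * 1500), ("r2", "A" * 800), ("r3", "A" * 1001)]
long_reads = filter_long_reads(reads)

table = (
    "# Busco id\tStatus\tSequence\tGene Start\tGene End\n"
    "1at\tComplete\tcontig_1\t10\t900\n"
    "2at\tMissing\n"
    "3at\tFragmented\tcontig_7\t5\t300\n"
    "4at\tComplete\tcontig_1\t2000\t2900\n"
)
kept_contigs = contigs_with_busco_hits(table)
print([name for name, _ in long_reads])
print(sorted(kept_contigs))
```

Note that filtering on BUSCO hits keeps only contigs with marker genes, so symbiont contigs without any BUSCO gene would be dropped, and a contaminant contig with a marker matching the chosen lineage dataset could be kept.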

My PI previously ran NanoPhase on the same dataset. Interestingly, he got a similar number of CDS (4793, compared to 4642 for hybracter on the "metagenome" and 5157 for hybracter on the MAG-mapped reads), but 44 contigs, an N50 of 287K, and poorer BUSCO/CheckM2 values.

I wonder how it would work with binning instead of my BUSCO step.