can BUSCO database be used as the criteria for Bin_refinement

nshrinath1994 commented 6 years ago

Hey,

I have been working on metagenomic data from a microbiome that contains both yeast and bacteria. We ran WGS on our samples on an Illumina platform. From the sequencing results I can resolve the bacterial genomes, but I have a hard time binning or evaluating the assemblies of the yeast (eukaryotic) part. Whenever I run bin refinement on the entire dataset (bins created from all the data, without filtering out eukaryotic reads), the program just stops. It works great on bins created from the filtered data (only the prokaryotic assemblies, filtered using the EukRep pipeline) and gives me 100% complete bins.

This made me wonder: if you reconfigured the bin_refinement pipeline to use BUSCO databases (which include single-copy gene sets for some eukaryotes too) instead of CheckM, would that make the pipeline suitable for binning assemblies that contain eukaryotes?

I'm very new to this kind of work and just wanted to know whether this makes sense or whether there is some flaw in my understanding.

ursky commented 6 years ago

Hey there, thank you for the feedback. Unfortunately, eukaryotic genome extraction and analysis is not within the scope of this project. There are very few pipelines or software tools for this purpose, and not without reason: it is very difficult and requires an entirely new set of tools. However, metaWRAP should still work on data with eukaryotic content; my own data has algae in it and I never had issues. Can you provide some diagnostic information so we can look into why the pipeline "stops"?

theo-llewellyn commented 4 years ago

Hi, following on from this: I'm trying metaWRAP for binning the contigs of a lichen metagenome dataset I'm working on, which contains fungal, algal, bacterial and archaeal contigs. Although I won't be able to use CheckM to assess the quality of the fungal and algal contigs (I'll analyse them separately with BUSCO anyway), I really like the idea of being able to produce a consolidated set of bins from CONCOCT, MaxBin2 and MetaBAT2. I've got both the binning and bin_refinement modules to work. Is there a way you would recommend to refine the bins without removing the eukaryote bins? I've tried running with --skip-checkm and setting -c 0 -x 100, but I'm not sure whether that would produce meaningful results. Any recommendations would be much appreciated. Thanks for the great tool!

ursky commented 4 years ago

The "combining" is not a trivial process algorithmically: it relies heavily on the presence of prokaryotic universal marker genes when consolidating the different bin sets. For this reason, it is not possible to get eukaryotic MAG consolidation working well without a complete overhaul.

That being said, I personally deal with such mixed samples on a regular basis, so I can give some advice for a workaround. I run the metaWRAP refinement pipeline as I normally would, which typically results in somewhat messed-up Euk bins. Then I identify the Euk bins in the MetaBAT2 output (you can run BUSCO on all three bin sets and cherry-pick manually, but in my experience CONCOCT and MaxBin2 don't do so well on Euk bins) and pull those out into a separate bin set (analyze them with BUSCO and whatnot). Finally, I remove those eukaryotic contigs from the metaWRAP consolidated bin set to avoid redundant contigs (you will need to re-run CheckM on the updated set). Now you have two groups of bins from the assembly: the prokaryotic ones from metaWRAP and the eukaryotic ones from MetaBAT2. Like I said before, merging Euk bins from multiple sources is not in the scope of the project, but I hope that helps.
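
The last step of the workaround (dropping the eukaryotic contigs from the consolidated bins) could be scripted roughly as follows. This is only a minimal sketch, not part of metaWRAP: the function names, the `*.fa` file pattern and the directory names in the usage note are hypothetical, and it assumes that contig IDs (the first token of each FASTA header) are identical in the MetaBAT2 bins and the refined bins because both come from the same assembly.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def contig_id(header):
    """Assumption: the contig ID is the first whitespace-delimited token."""
    tokens = header.split()
    return tokens[0] if tokens else ""


def collect_contig_ids(fasta_paths):
    """Gather the IDs of all contigs found in the given (eukaryotic) bins."""
    ids = set()
    for path in fasta_paths:
        for header, _ in read_fasta(path):
            ids.add(contig_id(header))
    return ids


def strip_contigs(bin_dir, out_dir, drop_ids, pattern="*.fa"):
    """Copy every bin from bin_dir to out_dir, leaving out contigs in drop_ids.

    Returns {bin file name: number of contigs removed}. A bin that ends up
    empty is not written at all.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    removed = {}
    for bin_path in sorted(Path(bin_dir).glob(pattern)):
        kept, dropped = [], 0
        for header, seq in read_fasta(bin_path):
            if contig_id(header) in drop_ids:
                dropped += 1
            else:
                kept.append((header, seq))
        removed[bin_path.name] = dropped
        if kept:
            with open(out_dir / bin_path.name, "w") as out:
                for header, seq in kept:
                    out.write(f">{header}\n{seq}\n")
    return removed
```

For example, with the hand-picked Euk bins copied into a hypothetical `euk_bins/` folder, `strip_contigs("refined_bins", "refined_bins_no_euk", collect_contig_ids(Path("euk_bins").glob("*.fa")))` would write the cleaned set to `refined_bins_no_euk/`; the originals are left untouched, and CheckM still has to be re-run on the new set.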

theo-llewellyn commented 4 years ago

Hi, That's really helpful, thanks for the advice. I had wondered whether separating eukaryote and prokaryote bins may help so I'll definitely give it a go! Many Thanks, Theo

bxlab / metaWRAP

can BUSCO database be used as the criteria for Bin_refinement #60