Enable binning of eukaryotic genomes

jason-c-kwan commented 4 years ago

Since we are not experts on eukaryotes, perhaps @hyphaltip would have some useful suggestions on how to implement this, but here are my thoughts:

We would need to use a eukaryotic gene finder like Augustus. However, it probably wouldn't be incredibly accurate without RNA data (although I guess it could be an option to include that), and I don't know if anyone has ever tried it on metagenomes.
There is a standard gene set that eukaryotic genomics efforts use to determine how complete their genome is. Sorry I don't have a citation to hand. However, while this would give us estimated completeness, I'm not sure whether it would give us purity because I'm not sure if they are all single copy.
As outlined in the NSF proposal we are planning on checking the taxonomic congruence of single-copy markers we find in bacteria/archaea. So a similar method could be used to estimate purity of eukaryotic bins, perhaps?
It might be a good idea to include contigs unclassified on the kingdom level in the analysis. I have long suspected that a lot of the eukaryotic portion ends up there because there are relatively fewer eukaryotic genomes in the NCBI database.
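To make the completeness/purity distinction above concrete, here is a minimal sketch under stated assumptions. It assumes completeness is the percentage of the marker set found at least once in a bin, and purity is the percentage of found markers that occur exactly once. The function name, the input format (one marker ID per hit), and the marker names are hypothetical and not taken from any existing eukaryotic workflow.

```python
from collections import Counter
from typing import Iterable, Tuple


def completeness_and_purity(
    marker_hits: Iterable[str], marker_set_size: int
) -> Tuple[float, float]:
    """Estimate bin completeness and purity from marker hits.

    marker_hits: one marker ID per hit found in the bin; a marker that
        appears more than once is counted as multi-copy.
    marker_set_size: total number of markers in the reference set.
    Both return values are percentages.
    """
    if marker_set_size <= 0:
        raise ValueError("marker_set_size must be positive")
    counts = Counter(marker_hits)
    present = len(counts)  # distinct markers seen at least once
    single_copy = sum(1 for n in counts.values() if n == 1)
    completeness = 100.0 * present / marker_set_size
    # Assumption: purity is undefined without markers; report 0.0 here.
    purity = 100.0 * single_copy / present if present else 0.0
    return completeness, purity


# Hypothetical bin: marker "m2" was found twice, "m4" was not found.
hits = ["m1", "m2", "m2", "m3"]
completeness, purity = completeness_and_purity(hits, marker_set_size=4)
print(f"completeness={completeness:.1f}% purity={purity:.1f}%")
```

This only gives a meaningful purity if the markers really are single copy in the target lineage, which is exactly the open question raised above.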

There are a few reasons to do this: I think it would broaden the appeal of Autometa, it would be an interesting project for a student or outside contributor, and I have been asked about this several times at meetings.

If an outside contributor is interested in this, please let us know, because it might be better to work together. Also, please base PRs off dev rather than main, because dev is pretty different right now (it is Python 3 and most of the code has been refactored).

evanroyrees commented 4 years ago

There is a standard gene set that eukaryotic genomics efforts use to determine how complete their genome is. Sorry I don't have a citation to hand. However, while this would give us estimated completeness, I'm not sure whether it would give us purity because I'm not sure if they are all single copy.

Perhaps BUSCO is a tool for this?

hyphaltip commented 4 years ago

Yes BUSCO sets would make sense. These are HMMs with an expected length and a bitscore cutoff that was calibrated to avoid overcalling paralogs.
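As a rough illustration of how per-marker cutoffs like these could be applied to HMM hits, here is a minimal sketch. The `Hit` and `MarkerCutoff` classes, the `length_sd` field, and the `max_sd` tolerance are all assumptions made for this example. It is not BUSCO's actual implementation or file format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    marker: str      # marker (HMM) identifier
    contig: str      # contig the predicted protein came from
    bitscore: float  # HMM search bitscore
    length: int      # length of the matched protein


@dataclass
class MarkerCutoff:
    min_bitscore: float     # hits scoring below this are discarded
    expected_length: float  # expected protein length for the marker
    length_sd: float        # assumed spread of lengths (hypothetical field)


def filter_hits(
    hits: List[Hit], cutoffs: Dict[str, MarkerCutoff], max_sd: float = 2.0
) -> List[Hit]:
    """Keep hits that pass the marker's bitscore and length checks."""
    kept = []
    for hit in hits:
        cutoff = cutoffs.get(hit.marker)
        if cutoff is None:
            continue  # marker is not in the reference set
        if hit.bitscore < cutoff.min_bitscore:
            continue  # too weak, possibly a paralog or a fragment
        if abs(hit.length - cutoff.expected_length) > max_sd * cutoff.length_sd:
            continue  # length too far from what is expected
        kept.append(hit)
    return kept


# Hypothetical cutoffs and hits, for illustration only.
cutoffs = {"b1": MarkerCutoff(min_bitscore=100.0, expected_length=300.0, length_sd=20.0)}
hits = [
    Hit("b1", "c1", 150.0, 310),  # passes both checks
    Hit("b1", "c2", 80.0, 305),   # fails the bitscore cutoff
    Hit("b1", "c3", 200.0, 500),  # fails the length check
    Hit("b2", "c4", 300.0, 300),  # unknown marker
]
kept = filter_hits(hits, cutoffs)
print([h.contig for h in kept])
```

The surviving hits could then be counted per marker to estimate completeness and copy number for each bin.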

We can try a couple of euk predictors. One thing we do in funannotate is train gene predictors with BUSCO gene sets. I'd be willing to try a couple of scenarios. We have low-complexity lichen datasets (only 1-2 eukaryotes each) that might be a good test set to try this on.

hyphaltip commented 4 years ago

Sorry to be slow on this. I didn't see the mention and the summer has been crazy. But I'd love to work on this some with you.

evanroyrees commented 4 years ago

Hi @hyphaltip, thanks for your willingness to help out on this. I've put together a couple of links regarding the comments above. Are there any other euk predictors you would suggest? If so, would you mind listing them? Thanks!

Resources

Datasets available from BUSCO - link
Funannotate GH page - link

tderond commented 4 years ago

This would be super cool. I'd use this feature if it's implemented.

evanroyrees commented 3 years ago

Hi @hyphaltip, could you provide a link to these test datasets? I think we have a few members in the lab that would be interested in trying to tackle this as a little side project.

KwanLab / Autometa

Enable binning of eukaryotic genomes #95
