Open jason-c-kwan opened 4 years ago
There is a standard gene set that eukaryotic genomics efforts use to determine how complete their genome is. Sorry I don't have a citation to hand. However, while this would give us estimated completeness, I'm not sure whether it would give us purity because I'm not sure if they are all single copy.
Perhaps BUSCO is a tool for this?
Yes, BUSCO sets would make sense. These are HMMs with an expected length and a bitscore cutoff that was calibrated to avoid overcalling paralogs.
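To make the completeness/purity question concrete, here is a rough Python sketch of how per-marker bitscore cutoffs could feed both numbers. Everything in it is an assumption for illustration: the marker names, the cutoff values, the `marker_metrics` helper, and the definitions themselves (completeness as the share of expected markers detected, purity as the share of detected markers found in exactly one copy). It is not BUSCO's or Autometa's actual implementation.

```python
from collections import Counter


def marker_metrics(hits, cutoffs):
    """Estimate completeness and purity (both in percent) for one bin.

    hits    -- iterable of (marker_id, bitscore) pairs (hypothetical input format)
    cutoffs -- dict mapping each expected marker_id to its minimum bitscore
    """
    # Count only hits to expected markers that meet the per-marker cutoff.
    counts = Counter(
        marker
        for marker, score in hits
        if marker in cutoffs and score >= cutoffs[marker]
    )
    present = len(counts)
    single_copy = sum(1 for n in counts.values() if n == 1)
    # Assumed definitions: see the paragraph above the code.
    completeness = 100.0 * present / len(cutoffs) if cutoffs else 0.0
    purity = 100.0 * single_copy / present if present else 0.0
    return completeness, purity


# Made-up example: four expected markers, one duplicated, one below its cutoff.
cutoffs = {"m1": 50, "m2": 80, "m3": 60, "m4": 40}
hits = [("m1", 75), ("m2", 90), ("m2", 85), ("m3", 30), ("m4", 55)]
completeness, purity = marker_metrics(hits, cutoffs)
print(f"completeness={completeness:.1f}% purity={purity:.1f}%")
```

If the marker set is not strictly single copy, a duplicated marker lowers the purity estimate even for a clean bin, which is the concern raised at the top of the thread.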
We can try a couple of euk predictors. One thing we do in funannotate is train gene predictors with BUSCO gene sets. I'd be willing to try a couple of scenarios. We have some low-complexity (only 1-2 eukaryotes) lichen datasets that might be a good test set to try this on.
Sorry to be slow on this. I didn't see the mention and the summer has been crazy. But I'd love to work on this some with you.
This would be super cool. I'd use this feature if it's implemented.
Hi @hyphaltip, could you provide a link to these test datasets? I think we have a few members in the lab that would be interested in trying to tackle this as a little side project.
Since we are not eukaryotic experts perhaps @hyphaltip would have some useful suggestions on how to implement this, but here are my thoughts:
One advantage of doing this is that I think it would broaden the appeal of Autometa. It would also be an interesting project for a student or outside contributor, and I have been asked about this several times at meetings.
If an outside contributor is interested in this, please let us know, because it might be better to work together. Also, please base PRs off the dev branch rather than main, because dev is pretty different right now (it is Python 3 and most of the code has been refactored).