apcamargo / taxopy

A Python package for obtaining complete lineages and the lowest common ancestor (LCA) from a set of taxonomic identifiers
https://apcamargo.github.io/taxopy
GNU General Public License v3.0

Supporting edge contraction #9

Open Midnighter opened 2 years ago

Midnighter commented 2 years ago

To make the NCBI taxonomy more comparable with another taxonomy that consists only of the ranks domain, phylum, clade, order, family, genus, and species, I would like to contract it to those specific ranks. This should be possible with a smart use of edge contraction and/or node/edge addition/deletion as necessary.
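
A minimal sketch of the contraction idea, assuming the taxonomy is available as two plain dictionaries (child to parent, and node to rank) in which the root is its own parent. The function name `contract_to_ranks`, the toy identifiers, and the rank labels are hypothetical and are not part of taxopy; this only illustrates one possible approach.

```python
def contract_to_ranks(parent, rank, keep_ranks, root):
    """Return a child -> parent mapping restricted to nodes whose rank is kept.

    Nodes with other ranks are dropped, and each kept node is re-attached
    to its nearest kept ancestor (or to the root if there is none).
    """
    keep = set(keep_ranks)
    contracted = {root: root}
    for node in parent:
        if node == root or rank.get(node) not in keep:
            continue
        # Walk up until we reach a kept rank or the root.
        ancestor = parent[node]
        while ancestor != root and rank.get(ancestor) not in keep:
            ancestor = parent[ancestor]
        contracted[node] = ancestor
    return contracted


# Toy taxonomy with made-up identifiers (not real NCBI taxids).
parent = {1: 1, 10: 1, 20: 10, 30: 20, 40: 30, 50: 40}
rank = {
    1: "no rank",
    10: "domain",
    20: "phylum",
    30: "subphylum",
    40: "class",
    50: "order",
}

contracted = contract_to_ranks(parent, rank, {"domain", "phylum", "order"}, root=1)
print(contracted)
```

In this toy case the "subphylum" and "class" nodes disappear and the "order" node is attached directly to the "phylum" node.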

Do you see this as a useful enhancement? Similarly, I've been converting the taxonomy to a networkx DiGraph in order to get the subtree from a specific node.
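
For illustration, here is a dependency-free sketch of the subtree extraction, assuming the same kind of child-to-parent dictionary with a self-parented root. The function name `subtree` and the toy identifiers are hypothetical. With a networkx `DiGraph` whose edges point from parent to child, `nx.descendants(graph, node)` would give the same set of nodes minus the starting node.

```python
from collections import defaultdict, deque


def subtree(parent, start):
    """Return `start` and all of its descendants in breadth-first order."""
    # Invert the child -> parent mapping into parent -> children.
    children = defaultdict(list)
    for child, par in parent.items():
        if child != par:  # skip the root's self-loop
            children[par].append(child)

    nodes = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        nodes.append(node)
        queue.extend(children.get(node, []))
    return nodes


# Toy taxonomy with made-up identifiers (not real NCBI taxids).
toy_parent = {1: 1, 10: 1, 11: 1, 20: 10, 21: 10, 30: 20}
nodes_under_10 = subtree(toy_parent, 10)
print(nodes_under_10)
```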

apcamargo commented 2 years ago

Yes, I think that's useful. Good idea.

I've considered storing the TaxDb data in graphs before, but gave up on the idea to avoid adding another requirement (networkx). But database manipulation is a great argument in favor of using graphs. I would need to evaluate memory usage first, though.

The major limitation right now is that I don't have enough time to implement more features. I want to dedicate more time to taxopy, but for the next couple of months I'll be busy with other projects.

If you need this feature soon, my suggestion is to use taxonkit's reformat command to generate a tabular file with the ranks you are interested in. You can then use this as input for taxonkit create-taxdump, which will generate a new taxdump you can load into taxopy.

Midnighter commented 2 years ago

Cheers, that's a great idea to use taxonkit for this.

With regard to adding new features, I know the pain of limited time, so I understand completely. If you ever look into graphs, graph-tool might be a more performant option than networkx.

apcamargo commented 2 years ago

Thanks for your suggestion!

TaxonKit is super useful. I use it combined with taxopy all the time.