Closed LanderDC closed 2 months ago
I tested this on my own data and it seems to give the expected output without issues.
Thank you for working on this! I'll take a look at it within the next couple of days.
This feature would make my life so much easier!
I'm very sorry I forgot about this PR. I just pushed a commit to simplify the code, avoiding the need for an extra function and a new dictionary. I also skipped the genus and species ranks, since geNomad won't assign genomes to those ranks anyway.
Now I need to:
- `Viruses;` to the beginning of the string (i.e. are there cases where the majority vote would not have `no rank` as the highest node?)
No problem!
To comment on your second bullet point: it has been a while, but from my understanding the `no rank` (from genomad's DB `nodes.dmp`) corresponds to `root` (from `names.dmp`). The resulting dictionary from `taxopy`'s `rank_name_dictionary` does not have this `no rank`/`root` classification, while `taxopy`'s `name_lineage` (currently being used by `genomad`) does have it. Currently, `root` is replaced by `Viruses` (see: https://github.com/apcamargo/genomad/blob/af91a6a1e9defa4dd70e225c7c8efdc67a644bf0/genomad/taxonomy.py#L61-L62), which is effectively the same as prepending `Viruses` (because the `root` classification is present in the `name_lineage` lists of all taxids).
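To illustrate the equivalence described above, here is a minimal sketch that uses plain Python lists and dictionaries as stand-ins for `taxopy`'s `name_lineage` and `rank_name_dictionary`. The lineage values, both function names, and the assumption that the output string runs from the highest to the lowest taxon are illustrative only; this is not the code in `taxonomy.py`.

```python
# Stand-ins for the two taxopy attributes (illustrative values, not read from geNomad's database).
# name_lineage runs from the taxon itself up to "root".
name_lineage = ["Caudoviricetes", "Uroviricota", "Heunggongvirae", "Duplodnaviria", "root"]
# rank_name_dictionary has no "no rank"/"root" entry.
rank_name_dictionary = {
    "class": "Caudoviricetes",
    "phylum": "Uroviricota",
    "kingdom": "Heunggongvirae",
    "realm": "Duplodnaviria",
}


def lineage_from_name_lineage(names):
    """Hypothetical helper: swap "root" for "Viruses", highest taxon first."""
    return ";".join("Viruses" if name == "root" else name for name in reversed(names))


def lineage_from_rank_dictionary(rank_names):
    """Hypothetical helper: prepend "Viruses" to the ranked names, highest taxon first."""
    return ";".join(["Viruses", *reversed(list(rank_names.values()))])


old_style = lineage_from_name_lineage(name_lineage)
new_style = lineage_from_rank_dictionary(rank_name_dictionary)
print(old_style)
print(new_style)
```

Under these assumptions both calls produce the same string, which is why replacing `root` and prepending `Viruses` are interchangeable here.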
The only problem I saw was that if the majority taxon call was equal to `1` (`root`/`no rank`), `taxopy`'s `rank_name_dictionary` would return an empty dictionary, which would give problems downstream (that's why I added the `add_empty_keys` function). But all of that also seems to be solved with your changes.
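One way to make the empty-dictionary case harmless is to look up every expected rank with a default value. The sketch below is a guess at the general idea, not the `add_empty_keys` function from this PR nor the simplified code that replaced it; the `RANKS` tuple and the `format_lineage` name are assumptions made for illustration.

```python
# Assumed list of ranks to report, highest first (genus and species are skipped).
RANKS = ("realm", "kingdom", "phylum", "class", "order", "family")


def format_lineage(rank_names):
    """Hypothetical helper: build a lineage string with one field per rank.

    Ranks missing from rank_names become empty fields, so an empty
    dictionary (majority taxon equal to root) still yields a valid string.
    """
    fields = [rank_names.get(rank, "") for rank in RANKS]
    return ";".join(["Viruses", *fields])


root_only = format_lineage({})
partial = format_lineage({"class": "Caudoviricetes", "phylum": "Uroviricota"})
print(root_only)  # Viruses;;;;;;
print(partial)    # Viruses;;;Uroviricota;Caudoviricetes;;
```

Using `dict.get` with a default keeps the number of fields constant regardless of how many ranks the majority taxon has.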
I'm done with updating the documentation. This PR will be merged as soon as I get some time to prepare the new release :)
PR for issue #39. Changes to `taxonomy.py` include:
- `taxopy`'s `rank_name_dictionary` instead of `name_lineage` on the majority taxon.
- `nodes.dmp`) ordered from lowest to highest taxon level.