apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

Include empty fields in taxonomy for easier parsing #66

Closed LanderDC closed 2 months ago

LanderDC commented 6 months ago

PR for issue #39. Changes to taxonomy.py include:

LanderDC commented 6 months ago

I tested this on my own data and it seems to give the expected output without issues.

apcamargo commented 6 months ago

Thank you for working on this! I'll take a look at it within the next couple of days.

nikolasbasler commented 3 months ago

This feature would make my life so much easier!

apcamargo commented 3 months ago

I'm very sorry, I forgot about this PR. I just pushed a commit that simplifies the code, avoiding the need for an extra function and a new dictionary. I also skipped the genus and species ranks, since geNomad won't assign genomes to those ranks anyway.

Now I need to:

LanderDC commented 3 months ago

No problem!

To comment on your second bullet point: it has been a while, but as I understand it, the no rank entry (from genomad's DB nodes.dmp) corresponds to root (from names.dmp). The dictionary returned by taxopy's rank_name_dictionary does not include this no rank/root classification, whereas taxopy's name_lineage (currently used by genomad) does. Currently, root is replaced by Viruses (see: https://github.com/apcamargo/genomad/blob/af91a6a1e9defa4dd70e225c7c8efdc67a644bf0/genomad/taxonomy.py#L61-L62), which is effectively the same as appending Viruses, because the root classification is present in the name_lineage list of every taxid.

The only problem I saw was that if the majority taxon call was equal to 1 (root/no rank), taxopy's rank_name_dictionary would return an empty dictionary, which would cause problems downstream (that's why I added the add_empty_keys function). But all of that also seems to be solved by your changes.
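
For readers following along, the padding idea discussed above can be sketched roughly as follows. This is a minimal illustration, not the code in taxonomy.py: the function name fixed_width_lineage, the RANKS tuple, and the ";" separator are all assumptions made for the example, and the input is a plain rank-to-name dictionary standing in for what taxopy's rank_name_dictionary returns.

```python
# Hypothetical rank order for the example; genus and species are left out,
# matching the decision described earlier in this thread.
RANKS = ("realm", "kingdom", "phylum", "class", "order", "family")


def fixed_width_lineage(rank_names, ranks=RANKS, root_label="Viruses", sep=";"):
    """Return a lineage string with exactly one field per rank.

    Ranks missing from `rank_names` become empty fields, so every lineage
    has the same number of separators and can be split by position.
    An empty dictionary (e.g. a root/no rank call) yields only the root label
    followed by empty fields.
    """
    fields = [rank_names.get(rank, "") for rank in ranks]
    return sep.join([root_label, *fields])


partial = {
    "realm": "Duplodnaviria",
    "kingdom": "Heunggongvirae",
    "phylum": "Uroviricota",
    "class": "Caudoviricetes",
}
print(fixed_width_lineage(partial))
# Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes;;
print(fixed_width_lineage({}))
# Viruses;;;;;;
```

Because dict.get supplies the empty string for any missing rank, the empty-dictionary case needs no special handling, which is one way to avoid a separate helper for adding empty keys.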

apcamargo commented 2 months ago

I'm done updating the documentation. This PR will be merged as soon as I get some time to prepare the new release :)