Extra rows and taxaIDs?

susheelbhanu commented 1 month ago

Hi,

Firstly, thank you for this amazing tool! I have a question regarding possible duplicates when running name2taxid on a larger list.

My list (column: name) below contains 17836 elements:

>>> agrii_tax
             ASV    domain             phylum             class               order               family                   genus  species                    name
0          ASV_1  Bacteria        Chloroflexi            KD4-96                 NaN                  NaN                     NaN      NaN                  KD4-96
1          ASV_2  Bacteria  Verrucomicrobiota  Verrucomicrobiae  Chthoniobacterales  Chthoniobacteraceae  Candidatus Udaeobacter      NaN  Candidatus Udaeobacter
2          ASV_3  Bacteria         Firmicutes           Bacilli          Bacillales                  NaN                     NaN      NaN              Bacillales
3          ASV_4  Bacteria         Firmicutes           Bacilli          Bacillales          Bacillaceae                     NaN      NaN             Bacillaceae
4          ASV_5  Bacteria        Chloroflexi            KD4-96                 NaN                  NaN                     NaN      NaN                  KD4-96
...          ...       ...                ...               ...                 ...                  ...                     ...      ...                     ...
17831  ASV_17832  Bacteria                NaN               NaN                 NaN                  NaN                     NaN      NaN                Bacteria
17832  ASV_17833       NaN                NaN               NaN                 NaN                  NaN                     NaN      NaN                     NaN
17833  ASV_17834       NaN                NaN               NaN                 NaN                  NaN                     NaN      NaN                     NaN
17834  ASV_17835  Bacteria    Planctomycetota    Planctomycetes        Pirellulales        Pirellulaceae            Pir4 lineage      NaN            Pir4 lineage
17835  ASV_17836  Bacteria   Actinobacteriota    Actinobacteria       Micrococcales    Microbacteriaceae                     NaN      NaN       Microbacteriaceae

[17836 rows x 9 columns]

However, when I run the name2taxid conversion on them, I get the following:

>>> taxid_results = pytaxonkit.name2taxid(names)
>>> taxid_results
                         Name    TaxID    Rank
0                      KD4-96     <NA>    <NA>
1      Candidatus Udaeobacter  1921511   genus
2                  Bacillales     1385   order
3                 Bacillaceae   186817  family
4                      KD4-96     <NA>    <NA>
...                       ...      ...     ...
21965                Bacteria   629395   genus
21966                    <NA>     <NA>    <NA>
21967                    <NA>     <NA>    <NA>
21968            Pir4 lineage     <NA>    <NA>
21969       Microbacteriaceae    85023  family

The results have 21970 rows (index 0 to 21969), compared to 17836 in the input. Is it possible that some names are getting more than one taxid?

Thank you for your help with this, Susheel

standage commented 1 month ago

Hi @susheelbhanu.

What are the contents of the names variable? Can you confirm it is agrii_tax.name and indeed has 17836 elements?

I'm not sure what would cause the results to be larger than the input. Which versions of TaxonKit and PyTaxonKit do you have installed? You can check with pytaxonkit.__version__ and pytaxonkit.__taxonkitversion__.

susheelbhanu commented 1 month ago

Hey @standage,

Thanks for the quick reply. Here are the versions:

>>> pytaxonkit.__taxonkitversion__
'taxonkit v0.17.0'
>>> pytaxonkit.__version__
'0.8'

And this is what names contains:

>>> names[:10]
['KD4-96', 'Candidatus Udaeobacter', 'Bacillales', 'Bacillaceae', 'KD4-96', 'Candidatus Udaeobacter', 'Candidatus Nitrocosmicus', 'Micrococcaceae', 'MB-A2-108', 'Gaiella']

standage commented 1 month ago

What is the length of names?

susheelbhanu commented 1 month ago

17835

standage commented 1 month ago

So it has 1 less element than agrii_tax, which has 17836 rows?

susheelbhanu commented 1 month ago

Sorry, typo.

>>> length_of_names = len(names)
>>>
>>> print("Length of names:", length_of_names)
Length of names: 17836

standage commented 1 month ago

This is unexpected behavior indeed. I'm not sure there's much more I can do unless you can share the entire contents of names in a text file. I'll note that there appear to be quite a few NaN names, which will give empty results. But if you are trying to maintain the correct shape of your data, I understand you may not want to drop those values. I don't think this is causing the issue, but again, I don't have enough information at the moment to be sure.
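In the meantime, a quick check along these lines could show which names account for the extra rows. This is only a sketch: expanded_names is a hypothetical helper, the two lists are toy stand-ins for names and for taxid_results["Name"].tolist(), and it assumes missing values have already been replaced with a string placeholder.

```python
from collections import Counter


def expanded_names(input_names, result_names):
    """Map each name that has more result rows than input rows
    to a (count_in_input, count_in_results) tuple."""
    n_in = Counter(input_names)
    n_out = Counter(result_names)
    return {
        name: (n_in[name], n_out[name])
        for name in n_out
        if n_out[name] > n_in[name]
    }


# Toy stand-ins (assumption): in practice these would be `names` and
# `taxid_results["Name"].tolist()`.
input_names = ["Bacillales", "Bacteria", "KD4-96"]
result_names = ["Bacillales", "Bacteria", "Bacteria", "Bacteria", "KD4-96"]

print(expanded_names(input_names, result_names))
```

Any name reported here contributes more output rows than it has input rows, which is where the size difference comes from.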

susheelbhanu commented 1 month ago

Thank you, I'm happy to share the file later tonight. And yes, I'm trying to keep the shape so as to merge it later with another file.

Appreciate your help with this!

susheelbhanu commented 1 month ago

Here's the file (TaxaId16s.csv) and how I build the names list:

import pandas as pd
import pytaxonkit

# reading in the 16S taxa
agrii_tax = pd.read_csv("TaxaId16s.csv", header = 0)

# dropping the unnamed column
agrii_tax = agrii_tax.drop(columns=['Unnamed: 0'])

# Rename the 'ASVrank' column to 'ASV'
agrii_tax = agrii_tax.rename(columns={'ASVrank': 'ASV'})

# Move 'ASV' to the first column
cols = ['ASV'] + [col for col in agrii_tax.columns if col != 'ASV']
agrii_tax = agrii_tax[cols]

# Create a new column 'name' holding the most specific non-NaN rank in each row
agrii_tax['name'] = agrii_tax[['species', 'genus', 'family', 'order', 'class', 'phylum', 'domain']].bfill(axis=1).iloc[:, 0]

# Replace NaN values in the 'name' column with 'unclassified'
# (plain assignment; in-place fillna on a selected column is deprecated in recent pandas)
agrii_tax['name'] = agrii_tax['name'].fillna('unclassified')

# Extract the 'name' column from your DataFrame
names = agrii_tax['name'].tolist()

# Run pytaxonkit.name2taxid with the names
taxid_results = pytaxonkit.name2taxid(names)

# To view the results
print(taxid_results)
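
To illustrate what the name column ends up holding, here is a toy version of the same two steps. The three rows are made up for illustration and are not read from TaxaId16s.csv; the fillna step uses plain assignment, which is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

# Ranks ordered from most to least specific, as in the script
ranks = ["species", "genus", "family", "order", "class", "phylum", "domain"]

# Toy rows (assumption): classified to family, classified to domain only,
# and not classified at all
toy = pd.DataFrame(
    {
        "domain": ["Bacteria", "Bacteria", np.nan],
        "phylum": ["Firmicutes", np.nan, np.nan],
        "class": ["Bacilli", np.nan, np.nan],
        "order": ["Bacillales", np.nan, np.nan],
        "family": ["Bacillaceae", np.nan, np.nan],
        "genus": [np.nan, np.nan, np.nan],
        "species": [np.nan, np.nan, np.nan],
    },
    dtype=object,
)

# Back-fill across columns so the first column carries the most specific
# rank that is present, then label fully empty rows
toy["name"] = toy[ranks].bfill(axis=1).iloc[:, 0]
toy["name"] = toy["name"].fillna("unclassified")

print(toy["name"].tolist())
```

Many rows can share the same name this way, so the names list contains repeats by design.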

Thank you!

standage commented 1 month ago

Ok, I understand the issue a bit better now. It doesn't appear to be an issue with TaxonKit or PyTaxonKit, but an artifact of the NCBI Taxonomy.

To investigate, I discarded all of the unclassified values, kept the remaining unique values, and performed the name2taxid query. As with your example, the output was larger than the input.

>>> mynames = list(set([n for n in names if n != "unclassified"]))
>>> len(mynames)
827
>>> taxid_results = pytaxonkit.name2taxid(mynames)
>>> taxid_results
                     Name    TaxID   Rank
0           Aeromicrobium     2040  genus
1            Pir2 lineage     <NA>   <NA>
2       Streptosporangium     2000  genus
3             Pedosphaera  1032526  genus
4         Polycyclovorans  1274363  genus
..                    ...      ...    ...
843             Duganella    75654  genus
844             Emticicia   312278  genus
845  Pleurocapsa PCC-7319     <NA>   <NA>
846            GWC2-45-44     <NA>   <NA>
847      Cyanobacteriales     <NA>   <NA>

[848 rows x 3 columns]

So there must be some duplicated values. I found them with the following code.

>>> taxid_results[taxid_results.Name.duplicated(keep=False)].sort_values("Name")
               Name    TaxID          Rank
418  Actinobacteria   201174        phylum
417  Actinobacteria   201174        phylum
762         Archaea     2157  superkingdom
761         Archaea     2157  superkingdom
214        Bacillus     1386         genus
215        Bacillus    55087         genus
830        Bacteria        2  superkingdom
831        Bacteria        2  superkingdom
832        Bacteria   629395         genus
82            Bosea    85413         genus
83            Bosea   169215         genus
768     Chloroflexi   200795        phylum
767     Chloroflexi    32061         class
568   Cyanobacteria     1117        phylum
567   Cyanobacteria     1117        phylum
177    Diplosphaera   381755         genus
178    Diplosphaera  1148783         genus
331      Firmicutes     1239        phylum
332      Firmicutes     1239        phylum
736        Gordonia    79255         genus
735        Gordonia     2053         genus
416          Labrys  2066135         genus
415          Labrys   204476         genus
584      Leptothrix       88         genus
585      Leptothrix  1907117         genus
758      Longispora   203522         genus
759      Longispora  2759766         genus
380      Nitrospira     1234         genus
381      Nitrospira   203693         class
187      Paracoccus      265         genus
188      Paracoccus   249411         genus
802  Planctomycetes      112         order
803  Planctomycetes   203682        phylum
804  Planctomycetes   203683         class
792  Proteobacteria     1224        phylum
793  Proteobacteria     1224        phylum
227     Rhodococcus     1827         genus
228     Rhodococcus  1661425         genus
311      Syntrophus  1671858         genus
310      Syntrophus    43773         genus

It turns out that some of these names are associated with multiple entries in the NCBI taxonomy files (names.dmp). Some of these entries are redundant (same name from different sources with the same taxid) while some names actually refer to different taxa. I'm afraid that resolving these nomenclature issues to identify the "correct" taxid for each name is outside the scope of pytaxonkit.
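
If the goal is to keep the original shape for a later merge, one possible downstream approach is sketched below. It is not a pytaxonkit feature, only plain pandas under stated assumptions: the frames are toy stand-ins (the taxids are the ones shown above, the ASV labels are made up), exact (Name, TaxID) repeats are treated as redundant, and any name with more than one distinct taxid is left unresolved and flagged rather than guessed.

```python
import pandas as pd

# Toy stand-in for the frame returned by name2taxid (assumption)
taxid_results = pd.DataFrame(
    {
        "Name": ["KD4-96", "Firmicutes", "Firmicutes",
                 "Bacillus", "Bacillus", "Aeromicrobium"],
        "TaxID": pd.array([pd.NA, 1239, 1239, 1386, 55087, 2040], dtype="Int64"),
        "Rank": [pd.NA, "phylum", "phylum", "genus", "genus", "genus"],
    }
)

# Toy stand-in for the original table (assumption)
agrii_tax = pd.DataFrame(
    {
        "ASV": ["ASV_1", "ASV_2", "ASV_3", "ASV_4", "ASV_5"],
        "name": ["KD4-96", "Firmicutes", "Bacillus", "Firmicutes", "Aeromicrobium"],
    }
)

# 1. Drop redundant rows: same name mapped to the same taxid
dedup = taxid_results.drop_duplicates(subset=["Name", "TaxID"])

# 2. Find names that still map to more than one distinct taxid
ids_per_name = dedup.groupby("Name")["TaxID"].nunique()
ambiguous = sorted(ids_per_name[ids_per_name > 1].index)

# 3. Keep only names with at most one taxid, so each name has one row
resolved = dedup[~dedup["Name"].isin(ambiguous)]

# 4. Left-merge onto the original table; validate guards against
#    any remaining duplicate names on the right-hand side
merged = agrii_tax.merge(
    resolved, how="left", left_on="name", right_on="Name",
    validate="many_to_one",
)
merged["ambiguous"] = merged["name"].isin(ambiguous)

print(merged[["ASV", "name", "TaxID", "Rank", "ambiguous"]])
```

The merged table has one row per input row, and the flagged names are the ones that still need a manual decision about which taxon is meant.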

susheelbhanu commented 1 month ago

Thank you so much @standage! I understand that it's outside the scope, but at least it's good to know where the discrepancy is coming from.

Appreciate your prompt help with this.

Best, Susheel

bioforensics / pytaxonkit

Extra rows and taxaIDs? #38