Open susheelbhanu opened 1 month ago
Hi @susheelbhanu.
What are the contents of the names variable? Can you confirm it is agrii_tax.name and indeed has 17836 elements?
I'm not sure what would cause the results to be larger than the input. Which versions of TaxonKit and PyTaxonKit do you have installed? You can check with pytaxonkit.__version__ and pytaxonkit.__taxonkitversion__.
Hey @standage,
Thanks for the quick reply. Here are the versions:
>>> pytaxonkit.__taxonkitversion__
'taxonkit v0.17.0'
>>> pytaxonkit.__version__
'0.8'
And this is what names contains
>>> names[:10]
['KD4-96', 'Candidatus Udaeobacter', 'Bacillales', 'Bacillaceae', 'KD4-96', 'Candidatus Udaeobacter', 'Candidatus Nitrocosmicus', 'Micrococcaceae', 'MB-A2-108', 'Gaiella']
What is the length of names?
17835
So it has one fewer element than agrii_tax, which has 17836 rows?
Sorry typo..
>>> length_of_names = len(names)
>>>
>>> print("Length of names:", length_of_names)
Length of names: 17836
This is unexpected behavior indeed. I'm not sure there's much more I can do unless you can share the entire contents of names in a text file. I'll note that there appear to be quite a few NaN names, which will give empty results. But if you are trying to maintain the correct shape of your data, I understand you may not want to drop those values. I don't think this is causing the issue, but again, I don't have enough information at the moment to be sure.
Thank you, I'm happy to share the file later tonight. And yes, I'm trying to keep the shape so as to merge it later with another file.
Appreciate your help with this!
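For the shape-preserving goal mentioned here, one option is to treat the lookup results as a dictionary rather than as a row-for-row replacement. The sketch below is only an illustration: attach_taxids is a hypothetical helper (not part of pytaxonkit), and it assumes the results DataFrame has Name and TaxID columns like the name2taxid output shown in this thread. Names that return no taxid, or more than one distinct taxid, are left missing instead of adding rows.

```python
import pandas as pd


def attach_taxids(names: pd.Series, results: pd.DataFrame) -> pd.Series:
    """Return one taxid (or <NA>) per input name, in input order.

    Hypothetical helper; `results` is assumed to have "Name" and "TaxID" columns.
    """
    # Drop names with no match, then collapse repeated (Name, TaxID) pairs.
    hits = results.dropna(subset=["TaxID"]).drop_duplicates(subset=["Name", "TaxID"])
    # A name that is still repeated maps to several distinct taxids: skip it.
    unambiguous = hits[~hits["Name"].duplicated(keep=False)]
    lookup = unambiguous.set_index("Name")["TaxID"].astype("int64")
    # Series.map yields exactly one value per input element, so the length is kept.
    return names.map(lookup).astype("Int64")


# Small made-up example; the taxids are taken from outputs quoted in this thread.
names = pd.Series(["Bacillus", "Aeromicrobium", "unclassified", "Firmicutes", "Aeromicrobium"])
results = pd.DataFrame(
    {
        "Name": ["Bacillus", "Bacillus", "Aeromicrobium", "unclassified", "Firmicutes", "Firmicutes"],
        "TaxID": pd.array([1386, 55087, 2040, None, 1239, 1239], dtype="Int64"),
    }
)
taxids = attach_taxids(names, results)
print(taxids)
```

Because the result has the same length and order as names, it can be assigned straight back as a new column (for example agrii_tax['taxid'] = attach_taxids(agrii_tax['name'], taxid_results)). Leaving ambiguous names missing is a deliberate choice in this sketch; picking one taxid automatically would need extra information such as the expected rank or lineage.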
Here's the file and how I get the names list.
TaxaId16s.csv
import pytaxonkit, os
import pandas as pd
# reading in the 16S taxa
agrii_tax = pd.read_csv("TaxaId16s.csv", header = 0)
# dropping the unnamed column
agrii_tax = agrii_tax.drop(columns=['Unnamed: 0'])
# Rename the 'ASVrank' column to 'ASV'
agrii_tax = agrii_tax.rename(columns={'ASVrank': 'ASV'})
# Move 'ASV' to the first column
cols = ['ASV'] + [col for col in agrii_tax.columns if col != 'ASV']
agrii_tax = agrii_tax[cols]
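# Create a new column 'name' holding the most specific non-NaN rank in each row
# (species if present, otherwise genus, family, ... up to domain)
agrii_tax['name'] = agrii_tax[['species', 'genus', 'family', 'order', 'class', 'phylum', 'domain']].bfill(axis=1).iloc[:, 0]
# Replace NaN values in the 'name' column with 'unclassified'
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
agrii_tax['name'] = agrii_tax['name'].fillna('unclassified')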
# Extract the 'name' column from your DataFrame
names = agrii_tax['name'].tolist()
# Run pytaxonkit.name2taxid with the names
taxid_results = pytaxonkit.name2taxid(names)
# To view the results
print(taxid_results)
Thank you!
Ok, I understand the issue a bit better now. It doesn't appear to be an issue with TaxonKit or PyTaxonKit, but an artifact of the NCBI Taxonomy.
To investigate, I discarded all of the unclassified values, kept the remaining unique values, and performed the name2taxid query. As with your example, the output was larger than the input.
>>> mynames = list(set([n for n in names if n != "unclassified"]))
>>> len(mynames)
827
>>> taxid_results = pytaxonkit.name2taxid(mynames)
>>> taxid_results
                     Name    TaxID   Rank
0           Aeromicrobium     2040  genus
1            Pir2 lineage     <NA>   <NA>
2       Streptosporangium     2000  genus
3             Pedosphaera  1032526  genus
4         Polycyclovorans  1274363  genus
..                    ...      ...    ...
843             Duganella    75654  genus
844             Emticicia   312278  genus
845  Pleurocapsa PCC-7319     <NA>   <NA>
846            GWC2-45-44     <NA>   <NA>
847      Cyanobacteriales     <NA>   <NA>

[848 rows x 3 columns]
So there must be some duplicated values. I found them with the following code.
>>> taxid_results[taxid_results.Name.duplicated(keep=False)].sort_values("Name")
               Name    TaxID          Rank
418  Actinobacteria   201174        phylum
417  Actinobacteria   201174        phylum
762         Archaea     2157  superkingdom
761         Archaea     2157  superkingdom
214        Bacillus     1386         genus
215        Bacillus    55087         genus
830        Bacteria        2  superkingdom
831        Bacteria        2  superkingdom
832        Bacteria   629395         genus
82            Bosea    85413         genus
83            Bosea   169215         genus
768     Chloroflexi   200795        phylum
767     Chloroflexi    32061         class
568   Cyanobacteria     1117        phylum
567   Cyanobacteria     1117        phylum
177    Diplosphaera   381755         genus
178    Diplosphaera  1148783         genus
331      Firmicutes     1239        phylum
332      Firmicutes     1239        phylum
736        Gordonia    79255         genus
735        Gordonia     2053         genus
416          Labrys  2066135         genus
415          Labrys   204476         genus
584      Leptothrix       88         genus
585      Leptothrix  1907117         genus
758      Longispora   203522         genus
759      Longispora  2759766         genus
380      Nitrospira     1234         genus
381      Nitrospira   203693         class
187      Paracoccus      265         genus
188      Paracoccus   249411         genus
802  Planctomycetes      112         order
803  Planctomycetes   203682        phylum
804  Planctomycetes   203683         class
792  Proteobacteria     1224        phylum
793  Proteobacteria     1224        phylum
227     Rhodococcus     1827         genus
228     Rhodococcus  1661425         genus
311      Syntrophus  1671858         genus
310      Syntrophus    43773         genus
It turns out that some of these names are associated with multiple entries in the NCBI taxonomy files (names.dmp). Some of these entries are redundant (same name from different sources with the same taxid) while some names actually refer to different taxa. I'm afraid that resolving these nomenclature issues to identify the "correct" taxid for each name is outside the scope of pytaxonkit.
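To separate the two cases described above, a small pandas sketch along these lines may help. It is not pytaxonkit functionality: the results frame here is a hand-built stand-in whose rows are copied from the output in this thread, and the variable names are only illustrative. Redundant rows (same name, same taxid) are collapsed, and names that still map to more than one taxid are listed for manual review.

```python
import pandas as pd

# Hand-built stand-in for a name2taxid result; rows copied from this thread.
results = pd.DataFrame(
    {
        "Name": ["Aeromicrobium", "Bacillus", "Bacillus", "Bacteria",
                 "Bacteria", "Bacteria", "Firmicutes", "Firmicutes"],
        "TaxID": [2040, 1386, 55087, 2, 2, 629395, 1239, 1239],
        "Rank": ["genus", "genus", "genus", "superkingdom",
                 "superkingdom", "genus", "phylum", "phylum"],
    }
)

# Case 1: redundant entries (same name, same taxid) collapse to a single row.
deduped = results.drop_duplicates(subset=["Name", "TaxID"]).reset_index(drop=True)

# Case 2: names that still map to more than one taxid refer to different taxa.
taxids_per_name = deduped.groupby("Name")["TaxID"].nunique()
ambiguous_names = sorted(taxids_per_name[taxids_per_name > 1].index)

print(len(results), "rows before,", len(deduped), "rows after dropping redundant entries")
print("Names needing manual review:", ambiguous_names)
```

In this toy input, Firmicutes collapses to one row, while Bacillus and Bacteria remain ambiguous and would still need a decision (for example, by rank or lineage) that the sketch does not attempt to make.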
Thank you so much @standage! I understand that it's outside the scope, but at least it's good to know where the discrepancy is coming from.
Appreciate your prompt help with this.
Best, Susheel
Hi,
Firstly, thank you for this amazing tool! I have a question regarding possible duplicates when running name2taxid on a larger list.
My list (column: name) below contains 17836 elements.
However, when I run the name2taxid conversion on them I get the following: 21969 rows in the results compared to 17835 in the input. Is it possible that some 'names' are getting duplicate taxIDs?
Thank you for your help with this, Susheel