Closed luizirber closed 4 years ago
547042 strain 2|976|200643|171549|815|816|387090|547042 Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides coprophilus 0.0 4447072.1
547042 is listed as "Rank: no rank" in the NCBI Taxonomy.
So for this one I implemented some logic in gather_to_opal.py to check for "no rank" and figure out whether the taxon is a strain or a subspecies, and I can now get this call right.
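To make the "no rank" check concrete, here is a minimal sketch of one way such a fallback could work. It is not the actual code in gather_to_opal.py: the function name `resolve_rank`, the `(taxid, rank)` lineage representation, and the rule "a `no rank` node below a species is treated as a strain" are all assumptions for illustration. The taxon IDs are taken from the taxpath of the 547042 line above.

```python
# Hypothetical sketch; the real logic in gather_to_opal.py may differ.

def resolve_rank(lineage):
    """Return a usable rank for the last node of a lineage.

    lineage: list of (taxid, rank) tuples ordered from root to leaf,
    with ranks as reported by the NCBI Taxonomy.
    """
    _taxid, rank = lineage[-1]
    if rank != "no rank":
        # NCBI already gives an explicit rank (e.g. "species", "subspecies").
        return rank
    ancestor_ranks = {r for _, r in lineage[:-1]}
    if "species" in ancestor_ranks:
        # Assumption: an unranked node below a species is treated as a strain.
        return "strain"
    # Unranked node above species level: nothing sensible to infer.
    return "no rank"


# Lineage for 547042, using the IDs from the gold standard taxpath.
lineage = [
    (2, "superkingdom"),
    (976, "phylum"),
    (200643, "class"),
    (171549, "order"),
    (815, "family"),
    (816, "genus"),
    (387090, "species"),
    (547042, "no rank"),
]
print(resolve_rank(lineage))
```

With this rule, 547042 would be reported as a strain, which matches the rank used in the gold standard line.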
36874.1 strain 2|976|200643|171549|171551|836|36874|36874.1 Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|Porphyromonas|Porphyromonas cangingivalis|Unclassified Porphyromonas cangingivalis strain 0.0 187028.0
36874.1 is not a valid NCBI Tax ID, so how was it assigned?
But this case is still confusing, and I'm curious how it was added to the gold standard =]
In the CAMI gold standards, the strain IDs just reflect how many strains there are for a certain species. For instance, if there are 5 strains of Escherichia coli, then their strain IDs will be 562.1, 562.2, 562.3, 562.4, and 562.5. Which strain gets which number is random, so it is impossible to assess relative abundance predictions at the strain level, but at least we can assess whether the number of predicted strains is similar to the number in the gold standard. Since there doesn't seem to be an established way to define strain IDs, this is the best that we can do right now. Anyway, OPAL just matches taxon IDs, and how they are defined is not really relevant for the tool itself.
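As a rough illustration of the "compare the number of strains" idea, the sketch below counts strains per species from IDs of the form `<species taxid>.<n>`. It is not how OPAL works internally; the helper names and the `predicted` list are made up, and it assumes predicted strain IDs follow the same `species.n` convention (real NCBI strain-level IDs such as 547042 carry no such suffix and would need a taxonomy lookup instead).

```python
from collections import Counter

# Hypothetical helpers for illustration only.

def species_of(strain_id):
    """Return the species part of a CAMI-style strain ID, e.g. '562.3' -> '562'.

    IDs without a '.n' suffix are returned unchanged.
    """
    return strain_id.split(".", 1)[0]


def strains_per_species(strain_ids):
    """Count how many strain IDs belong to each species ID."""
    return Counter(species_of(s) for s in strain_ids)


# Gold standard IDs as described above; the predictions are invented.
gold = ["562.1", "562.2", "562.3", "562.4", "562.5", "36874.1"]
predicted = ["562.1", "562.2", "562.3", "36874.1"]

gold_counts = strains_per_species(gold)
pred_counts = strains_per_species(predicted)

for species in sorted(gold_counts):
    print(species, "gold:", gold_counts[species], "predicted:", pred_counts.get(species, 0))
```

Because the numbering within a species is arbitrary, only the counts are comparable here, not which specific strain was predicted.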
I see, thanks! I will still keep the info I can get from levels below species in the NCBI taxonomy, but not worry about the specific taxon ID formatting (it will still be summarized properly for higher levels anyway).
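To show why the strain ID format does not matter for higher levels, here is a small sketch that rolls abundances up along the taxpath. It assumes the taxpath positions map to the rank order superkingdom, phylum, class, order, family, genus, species, strain, which matches the eight-element taxpaths in the lines quoted in this thread. The function names and the percentages are invented for the example.

```python
from collections import defaultdict

# Assumed rank order for the '|'-separated taxpath (not confirmed in this thread).
CAMI_RANKS = ["superkingdom", "phylum", "class", "order",
              "family", "genus", "species", "strain"]


def taxpath_by_rank(taxpath):
    """Map each rank to the taxon ID at that position of the taxpath."""
    ids = taxpath.split("|")
    return {rank: taxid for rank, taxid in zip(CAMI_RANKS, ids) if taxid}


def rollup(profile, rank):
    """Sum percentages of (taxpath, percentage) entries at the given rank."""
    totals = defaultdict(float)
    for taxpath, percentage in profile:
        taxid = taxpath_by_rank(taxpath).get(rank)
        if taxid is not None:
            totals[taxid] += percentage
    return dict(totals)


# Taxpaths from the gold standard lines quoted above; percentages are made up.
profile = [
    ("2|976|200643|171549|815|816|387090|547042", 1.5),
    ("2|976|200643|171549|171551|836|36874|36874.1", 2.5),
]

print(rollup(profile, "species"))
print(rollup(profile, "order"))
```

The strain-level entries `547042` and `36874.1` are only used as leaves; everything from species upward is summed by the IDs in the taxpath, so the `.1` suffix never affects the higher ranks.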
Hello,
I was looking at the CAMI II MG gold standard and noticed lines with strains (which is also listed in the ranks). Since the gold standard doesn't mention it, what taxonomy was used? It seems to be NCBI-like or derived, but I thought NCBI hasn't assigned strain tax IDs since 2014?
547042 is listed as "Rank: no rank" in the NCBI Taxonomy.
Another example:
36874.1 is not a valid NCBI Tax ID, so how was it assigned?