CAMI-challenge / OPAL

OPAL: Open-community Profiling Assessment tooL
https://cami-challenge.github.io/OPAL/
Apache License 2.0

Strains in the gold standard? #30

Closed luizirber closed 4 years ago

luizirber commented 4 years ago

Hello,

I was looking at the CAMI II MG gold standard and noticed lines with strains (which are also listed in the ranks). Since the gold standard doesn't mention it, what taxonomy was used? It seems to be NCBI-like or NCBI-derived, but I thought NCBI hasn't assigned strain tax IDs since 2014?

547042  strain  2|976|200643|171549|815|816|387090|547042   Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides coprophilus 0.0 4447072.1

547042 is listed as Rank: no rank in the NCBI Taxonomy

Another example:

36874.1 strain  2|976|200643|171549|171551|836|36874|36874.1    Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|Porphyromonas|Porphyromonas cangingivalis|Unclassified Porphyromonas cangingivalis strain   0.0 187028.0

36874.1 is not a valid NCBI Tax ID, so how was it assigned?

luizirber commented 4 years ago

547042    strain  2|976|200643|171549|815|816|387090|547042   Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Bacteroidaceae|Bacteroides|Bacteroides coprophilus 0.0 4447072.1

547042 is listed as Rank: no rank in the NCBI Taxonomy

So for this one I implemented some logic in gather_to_opal.py to check for "no rank" and figure out whether it is a strain or a subspecies, and I can get this call properly now.
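
A minimal sketch of this kind of check, not the actual code in gather_to_opal.py. It assumes the lineage is already available as a root-to-leaf list of (taxid, rank) pairs, for example built from the NCBI taxonomy dump. The function name `effective_rank` and the rule "a 'no rank' node below a species is reported as a strain" are assumptions made for illustration, and the ranks in the sample lineage are illustrative.

```python
def effective_rank(lineage):
    """Return a rank label for the last node of a root-to-leaf lineage.

    `lineage` is a list of (taxid, rank) pairs. A 'no rank' node that sits
    below a species node is labelled 'strain' (an assumption of this sketch).
    """
    taxid, rank = lineage[-1]
    if rank != "no rank":
        return rank
    ancestor_ranks = [r for _, r in lineage[:-1]]
    if "species" in ancestor_ranks:
        return "strain"
    return rank


# Lineage for the first example above; ranks are filled in for illustration.
lineage = [
    ("2", "superkingdom"),
    ("976", "phylum"),
    ("200643", "class"),
    ("171549", "order"),
    ("815", "family"),
    ("816", "genus"),
    ("387090", "species"),
    ("547042", "no rank"),
]

print(effective_rank(lineage))
```

Ranks that NCBI does assign below species, such as subspecies, pass through unchanged because only "no rank" nodes are reinterpreted.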

36874.1   strain  2|976|200643|171549|171551|836|36874|36874.1    Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|Porphyromonas|Porphyromonas cangingivalis|Unclassified Porphyromonas cangingivalis strain   0.0 187028.0

36874.1 is not a valid NCBI Tax ID, so how was it assigned?

But this case is still confusing, and I'm curious as to how it was added to the gold standard =]

fernandomeyer commented 4 years ago

In the CAMI gold standards, the strain IDs just reflect how many strains there are for a certain species. For instance, if there are 5 strains of Escherichia coli, then their strain IDs will be 562.1, 562.2, 562.3, 562.4, and 562.5. Which strain gets a certain number is random, and therefore it is impossible to assess relative abundance predictions at the strain level, but at least we can assess whether the number of predicted strains is similar to that in the gold standard. Since there doesn't seem to be an established way to define strain IDs, this is the best that we can do right now. Anyway, OPAL just matches taxon IDs, and how they are defined is not really relevant for the tool itself.
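
To illustrate the numbering scheme, here is a minimal Python sketch. It is not part of OPAL: the helper names `split_strain_id` and `strains_per_species` and the sample ID lists are hypothetical, and the only assumption taken from the description above is that a strain ID has the form `<species taxid>.<index>`.

```python
from collections import Counter


def split_strain_id(taxid):
    """Split a strain ID such as '562.3' into ('562', 3).

    IDs without a numeric '.<index>' suffix are returned as (taxid, None).
    """
    species, sep, index = taxid.partition(".")
    if sep and index.isdigit():
        return species, int(index)
    return taxid, None


def strains_per_species(taxids):
    """Count how many strain IDs are listed under each species taxid."""
    counts = Counter()
    for taxid in taxids:
        species, index = split_strain_id(taxid)
        if index is not None:
            counts[species] += 1
    return counts


# Hypothetical strain-level IDs from a gold standard and from a prediction.
gold = ["562.1", "562.2", "562.3", "562.4", "562.5", "36874.1"]
predicted = ["562.1", "562.2", "562.3", "36874.1", "36874.2"]

gold_counts = strains_per_species(gold)
pred_counts = strains_per_species(predicted)

# Only the number of strains per species is comparable, not which index is which.
for species in sorted(set(gold_counts) | set(pred_counts)):
    print(species, gold_counts[species], pred_counts[species])
```

Because the index after the dot is arbitrary, the sketch compares only per-species strain counts and never matches individual strain IDs between the two lists.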

luizirber commented 4 years ago

I see, thanks! I will still keep the info I can get from levels below species in the NCBI taxonomy, but not worry about the specific taxon ID formatting (it will still be summarized properly for higher levels anyway).
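
A minimal sketch of why the strain ID format does not matter for higher levels, assuming each strain-level entry carries a `|`-separated taxpath like the ones quoted above. The function name `roll_up`, the second strain ID, and the abundance values are made up for illustration.

```python
from collections import defaultdict


def roll_up(leaf_abundances):
    """Sum leaf abundances onto every taxid that appears in the leaf's taxpath."""
    totals = defaultdict(float)
    for taxpath, abundance in leaf_abundances.items():
        for taxid in taxpath.split("|"):
            if taxid:  # skip empty path components
                totals[taxid] += abundance
    return dict(totals)


# Two hypothetical strains of the same species with made-up abundances.
leaves = {
    "2|976|200643|171549|171551|836|36874|36874.1": 1.5,
    "2|976|200643|171549|171551|836|36874|36874.2": 0.5,
}

totals = roll_up(leaves)
print(totals["36874"], totals["836"])
```

The species and genus totals depend only on the ancestors in the taxpath, so any unique label at the strain level gives the same sums.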