rcedgar closed this issue 4 years ago
So 421/957 is about right?
Sure.
This needs to be updated for the website as well.
@rcedgar just wondering, where did you get the number 957? Couldn't find that number documented anywhere else so I did a query on Coronaviridae but found 962 species:
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder; NCBI asks for a contact address

coronaviridae_query = "txid11118[Organism:exp]"
with Entrez.esearch(db='taxonomy', term=coronaviridae_query, retmax=3000) as handle:
    response = Entrez.read(handle)
ids = response['IdList']
# Fetch the full taxonomy records and keep only those ranked as species
with Entrez.efetch(db='taxonomy', id=ids) as handle:
    response = Entrez.read(handle)
species_list = [entry for entry in response if entry['Rank'] == 'species']
print(len(species_list))
# 962
Am I missing something here?
I have a local copy of the taxonomy tree. I do my own searches through the tree structure in Python. My copy is probably out of date and/or my tree-walking logic is not equivalent to your Entrez query. For the mission statement, there is no meaningful difference between 957 and 962, but both are more than double the 436 in the original.
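For anyone who wants to reproduce a tree-walk count locally, here is a minimal sketch. It is not the script used above. It assumes a `nodes.dmp` file from the NCBI taxonomy dump, where the first three fields are tax_id, parent tax_id and rank, and the function name `count_species_under` is made up for illustration.

```python
from collections import defaultdict

def count_species_under(nodes_lines, root_taxid):
    """Count nodes of rank 'species' in the subtree rooted at root_taxid."""
    children = defaultdict(list)
    rank = {}
    for line in nodes_lines:
        # nodes.dmp fields are separated by "\t|\t"; the first three are
        # tax_id, parent tax_id and rank.
        fields = line.rstrip("\n").split("\t|\t")
        if len(fields) < 3:
            continue
        taxid, parent = int(fields[0]), int(fields[1])
        rank[taxid] = fields[2].rstrip("\t|").strip()
        if taxid != parent:  # the root node lists itself as its parent
            children[parent].append(taxid)

    # Iterative depth-first walk from the requested root
    count, stack = 0, [root_taxid]
    while stack:
        node = stack.pop()
        if rank.get(node) == "species":
            count += 1
        stack.extend(children[node])
    return count
```

With a real dump this would be called as `count_species_under(open("nodes.dmp"), 11118)`. Small choices, such as whether entries ranked below species or marked unclassified are counted, and the date of the dump, are plausible reasons for a 957 versus 962 discrepancy.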
Sure, thanks for the clarification. The numbers are marginally off but it's more about the fact that they are different. I just think it's nice to have something explicit to back the numbers that are put up.
I agree, by all means let's go with your number.
On the topic of reproducible numbers, I don't quite understand the method of clustering at 97% to get 421 species. This method seems to lose information about which of the 957 (or 962) species have a complete genome. For example, if genomes from two different species were 98% similar, wouldn't the clustering method treat the two genomes as coming from the same species?
I approached this differently: since each entry in complete_cov_genomes.txt has a taxonomy ID associated with it, I counted the unique taxids and found that the complete genomes come from 410 taxids. Again, a marginal difference from 421, but the approach is very different.
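To make the concern concrete, here is a toy sketch of greedy clustering at a 97% threshold. It is not the clustering tool used for the 421 figure. Identity is computed as the fraction of matching positions between equal-length strings, which is a simplification, and the sequences and species names are invented.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length strings."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    centroids = []   # names of cluster representatives, in order of creation
    clusters = {}    # centroid name -> list of member names
    for name, seq in seqs.items():
        for c in centroids:
            if identity(seq, seqs[c]) >= threshold:
                clusters[c].append(name)
                break
        else:
            # No existing centroid is close enough: start a new cluster
            centroids.append(name)
            clusters[name] = [name]
    return clusters

base = "ACGT" * 25  # invented 100 nt sequence
seqs = {
    "species_A": base,
    "species_B": "TT" + base[2:],        # 2 mismatches: 98% identical to A
    "species_C": "T" * 10 + base[10:],   # 8 mismatches: 92% identical to A
}
clusters = greedy_cluster(seqs, threshold=0.97)
print(clusters)
```

In this toy case species_A and species_B land in the same cluster, so a cluster count alone cannot say which named species are represented.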
The only problem: some of the associated taxids don't seem to be specific down to the species rank. All taxids at least fall under the Coronaviridae family, but if we assume each taxid represents a unique species then we could say that 410 Coronaviridae species have complete genomes.
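A minimal sketch of the unique-taxid count follows. The layout of complete_cov_genomes.txt is not shown in this thread, so the sketch assumes a tab-separated file with the taxid in the second column. The column index, the function name and the sample rows are all hypothetical.

```python
import csv
import io

def count_unique_taxids(handle, taxid_column=1):
    """Count distinct taxids in a tab-separated file (assumed layout)."""
    taxids = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip rows that are too short or have an empty taxid field
        if len(row) > taxid_column and row[taxid_column].strip():
            taxids.add(row[taxid_column].strip())
    return len(taxids)

# Invented rows standing in for the real file: accession, then taxid
example = io.StringIO("ACC0001\t111\nACC0002\t111\nACC0003\t222\n")
n_unique = count_unique_taxids(example)
print(n_unique)
```

To treat each taxid as a species, the taxids would first need to be mapped up to their species-rank ancestor, since some of them sit above or below that rank.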
I have a pretty limited biological background and I'm sure the 97% was a deliberate decision and it'd be cool to know the reasoning. Just sharing my thoughts :)
This is like trying to reason what is better, emacs or vim. At the end of the day you need to state your reasons and multiple approaches are valid since there isn't a universal catch-all that fits everyone. And deep down we know vim is the right answer, but we tolerate those who disagree.
Neat. I like writing Python scripts in mspaint but I guess vim is alright.
The number of known Cov species and the number of full-length genomes are not well-defined. They are fuzzy because of arbitrary definitions of "species" for viruses (is Pluto a planet?), unnamed species (how to count them?) and other biological complications. My point in raising the issue was that the numbers in the readme were way wrong; now we're in the right ballpark and that's the main thing IMO. This issue is solved 100% +/- 9.5%.
Thanks for the insight @ababaian @rcedgar. Guess I need to be more comfortable with the fuzziness of biology.
There are 957 Cov species in the NCBI taxonomy tree.
There are 4,236 complete genomes of Coronaviridae species in GenBank, which is ~4 complete genomes per species. List here:
complete_cov_genomes.txt
After removing duplicate entries, there are 672 distinct genomes. After clustering at 97% identity, there are 421. The project README.md should be adjusted accordingly.
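The duplicate-removal step could look roughly like the sketch below. It assumes "duplicate" means an identical nucleotide sequence, ignoring case, which may not match the criterion actually used here. The FASTA records are invented and the helper names are illustrative.

```python
import hashlib

def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def dedupe(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for header, seq in records:
        # Hash the sequence so full genomes are not held in the 'seen' set
        digest = hashlib.sha256(seq.upper().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((header, seq))
    return kept

# Invented records: g1 and g2 carry the same sequence in different case
fasta = [">g1", "ACGT", "ACGT", ">g2", "acgtacgt", ">g3", "TTTT"]
distinct = dedupe(read_fasta(fasta))
print(len(distinct))
```

Clustering the surviving genomes at 97% identity is a separate step that needs a sequence-clustering tool; this sketch only covers the exact-duplicate pass.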