ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Serratus mission statement #87

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

There are 957 Cov species in the NCBI taxonomy tree.

There are 4,236 complete genomes of Coronaviridae species in Genbank, which is ~4 complete genomes per species. List here:

complete_cov_genomes.txt

After removing duplicate entries, there are 672 distinct genomes. After clustering at 97% identity, there are 421. The project README.md should be adjusted accordingly.

ababaian commented 4 years ago

So 421/957 is about right?

rcedgar commented 4 years ago

Sure.

victorlin commented 4 years ago

This needs to be updated for the website as well.

@rcedgar just wondering, where did you get the number 957? Couldn't find that number documented anywhere else so I did a query on Coronaviridae but found 962 species:

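from Bio import Entrez

# NCBI asks for a contact address with E-utilities requests (placeholder value)
Entrez.email = "you@example.com"

# Taxonomy ID 11118 is Coronaviridae; ":exp" asks Entrez to include descendant taxa
coronaviridae_query = "txid11118[Organism:exp]"

# Search the taxonomy database for every taxid under Coronaviridae
with Entrez.esearch(db='taxonomy', term=coronaviridae_query, retmax=3000) as handle:
    response = Entrez.read(handle)
    ids = response['IdList']

# Fetch the full taxonomy record for each taxid
with Entrez.efetch(db='taxonomy', id=ids) as handle:
    response = Entrez.read(handle)

# Keep only the records ranked as species
species_list = [entry for entry in response if entry['Rank'] == 'species']
print(len(species_list))
# 962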

Am I missing something here?

rcedgar commented 4 years ago

I have a local copy of the taxonomy tree. I do my own searches through the tree structure in Python. My copy is probably out of date and/or my tree-walking logic is not equivalent to your Entrez query. For the mission statement, there is no meaningful difference between 957 and 962, but both are more than double the 436 in the original.
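For readers wondering what such a tree walk might look like, here is a minimal sketch. It is not rcedgar's actual code, which is not shown in this thread. It assumes the tree is available as (tax_id, parent_tax_id, rank) tuples, roughly the first three fields of the NCBI taxonomy dump's nodes.dmp. The function name `count_species_under` and every taxid in the toy tree except 11118 are made up for illustration.

```python
from collections import defaultdict

def count_species_under(nodes, root_taxid):
    """Count nodes of rank 'species' in the subtree rooted at root_taxid.

    nodes: iterable of (tax_id, parent_tax_id, rank) tuples (assumed layout).
    """
    children = defaultdict(list)
    rank_of = {}
    for tax_id, parent_id, rank in nodes:
        rank_of[tax_id] = rank
        if tax_id != parent_id:  # the root of the NCBI tree is its own parent
            children[parent_id].append(tax_id)

    # Iterative depth-first walk from the chosen root
    count = 0
    stack = [root_taxid]
    while stack:
        tax_id = stack.pop()
        if rank_of.get(tax_id) == "species":
            count += 1
        stack.extend(children.get(tax_id, []))
    return count

# Toy tree: only 11118 (Coronaviridae) comes from the thread, the rest is invented
toy_nodes = [
    (1, 1, "no rank"),
    (11118, 1, "family"),
    (900001, 11118, "genus"),
    (900002, 900001, "species"),
    (900003, 900001, "species"),
    (900004, 900003, "no rank"),  # e.g. a strain below a species, not counted
    (900005, 1, "species"),       # outside the family, not counted
]
print(count_species_under(toy_nodes, 11118))  # 2
```

Small choices in a walk like this, such as which ranks count as a species or how old the local dump is, are one way two counts can end up a few apart.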

victorlin commented 4 years ago

Sure, thanks for the clarification. The numbers are marginally off but it's more about the fact that they are different. I just think it's nice to have something explicit to back the numbers that are put up.

rcedgar commented 4 years ago

I agree, by all means let's go with your number.

victorlin commented 4 years ago

On the topic of reproducible numbers, I don't quite understand the method of clustering at 97% to get 421 species. This method seems to lose track of which of the 957 (or 962) species have a complete genome. For example, if genomes from two different species were 98% similar, wouldn't the clustering method treat the two genomes as coming from the same species?
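To make the concern concrete, here is a toy sketch of threshold clustering. It is an illustration only, not the tool or method used to get 421. It assumes equal-length sequences and a simple position-by-position identity, whereas real genome clustering relies on alignment. The names `identity`, `greedy_cluster` and `mutate` are hypothetical helpers.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    centroids = []  # names of cluster representatives, in order of creation
    clusters = {}   # centroid name -> list of member names
    for name, seq in seqs.items():
        for c in centroids:
            if identity(seq, seqs[c]) >= threshold:
                clusters[c].append(name)
                break
        else:
            # No existing centroid is close enough: start a new cluster
            centroids.append(name)
            clusters[name] = [name]
    return clusters

def mutate(seq, n):
    """Replace the first n bases with 'N' so that exactly n positions differ."""
    return "N" * n + seq[n:]

base = "ACGT" * 25  # 100 nt toy genome
seqs = {
    "species_A_genome": base,
    "species_B_genome": mutate(base, 2),   # 98% identical to A
    "species_C_genome": mutate(base, 10),  # 90% identical to A
}
clusters = greedy_cluster(seqs, 0.97)
print(len(clusters))  # 2
print(clusters["species_A_genome"])  # ['species_A_genome', 'species_B_genome']
```

In this toy case the genomes labelled A and B land in one cluster even though they carry different species labels, which is exactly the situation described above.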

I approached this differently: since each entry in complete_cov_genomes.txt has a taxonomy ID associated with it, I counted the unique taxids and found that the complete genomes come from 410 taxids. Again, a marginal difference from 421, but the approach is very different.

The only problem: some of the associated taxids don't seem to be specific down to the species rank. All taxids at least fall under the Coronaviridae family, but if we assume each taxid represents a unique species then we could say that 410 Coronaviridae species have complete genomes.
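The taxid count can be sketched roughly as follows. The layout of complete_cov_genomes.txt is not shown in this thread, so the sketch assumes one tab-separated record per line with the taxid in the second column. That layout, the function name `unique_taxids` and the toy accessions are all assumptions.

```python
def unique_taxids(lines, taxid_column=1, sep="\t"):
    """Collect the distinct taxids found in one column of a delimited listing."""
    taxids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(sep)
        taxids.add(fields[taxid_column])
    return taxids

# Toy records with invented accessions and taxids (hypothetical layout)
toy_lines = [
    "ACC0001\t900002",
    "ACC0002\t900002",  # second genome for the same taxid
    "ACC0003\t900003",
    "",
    "ACC0004\t900004",
]
found = unique_taxids(toy_lines)
print(len(found))  # 3

# With the figures quoted in this thread: 410 taxids out of 962 species
coverage = f"{410 / 962:.1%}"
print(coverage)  # 42.6%
```

As noted above, a taxid is not always a species-rank node, so a count like this is only as good as the assumption that each taxid stands for one species.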

I have a pretty limited biological background and I'm sure the 97% was a deliberate decision and it'd be cool to know the reasoning. Just sharing my thoughts :)

ababaian commented 4 years ago

This is like trying to reason what is better, emacs or vim. At the end of the day you need to state your reasons and multiple approaches are valid since there isn't a universal catch-all that fits everyone. And deep down we know vim is the right answer, but we tolerate those who disagree.

victorlin commented 4 years ago

Neat. I like writing python scripts in mspaint but I guess vim is alright.

rcedgar commented 4 years ago

The number of known CoV species and the number with full-length genomes are not well-defined; they are fuzzy because of arbitrary definitions of "species" for viruses (is Pluto a planet?), unnamed species (how to count them?) and other biological complications. My point in raising the issue was that the numbers in the readme were way wrong; now we're in the right ballpark and that's the main thing IMO -- this issue is solved 100% +/- 9.5%.

victorlin commented 4 years ago

Thanks for the insight @ababaian @rcedgar. Guess I need to be more comfortable with the fuzziness of biology.