HadrienG / 2019_classifiers_benchmark

Benchmarking of metagenomic classifiers

Databases #3

Open HadrienG opened 6 years ago

HadrienG commented 6 years ago

Meta-issues for database-related experimental design discussions

jhayer commented 6 years ago

It depends on the datasets we are going to simulate. I guess when we add unknown sequences, it might be worth using nr (but it will be too big to build for some tools, Kraken for example).

HadrienG commented 6 years ago

it might be worth using nr

Why? The only reason I see for it is that we may find edge cases or organisms that are hard to classify.

A subset would speed up the analyses greatly while still letting us assess how well software X can (i) classify a sequence present in the database, (ii) classify an unknown sequence at a higher taxonomic rank, (iii) avoid mis-classifying sequences and (iv) eventually leave a sequence unassigned if it is too distant.

(iv) is even easier with a subset since we would not need to "invent" genomes
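The four outcomes (i)-(iv) could be scored per read with a small helper. A minimal sketch, assuming lineages are represented as plain rank-to-name dicts; the function name and the example lineages are hypothetical, not tied to any real classifier's output format:

```python
# Sketch: bucket a single read's classification into one of the four
# benchmark outcomes described above. All names here are illustrative.

def score_read(truth_lineage, predicted_lineage, in_database):
    """Return which benchmark outcome a prediction falls into."""
    if predicted_lineage is None:
        # (iv) left unassigned -- desired behaviour for too-distant unknowns
        return "unclassified"
    if in_database and predicted_lineage.get("species") == truth_lineage.get("species"):
        return "correct_species"          # (i) sequence present in the db
    # (ii) unknown sequence classified at a higher rank that still matches
    for rank in ("genus", "family", "order"):
        if predicted_lineage.get(rank) == truth_lineage.get(rank):
            return f"correct_{rank}"
    return "misclassified"                # (iii)

truth = {"species": "Escherichia coli", "genus": "Escherichia",
         "family": "Enterobacteriaceae"}
pred = {"species": "Escherichia fergusonii", "genus": "Escherichia",
        "family": "Enterobacteriaceae"}
print(score_read(truth, pred, in_database=False))  # -> correct_genus
```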


As I wrote in the Readme as preliminary info, there are currently 23839 complete genomes in the NCBI databases.

Those genomes are distributed as follows:

| division | n genomes | fraction |
|----------|-----------|----------|
| bacteria | 9655      | 0.405    |
| viruses  | 13892     | 0.583    |
| archaea  | 292       | 0.012    |
| total    | 23839     | 1        |

Taking, for example, 1000 genomes would mean having a small db of 405 bacteria, 583 viruses and 12 archaea.

A small db would allow us to run lots of simulations and get confidence intervals for the % of classified sequences.
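The proportional split quoted above can be reproduced with a few lines. A minimal sketch using the counts from the table; the function name is made up:

```python
# Proportional subsampling of the genome counts quoted above.
counts = {"bacteria": 9655, "viruses": 13892, "archaea": 292}
assert sum(counts.values()) == 23839

def subset_sizes(counts, n):
    """Genomes per division for a subset of n, proportional to the full counts."""
    total = sum(counts.values())
    return {division: round(n * c / total) for division, c in counts.items()}

print(subset_sizes(counts, 1000))
# -> {'bacteria': 405, 'viruses': 583, 'archaea': 12}
```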

HadrienG commented 6 years ago

The difficulty I see with taking a subset is that I can imagine two completely different approaches, both of which have their disadvantages:

jhayer commented 6 years ago

My reason for nr: at least for the viruses, quite a few have sequences in GenBank or ENA but no complete genome, so if we use more unknown sequences, those will not be detected with the genomes DB (that was my reason for adding all viral sequences in my Kraken DB). But the downside is that it also brings more false positives (sequences deposited but wrongly annotated). I need to think more about your 2 options, but a small DB seems more reasonable indeed.

HadrienG commented 6 years ago

I hear what you are saying for "real-life" metagenomics, but in this experiment's setup there is no such thing as a truly unknown sequence. All sequences present in the dataset should be known to us, and then present - or absent - from the database we'll use.

In that regard, what would be the advantage of taking "known unknown" sequences from nr instead of complete genomes?

jhayer commented 6 years ago

ok, then probably none

Ackia commented 6 years ago

After some thinking, I agree with Hadrien on this. My gut feeling is with Juliette, but I cannot see a reason for complicating a simulation to the point of closing in on real-life problems when we want to create a framework for testing methodologies.

I have some reasons for this:

a) We would need an accurate method of simulating unknowns. We do not have one, so there is no specific reason to put known "unknowns" in there; they are often just badly annotated sequences.

b) This idea can be expanded upon later by adding such a tool. That is a great secondary article down the road.

c) There might be a reason to use other databases, but since we have no evidence of that at the current stage, we should proceed with the RefSeq database.

d) Given the procedures here, adding more databases at a later stage (if reviewers want them, for example) should be easily doable.

jhayer commented 6 years ago

agreed 👍 And which one of the 2 solutions for reducing the DB? I still don't know... Maybe the 2nd one, removing some clades. But we can discuss that a bit more.

HadrienG commented 6 years ago

So here is solution 3, as we discussed earlier: we take the representative genomes from RefSeq, which will greatly reduce the database and eliminate strain variation that is not necessarily interesting for us. That is, granted we decide to benchmark the classifiers at the species level.*

This is sort of similar to solution 1 with the advantage that it will keep species that are underrepresented in the databases and trim away the big groups where lots of strains are sequenced.

This leaves us with

For getting viruses with this solution, we can take the database of curated viruses created by the CDC (paging @Ackia for a reference on this)

*: What we could do is add a small separate db with, e.g., all E. coli strains and add a section on strain identification.
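Pulling the representative genomes could come down to filtering NCBI's assembly_summary.txt on its refseq_category column, which flags "representative genome" and "reference genome" entries. A sketch below; the column layout follows the public RefSeq summary files, but the two sample rows are made up:

```python
import csv
import io

# Sketch: select representative/reference assemblies from an NCBI
# assembly_summary.txt. The sample rows below are illustrative only.
sample = (
    "#assembly_accession\tbioproject\tbiosample\twgs_master\trefseq_category"
    "\ttaxid\tspecies_taxid\torganism_name\n"
    "GCF_000005845.2\tPRJNA57779\tSAMN02604091\t\treference genome"
    "\t511145\t562\tEscherichia coli str. K-12\n"
    "GCF_000168815.1\tPRJNA224116\tSAMN00000001\t\tna"
    "\t1280\t1280\tStaphylococcus aureus\n"
)

def representative_assemblies(handle):
    """Yield accessions whose refseq_category marks them as representative."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if row[0].startswith("#"):
            continue  # skip header/comment lines
        if row[4] in ("representative genome", "reference genome"):
            yield row[0]

print(list(representative_assemblies(io.StringIO(sample))))
# -> ['GCF_000005845.2']
```

In practice the same filter would be run over the full summary file downloaded from the NCBI FTP site rather than an in-memory string.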

Ackia commented 6 years ago

Database for representative viruses https://hive.biochemistry.gwu.edu/rvdb

HadrienG commented 6 years ago

So it is big. Like GenBank + RefSeq big, which means 605,974 sequences. It's probably an awesome db for the metaviromics we do, but I think it is overkill for this.

Should we consider not subsetting and going with RefSeq?

RefSeq assembly stats (complete genomes) -- 20180321:

| division | n genomes | size     |
|----------|-----------|----------|
| bacteria | 9135      | 10.5 Gb  |
| archaea  | 251       | 185.4 Mb |
| viruses  | 7491      | 78.4 Mb  |
| total    | 16877     | 10.8 Gb  |

We can definitely build the databases (although I'm not sure about salmon): we've built nt (40 Gb) for kraken on the cluster. I'm only afraid that some analyses will take long (I'm looking at you, blast).

HadrienG commented 5 years ago

Updated stats:

| division | n genomes |
|----------|-----------|
| bacteria | 13193     |
| archaea  | 283       |
| viruses  | 8583      |
| total    | 22059     |