HadrienG / 2019_classifiers_benchmark

Benchmarking of metagenomic classifiers

Datasets #4

Open HadrienG opened 6 years ago

HadrienG commented 6 years ago

Meta-issue for dataset-related experimental design discussions.

jhayer commented 6 years ago

At least 10 datasets I think. Abundance level / Sequencing instrument: all the options provided by ISS :-) Do we want mixed datasets only? Or should we try some bacteria only, viruses only, etc. as well?

HadrienG commented 6 years ago

> At least 10 datasets I think.

Completely agree. Depending on the db size and the power we need, we could go up to a hundred. But I think 10 is a great start.

> Abundance level / Sequencing instrument: all the options provided by ISS :-)

"Abundance level" might be a bad word choice. I meant the number of species in the datasets.

> Do we want mixed datasets only? Or should we try some bacteria only, viruses only, etc. as well?

I don't see what non-mixed datasets bring to the experiment. Am I missing something obvious? 😄

jhayer commented 6 years ago

Number of species: from 10 to 100 maybe?

For the non-mixed datasets, no, I do not think that they bring anything. I was just asking because, if I remember correctly, they were easier to produce (if we do not provide the genomes, but use the NCBI random option instead). But if we do all mixed, that sounds good.

HadrienG commented 6 years ago

For the number of species, I'm of the opinion that we should stick to 1 to 3 different numbers, which would give us a design such as:

Either we decide on fixed numbers for the diversities, or we decide on ranges for the bins that will be needed for analysing the results.

Concerning the mixed / non-mixed question: yes, non-mixed would be easier to produce with iss create --ncbi, but I doubt we will end up producing datasets from random NCBI genomes. More likely we will draw random genomes from a selected sub-database, as discussed in #3.
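To make the sub-database idea a bit more concrete, here is a minimal sketch of how such a design could be enumerated. Everything specific in it is an assumption for illustration, not something decided in this thread: the three species counts (10, 50, 100, picked from the 10 to 100 range suggested above), the four replicates per count, the `draw_design` helper and the placeholder genome identifiers.

```python
import random


def draw_design(genome_ids, species_counts=(10, 50, 100), replicates=4, seed=42):
    """Enumerate dataset specs, each drawing genomes at random from a sub-database.

    species_counts and replicates are hypothetical values, not agreed numbers.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    design = []
    for n_species in species_counts:
        for rep in range(1, replicates + 1):
            design.append({
                "name": f"mixed_{n_species}sp_rep{rep}",
                "n_species": n_species,
                # sample without replacement: no genome appears twice in a dataset
                "genomes": rng.sample(genome_ids, n_species),
            })
    return design


# Placeholder identifiers standing in for the selected sub-database
sub_database = [f"genome_{i:04d}" for i in range(500)]
design = draw_design(sub_database)
print(len(design))  # 3 species counts x 4 replicates
```

With these assumed values the design has 12 datasets, which stays above the "at least 10" mark. Each list of genomes would then be handed to the read simulator; that step is left out here on purpose.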

jhayer commented 6 years ago

I like this idea of the 3 different types. Fixed numbers or ranges: I do not know what is best. I guess it does not really matter.

Agreed on the second point: we will most likely not use the ncbi option.

replikation commented 5 years ago