dib-lab / 2022-sra-gather

Classify all the metagenomes. ALL THE METAGENOMES. (Eventually.)

thoughts about leveraging database covers for biology #11

Open taylorreiter opened 2 years ago

taylorreiter commented 2 years ago

So an immediate drawback of using covers to build databases is that we lose strain-level identification with the guarantee that the best-matching strain will always be returned.

But a great side benefit is that if a species exists across biomes, the first match in both biomes will always be the first sketch the db cover encountered when being built. Then, any additional matches will represent chunks of sequence not in the original sketch, but still present in the environment.
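To make the "first sketch wins" behaviour concrete, here is a toy sketch of cover construction. It assumes a cover is built by walking the sketches in order and keeping, for each one, only the hashes no earlier sketch has claimed. Sketches are modelled as plain Python sets of integers; `build_cover` and the hash values are made up for illustration and are not the actual code or API used in this repo.

```python
def build_cover(sketches):
    """Build a toy database cover.

    sketches: ordered list of (name, set_of_hashes).
    Returns {name: hashes first seen in that sketch}.
    Assumption: each hash is assigned to the first sketch that contains it.
    """
    seen = set()
    cover = {}
    for name, hashes in sketches:
        novel = hashes - seen      # only the chunk not already covered
        if novel:                  # fully redundant sketches drop out
            cover[name] = novel
        seen |= novel
    return cover


# Hypothetical hashes: k12 is encountered first, so it keeps everything.
sketches = [
    ("k12", {1, 2, 3, 4}),
    ("EHEC", {1, 2, 3, 4, 5, 6}),
    ("MX", {1, 2, 3, 7}),
]
cover = build_cover(sketches)
for name, hashes in cover.items():
    print(name, sorted(hashes))
```

In this toy model the later sketches keep only the sequence that k12 did not already account for, which is why additional matches represent chunks absent from the first sketch.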

What this might allow us to do is look at the species that are present across biomes, but look for specific matches that are only present in one biome. What I'm imagining is something like this:

E. coli k12 is in the gut and in soil.
E. coli EHEC is only in gut.
E. coli MX is only in soil.

Might be cool for identifying strains, or at least genome sequences, that are specific to an environment (or at least seed hypotheses about this stuff)
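A rough sketch of that comparison, under the same toy assumptions (sets of integer hashes, a cover stored as a dict, made-up values). `matches` and `biome_specific` are hypothetical helper names, not functions from any existing tool.

```python
def matches(cover, sample_hashes):
    """Names of cover entries sharing at least one hash with the sample."""
    return {name for name, hashes in cover.items() if hashes & sample_hashes}


def biome_specific(cover, biome_a, biome_b):
    """Split cover matches into shared and biome-specific sets."""
    a = matches(cover, biome_a)
    b = matches(cover, biome_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy cover: k12 was encountered first, EHEC and MX hold only novel chunks.
cover = {"k12": {1, 2, 3, 4}, "EHEC": {5, 6}, "MX": {7}}

# Hypothetical metagenome hashes for each biome.
gut = {1, 2, 3, 4, 5, 6}
soil = {1, 2, 3, 4, 7}

result = biome_specific(cover, gut, soil)
print(result)
```

Here k12 comes back as shared, while EHEC and MX show up only in gut and soil respectively. With real data these would be hypotheses to follow up, not confirmed strain calls.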

(side note -- if these covers weren't built with GTDB reps first, it would be really good to have GTDB reps first. Then our fav model orgs will always get the biggest chunk of matches, which is a very useful thing.)
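To illustrate why ordering matters, a small sketch assuming the same toy cover model (first sketch to contain a hash keeps it). `order_reps_first`, `chunk_sizes`, and the hash values are hypothetical and only show the effect of putting representatives at the front.

```python
def order_reps_first(sketches, reps):
    """Stable sort: sketches whose name is in `reps` come first."""
    return sorted(sketches, key=lambda s: s[0] not in reps)


def chunk_sizes(sketches):
    """Number of hashes each sketch keeps when the cover is built in order."""
    seen = set()
    sizes = {}
    for name, hashes in sketches:
        novel = hashes - seen
        sizes[name] = len(novel)
        seen |= novel
    return sizes


# Hypothetical input order with a non-representative strain listed first.
sketches = [
    ("strain_a", {1, 2, 3, 4, 6, 7, 8}),
    ("rep", {1, 2, 3, 4, 5}),
]

as_given = chunk_sizes(sketches)
reps_first = chunk_sizes(order_reps_first(sketches, {"rep"}))
print(as_given)
print(reps_first)
```

In the given order the representative keeps a single hash; with representatives first it keeps all five, and the strain keeps only its three novel hashes.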