dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

brainstorm how to evaluate the impact of "bad" databases/taxonomy on performance #77

Open ctb opened 4 years ago

ctb commented 4 years ago

we are using GTDB taxonomy because -

we are not using NCBI because:

but it would be good to evaluate all of this more clearly.

one thought might be to find and characterize "high rank" k-mers (k-mers above match rank; see e.g. this blog post) from the database. sourmash_databases has some code to do this.
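a minimal sketch of the "high rank" k-mer idea, assuming a toy in-memory database: each k-mer is assigned the lowest common ancestor (LCA) of every genome lineage that contains it, and k-mers whose LCA sits at or above a chosen rank are flagged. the rank names, the `high_rank_kmers` helper, and the toy genomes are all hypothetical; this is not the sourmash_databases code, and it ignores reverse complements and hashing for brevity.

```python
from collections import defaultdict

# Hypothetical rank list, ordered from highest (broadest) to lowest.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def kmers(seq, k):
    """Yield every k-mer in seq (forward strand only)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def lca_rank(lineages):
    """Return the lowest rank at which all lineages agree, or None if none."""
    shared = None
    for rank, names in zip(RANKS, zip(*lineages)):
        if len(set(names)) != 1:
            break
        shared = rank
    return shared


def high_rank_kmers(genomes, k, max_rank="phylum"):
    """Map each k-mer shared by 2+ lineages with LCA at or above max_rank to that rank."""
    owners = defaultdict(set)
    for seq, lineage in genomes:
        for kmer in kmers(seq, k):
            owners[kmer].add(lineage)

    cutoff = RANKS.index(max_rank)
    result = {}
    for kmer, lineages in owners.items():
        if len(lineages) < 2:
            continue  # seen in one lineage only, so it is species-specific
        rank = lca_rank(lineages)
        idx = -1 if rank is None else RANKS.index(rank)
        if idx <= cutoff:
            result[kmer] = rank or "root"
    return result


# Toy data: two genomes from one genus, one from a different phylum.
genomes = [
    ("AAACCC", ("Bacteria", "Proteobacteria", "c1", "o1", "f1", "g1", "s1")),
    ("AAACGG", ("Bacteria", "Firmicutes", "c2", "o2", "f2", "g2", "s2")),
    ("GACCC", ("Bacteria", "Proteobacteria", "c1", "o1", "f1", "g1", "s3")),
]
found = high_rank_kmers(genomes, k=4, max_rank="phylum")
print(found)
```

in this toy case the only k-mer shared across phyla is flagged, while the k-mer shared by the two same-genus genomes is left alone; loosening `max_rank` to `"genus"` would pick that one up too.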

ctb commented 4 years ago

#36 is potentially relevant, in terms of doing some kind of cross-MAG analysis within charcoal.

ctb commented 4 years ago

continuing down the rabbit hole of sourmash-oddify and high-rank k-mers, it might be interesting to see if we can quantify the impact of charcoal decontam by looking at what kinds of high-rank k-mers the removed contigs would have introduced into the databases.
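one way this comparison could be sketched, assuming we already have a per-k-mer LCA rank for a database built from the original genomes and for one built from the charcoal-cleaned genomes: count the k-mers that are high-rank only in the original. the `introduced_high_rank` helper, the rank names, and both input dicts are hypothetical stand-ins, not charcoal or sourmash output.

```python
from collections import Counter

# Hypothetical rank list, ordered from highest (broadest) to lowest.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def introduced_high_rank(dirty, clean, max_rank="phylum"):
    """Count, per rank, k-mers that are high-rank in `dirty` but not in `clean`.

    Both arguments map k-mer -> LCA rank name. A k-mer missing from `clean`
    is treated as no longer high-rank.
    """
    cutoff = RANKS.index(max_rank)
    counts = Counter()
    for kmer, rank in dirty.items():
        if RANKS.index(rank) > cutoff:
            continue  # not high-rank even before cleaning
        clean_rank = clean.get(kmer)
        if clean_rank is None or RANKS.index(clean_rank) > cutoff:
            counts[rank] += 1
    return counts


# Toy LCA assignments before and after decontamination.
dirty = {"AAAC": "superkingdom", "ACCC": "genus", "GGGT": "phylum"}
clean = {"AAAC": "species", "ACCC": "genus", "GGGT": "phylum"}
delta = introduced_high_rank(dirty, clean)
print(delta)
```

here one superkingdom-level k-mer drops to species level once the contaminated contig is gone, so it is the only one counted; k-mers that stay high-rank in both databases are presumably genuinely conserved rather than contamination.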