Open ctb opened 4 years ago
continuing down the rabbit hole of sourmash-oddify and high-rank k-mers, it might be interesting to see if we can quantify the impact of charcoal decontam by looking at what kinds of high-rank k-mers the contaminated genomes would have introduced into the databases.
we are using GTDB taxonomy because -
we are not using NCBI because:
but it would be good to evaluate all of this more clearly.
one thought might be to find and characterize "high rank" k-mers (k-mers above match rank; see e.g. this blog post) from the database. sourmash_databases has some code to do this.
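for concreteness, here is a minimal sketch of what "find and characterize high-rank k-mers" could look like. this is not the code from sourmash_databases and does not use the sourmash API; the function names (`lca`, `high_rank_kmers`), the rank list, and the input shapes (a dict of genome name to k-mer set, and a dict of genome name to lineage tuple) are all assumptions made for illustration. the idea is to take the lowest common ancestor of every lineage a k-mer appears in, and report the k-mers whose LCA sits above a chosen rank.

```python
from collections import defaultdict

# Illustrative rank names, most general first (an assumption, not taken from the issue).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lca(lineages):
    """Return the longest prefix shared by all lineage tuples."""
    prefix = []
    for names in zip(*lineages):
        if len(set(names)) != 1:
            break
        prefix.append(names[0])
    return tuple(prefix)


def high_rank_kmers(genome_kmers, genome_lineage, min_rank="genus"):
    """Map each k-mer whose LCA is above `min_rank` to the rank of that LCA.

    genome_kmers:   hypothetical dict of genome name -> set of k-mers (or hashes)
    genome_lineage: hypothetical dict of genome name -> lineage tuple ordered as RANKS
    """
    # Collect every lineage in which each k-mer occurs.
    kmer_lineages = defaultdict(set)
    for genome, kmers in genome_kmers.items():
        for kmer in kmers:
            kmer_lineages[kmer].add(genome_lineage[genome])

    cutoff = RANKS.index(min_rank)
    result = {}
    for kmer, lineages in kmer_lineages.items():
        depth = len(lca(lineages))
        # depth == cutoff + 1 means the LCA is exactly at min_rank;
        # anything shallower is a "high rank" k-mer.
        if depth <= cutoff:
            result[kmer] = RANKS[depth - 1] if depth else "root"
    return result
```

running this on a database built with and without the charcoal-flagged contigs, and comparing the two result dicts, would be one way to see which high-rank k-mers the contamination accounts for.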