Open ctb opened 4 years ago
an alternative is to provide two modes, 'strict' and 'permissive'; default to 'permissive'; suggest that people look at the reports, and adjust as needed.
we could also permit/provide genome specific configuration of this and other parameters (see #30 for another situation where this might be needed). that is, provide sensible defaults and let people fine tune across the entire data set and/or specific genomes.
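a rough sketch of how the two modes plus per-genome overrides could fit together. everything here is hypothetical: the names `DecontamParams`, `PRESETS`, and `resolve_params`, the parameter names, and the preset values are made up for illustration and are not existing options.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DecontamParams:
    # Hypothetical parameters; names and values are illustrative only.
    min_mismatch_hashes: int  # mismatching hashes needed before removing a contig
    match_rank: str           # taxonomic rank at which lineages are compared


# Two presets, with 'permissive' as the default mode.
PRESETS = {
    "strict": DecontamParams(min_mismatch_hashes=1, match_rank="genus"),
    "permissive": DecontamParams(min_mismatch_hashes=3, match_rank="order"),
}


def resolve_params(genome, mode="permissive", overrides=None):
    """Start from the preset for `mode`, then apply any per-genome overrides."""
    params = PRESETS[mode]
    per_genome = (overrides or {}).get(genome, {})
    return replace(params, **per_genome)
```

e.g. `resolve_params("genomeA", overrides={"genomeA": {"match_rank": "family"}})` keeps the permissive defaults but compares lineages at family level for that one genome.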
with genome specific configuration we could also target removal of specific lineages, e.g. if you see species s__NORP36, kill the contig. this starts to get into fine detail tho :(.
(I kind of worry that people are going to look at the final gather report on the clean contigs and say, "...but I want 100% clean! why can't I have 100% clean?")
if we provided a genome specific configuration we could allow people to forcibly keep specific contigs, which might be useful (e.g. if they've been validated in some way)
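a minimal sketch of how per-genome "kill this lineage" and "forcibly keep this contig" rules might combine. the config keys (`keep_contigs`, `remove_lineages`) and the function `decide_contig` are assumptions for illustration, not an existing format; the forced keep is checked first so that validated contigs win over everything else.

```python
def decide_contig(contig_name, contig_lineage, flagged, genome_config):
    """Return 'keep' or 'remove' for a single contig.

    contig_name    -- name of the contig
    contig_lineage -- taxon names assigned to the contig, e.g. [..., "s__NORP36"]
    flagged        -- True if the default decontamination would remove it
    genome_config  -- hypothetical per-genome dict with optional keys
                      'keep_contigs' and 'remove_lineages'
    """
    # Forced keeps take priority (e.g. contigs validated in some other way).
    if contig_name in genome_config.get("keep_contigs", ()):
        return "keep"

    # Targeted removal: any listed taxon in the contig's lineage kills it.
    if set(contig_lineage) & set(genome_config.get("remove_lineages", ())):
        return "remove"

    # Otherwise fall back to whatever the default analysis decided.
    return "remove" if flagged else "keep"
```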
Looking through the Tara delmont bin decontamination report, I can see many situations where a single hashvalue is being used to remove a large contig, e.g.
This seems overly stringent for environmental MAGs. A few thoughts --
I came into this issue thinking that we should use presets, but I kind of like the idea of auto-tuning now.
ref PR #31, adjusting thresholds; and #25, specifying taxonomic levels to do decontam at above genus level.
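for the single-hashvalue situation described above, one possible shape for an adjustable threshold is sketched here: require both a minimum number of mismatching hashes and a minimum fraction of the contig's hashes before removing it. the function name, the parameters, and the default values are all assumptions for illustration, not what PR #31 implements.

```python
def should_remove(n_mismatch_hashes, n_total_hashes,
                  min_hashes=2, min_fraction=0.1):
    """Decide whether a contig has enough mismatching evidence to remove.

    n_mismatch_hashes -- hashes on the contig assigned to a conflicting lineage
    n_total_hashes    -- all hashes on the contig
    min_hashes        -- hypothetical absolute threshold
    min_fraction      -- hypothetical threshold relative to contig size
    """
    if n_total_hashes == 0:
        # No hashes means no evidence either way; keep the contig.
        return False

    fraction = n_mismatch_hashes / n_total_hashes
    return n_mismatch_hashes >= min_hashes and fraction >= min_fraction
```

with these made-up defaults, a large contig with 1 mismatching hash out of 50 is kept, while one with 10 out of 50 is removed; a 'strict' mode could simply set `min_hashes=1, min_fraction=0.0`.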