dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

choosing parameters based on type of sample - environmental vs human microbiome vs ... #32

Open ctb opened 4 years ago

ctb commented 4 years ago

Looking through the decontamination report for the Tara Delmont bins, I can see many cases where a single hash value is enough to remove a large contig, e.g.

---- contig TARA_PSE_MAG_00111_000000000004 (25 kb)
contig dirty, REASON 2 - contig lineage is not a match to genome's genus
lineage is s__Novosphingobium mathurense

** hashval lca counts
   1 kb s__Novosphingobium mathurense

** hashval lineage counts - 1
   1 kb s__Novosphingobium mathurense

Removing a 25 kb contig on the evidence of a single hash seems overly stringent for environmental MAGs. A few thoughts --

I came into this issue thinking that we should use presets, but I kind of like the idea of auto-tuning now.

ref PR #31 (adjusting thresholds) and #25 (specifying taxonomic levels above genus at which to do decontamination).
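To make the threshold idea concrete, here is a minimal sketch of a contig check that needs more than one mismatching hash before calling a contig dirty. Everything in it is hypothetical: `contig_is_dirty`, `min_hashes`, `min_fraction` and their default values are made up for illustration and are not charcoal's actual code or parameters.

```python
def contig_is_dirty(hash_lineages, genome_genus, min_hashes=3, min_fraction=0.5):
    """Decide whether a contig should be flagged as contamination.

    hash_lineages: one genus-level label per classified hash in the contig.
    genome_genus:  the genus assigned to the genome as a whole.
    min_hashes / min_fraction: hypothetical thresholds, not charcoal defaults.
    """
    if not hash_lineages:
        # No classified hashes means no evidence either way; keep the contig.
        return False

    mismatched = [g for g in hash_lineages if g != genome_genus]

    # Require a minimum absolute number of mismatching hashes ...
    if len(mismatched) < min_hashes:
        return False

    # ... and require that they make up enough of the classified hashes.
    return len(mismatched) / len(hash_lineages) >= min_fraction
```

With these made-up defaults, the single-hash case from the report above would be kept; passing `min_hashes=1` reproduces the current, stricter behavior.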

ctb commented 4 years ago

an alternative is to provide two modes, 'strict' and 'permissive'; default to 'permissive'; suggest that people look at the reports, and adjust as needed.

we could also permit/provide genome-specific configuration of this and other parameters (see #30 for another situation where this might be needed). that is, provide sensible defaults and let people fine-tune them across the entire data set and/or for specific genomes.
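A rough sketch of how presets plus overrides could layer, assuming a simple "preset, then data-set-wide settings, then per-genome settings" precedence. The preset contents, the parameter names (`min_hashes`, `match_rank`) and the `resolve_params` helper are all hypothetical and only meant to show the layering, not any existing charcoal config format.

```python
# Hypothetical presets; the parameter names and values are illustrative only.
PRESETS = {
    "strict": {"min_hashes": 1, "match_rank": "genus"},
    "permissive": {"min_hashes": 3, "match_rank": "family"},
}


def resolve_params(genome_name, mode="permissive",
                   dataset_overrides=None, genome_overrides=None):
    """Return the effective parameters for one genome.

    Later layers win: preset < data-set-wide overrides < per-genome overrides.
    """
    params = dict(PRESETS[mode])            # copy so the preset is not mutated
    params.update(dataset_overrides or {})  # settings for the whole data set
    per_genome = (genome_overrides or {}).get(genome_name, {})
    params.update(per_genome)               # settings for this genome only
    return params
```

For example, `resolve_params("TARA_PSE_MAG_00111", genome_overrides={"TARA_PSE_MAG_00111": {"min_hashes": 5}})` would keep the permissive `match_rank` but tighten `min_hashes` for that one genome.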

ctb commented 4 years ago

with genome-specific configuration we could also target removal of specific lineages, e.g. if you see species s__NORP36, kill the contig. this starts to get into fine detail tho :(.

ctb commented 4 years ago

(I kind of worry that people are going to look at the final gather report on the clean contigs and say, "...but I want 100% clean! why can't I have 100% clean?")

ctb commented 4 years ago

if we provided a genome-specific configuration we could allow people to forcibly keep specific contigs, which might be useful (e.g. if they've been validated in some way)
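A small sketch of how the two per-genome lists discussed in this thread (lineages to always remove, contigs to always keep) could sit on top of the default decision. The function name, arguments and return values are hypothetical, and letting a forced keep win over a lineage match is an assumption, not a settled design.

```python
def decide_contig(contig_name, contig_lineage, default_dirty,
                  keep_contigs=(), remove_lineages=()):
    """Return "keep" or "remove" for one contig.

    contig_lineage:  taxon names assigned to the contig, e.g. ("g__X", "s__NORP36").
    default_dirty:   what the normal decontamination logic decided.
    keep_contigs:    contig names the user has validated and wants kept.
    remove_lineages: taxon names that should always trigger removal.
    """
    # A user-validated contig is kept no matter what (assumed precedence).
    if contig_name in keep_contigs:
        return "keep"

    # Any targeted lineage anywhere in the contig's lineage forces removal.
    if any(name in remove_lineages for name in contig_lineage):
        return "remove"

    # Otherwise fall back to the default decision.
    return "remove" if default_dirty else "keep"
```

So a contig carrying `s__NORP36` would be removed even if the default logic called it clean, unless its name also appears in `keep_contigs`.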