revisit/discuss genome-grist k-mer trimming

Over in https://github.com/dib-lab/genome-grist/pull/107#issuecomment-1019274050, @mr-eyes asked:

Wanted to ask, was there a specific criterion for selecting the abundtrim and trimming parameters? I can't imagine how it will biologically affect the results.

Did you take a look at https://peerj.com/preprints/890/?

More to the point, this trimming is not important for either sourmash gather or mapping, which are the two primary read-based analyses that genome-grist does. Read mapping does not rely on k-mer abundances at all, and sourmash gather is reference-based and lightweight, so it basically doesn't care if there are lots of erroneous k-mers hanging out in the data set.

However, doing some kind of k-mer abundance trimming is important for cDBG-graph approaches like spacegraphcats. This is because every erroneous k-mer fragments the cDBG.
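To make the fragmentation point concrete, here is a minimal Python sketch (not spacegraphcats or khmer code) that counts how many k-mers a single substitution error adds. The sequences, the value of k, and the helper names are made up for illustration, and reverse complements are ignored for simplicity.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers(reference, read, k):
    """k-mers present in the read but absent from the reference."""
    return kmers(read, k) - kmers(reference, k)


# Hypothetical toy data: one substitution (T -> G) at position 12.
k = 5
reference = "ACGTTGCAAGGCTTACCGATAGGCTA"
read = reference[:12] + "G" + reference[13:]

novel = novel_kmers(reference, read, k)
print(len(novel), sorted(novel))
```

A single error in the middle of a read can touch up to k overlapping k-mers, and each of those becomes a new node (or tip/bubble) in a de Bruijn graph built from the data.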

So it is nice to have genome-grist download the SRA metagenome and preprocess it for "free".

The default parameters for trim-low-abund specify that only reads with an estimated k-mer coverage of 18 or higher will be trimmed (-Z 18 -V), and that those reads are trimmed at k-mers with an abundance of 2 or lower (-C 3). Reads below that coverage are left untouched, so there should be no "loss" of k-mers from low-abundance reads, which are important to retain for metagenomes.
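The logic described above can be sketched roughly as follows. This is a simplified, in-memory toy, not khmer's actual streaming implementation: the function names are hypothetical, exact k-mer counts stand in for khmer's probabilistic counting, and the median k-mer abundance of a read is used as the coverage estimate. The `cutoff` and `trim_at_coverage` arguments are meant to mirror -C and -Z.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Exact k-mer counts across all reads (toy stand-in for a count table)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def trim_read(read, counts, k, cutoff=3, trim_at_coverage=18):
    """Trim a read at its first low-abundance k-mer, but only if the read
    itself looks high-coverage; low-coverage reads are returned unchanged."""
    abunds = [counts[read[i:i + k]] for i in range(len(read) - k + 1)]
    if not abunds or median(abunds) < trim_at_coverage:
        return read  # low-coverage (or too-short) read: leave untouched
    for i, abund in enumerate(abunds):
        if abund < cutoff:
            # keep bases up to the end of the last trusted k-mer
            return read[:i + k - 1]
    return read


# Hypothetical toy data set.
k = 5
true_seq = "ACGTTGCAAGGCTTACCGATAGGCTA"
error_read = true_seq[:22] + "T" + true_seq[23:]   # one error near the 3' end
rare_read = "TTTTGGGGCCCCAAAATTGGCCAA"             # low-coverage organism

reads = [true_seq] * 20 + [error_read, rare_read]
counts = count_kmers(reads, k)

trimmed_error = trim_read(error_read, counts, k)
trimmed_rare = trim_read(rare_read, counts, k)
print(trimmed_error)
print(trimmed_rare)
```

In this toy, the high-coverage read with an error is truncated just before the erroneous base, while the read from the rare organism passes through unchanged even though all of its k-mers have abundance 1.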

We've used these parameters in a lot of publications and they were chosen and evaluated ages ago. I now have a much better intuition (and we have a lot more data and experience!) and I'm not sure there's a strong reason to revisit them now, but I'm game if someone has criteria on which to evaluate them. It'd be reassuring if nothing else ;)

revisit/discuss genome-grist k-mer trimming #141

#199 removes abundance trimming from the default genome-grist workflow. Leaving this open until I integrate it into the docs better.