Wanted to ask, was there a specific criterion for selecting the abundtrim and trimming parameters? I can't imagine how it will biologically affect the results.
More: this trimming is not important for either sourmash gather or mapping, which are the two primary read-based analyses that genome-grist does. Read mapping is a different kind of approach from k-mer methods, and sourmash gather is reference-based and lightweight, so it basically doesn't care if there are lots of erroneous k-mers hanging out in the data set.
However, doing some kind of k-mer abundance trimming is important for compact de Bruijn graph (cDBG) approaches like spacegraphcats. This is because every erroneous k-mer fragments the cDBG.
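To make the fragmentation point concrete, here is a toy sketch (not spacegraphcats code). It assumes a forward-strand-only de Bruijn graph with k=5, and the helper names `kmer_set` and `count_unitigs` are made up for illustration. It counts the unitigs (the nodes of the compacted graph) before and after adding one read with a single substitution error.

```python
def kmer_set(reads, k):
    """Collect the distinct k-mers in a list of reads (forward strand only)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def count_unitigs(kmers):
    """Count maximal non-branching paths in a toy de Bruijn graph.

    Nodes are k-mers; an edge joins two k-mers that overlap by k-1 bases.
    A k-mer starts a new unitig unless it has exactly one predecessor and
    that predecessor has exactly one successor. (Isolated cycles are
    ignored in this simplification.)
    """
    def successors(km):
        return [km[1:] + b for b in "ACGT" if km[1:] + b in kmers]

    def predecessors(km):
        return [b + km[:-1] for b in "ACGT" if b + km[:-1] in kmers]

    starts = 0
    for km in kmers:
        preds = predecessors(km)
        if len(preds) != 1 or len(successors(preds[0])) != 1:
            starts += 1
    return starts


K = 5
genome_read = "ACGTTGCAAGCT"
error_read = "ACGTTGAAAGCT"  # same read with one substitution (C -> A)

print(count_unitigs(kmer_set([genome_read], K)))              # 1
print(count_unitigs(kmer_set([genome_read, error_read], K)))  # 4
```

In this toy case a single erroneous base turns one linear path into a bubble, so one cDBG node becomes four. Real data has errors scattered across many reads, which is why trimming matters for graph-based approaches.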
So it is nice to have genome-grist download the SRA metagenome and preprocess it for "free".
The default parameters for the trim-low-abund step specify that only reads with an estimated k-mer coverage of 18 or higher will be trimmed (`-Z 18 -V`), and that those reads are trimmed at k-mers with an abundance of 2 or lower (`-C 3`). Reads below that coverage are left untouched, so there should be no "loss" of k-mers from low-abundance reads, which are important to retain for metagenomes.
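As a rough illustration of what those two thresholds do, here is a simplified, non-streaming sketch. It is not the actual trim-low-abund implementation: it uses exact k-mer counts, a small k, and takes the median k-mer abundance of a read as its coverage estimate, all of which are assumptions made for the example. The function and parameter names (`trim_reads`, `cutoff`, `trim_at_coverage`) are hypothetical stand-ins for `-C` and `-Z`.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return the k-mers of seq in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_reads(reads, k=5, cutoff=3, trim_at_coverage=18):
    """Toy variable-coverage abundance trimming.

    Reads whose median k-mer abundance is below trim_at_coverage are kept
    untouched. High-coverage reads are truncated just before the first
    k-mer whose abundance is below cutoff.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    trimmed = []
    for read in reads:
        abunds = [counts[km] for km in kmers(read, k)]
        if not abunds or median(abunds) < trim_at_coverage:
            trimmed.append(read)  # low coverage: leave the read alone
            continue
        for i, abund in enumerate(abunds):
            if abund < cutoff:
                # keep the bases before the last base of the low-abundance k-mer
                read = read[:i + k - 1]
                break
        trimmed.append(read)
    return trimmed


reads = ["ACGTTGCAAGCT"] * 20          # high-coverage genome
reads.append("ACGTTGCAAGCA")           # high-coverage read with an error at the end
reads.append("TTTTTGGGGGCCCCC")        # read from a low-abundance genome

result = trim_reads(reads)
print(result[20])  # ACGTTGCAAGC
print(result[21])  # TTTTTGGGGGCCCCC
```

The erroneous read loses only its final base, while the low-abundance read passes through unchanged even though every one of its k-mers has an abundance of 1. That is the behavior the variable-coverage setting is meant to give for metagenomes.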
We've used these parameters in a lot of publications and they were chosen and evaluated ages ago. I now have a much better intuition (and we have a lot more data and experience!) and I'm not sure there's a strong reason to revisit them now, but I'm game if someone has criteria on which to evaluate them. It'd be reassuring if nothing else ;)
The question at the top was asked by @mr-eyes over in https://github.com/dib-lab/genome-grist/pull/107#issuecomment-1019274050.