ctb opened this issue 9 years ago (status: Open)
I would like to highlight the importance of this problem. The current exome dataset I am working with for variant calling has 22% of its reads containing Ns. Different aligners and variant calling software would behave differently with such ambiguous reads.
My thinking has been that, going forward, we should pass Ns on unchanged as much as possible; it doesn't seem like it should be our job to "fix" them. @drtamermansour do you have a specific recommendation for what to do about Ns in the output, as opposed to counting them/not counting them internally?
Totally agree. This is not the job of diginorm. Subsequent trimming or error correction modules should handle this.
I'm skeptical that ANYTHING should be done automatically in the case of filtering before variant calling!
Yes. This is what I mean. I am saying that khmer as a package could provide other solutions for further pre-processing.
Just to follow through on this a bit more: there are three issues here: (1) how khmer handles Ns internally when hashing/counting, (2) what we do with Ns in the output, and (3) whether khmer should offer a way to "fix" Ns.
The answers so far seem to be:
(1) is what we're discussing in this issue;
(2) is "we do as little as possible";
(3) is open to discussion; so far the main idea is to provide error correction of some sort that replaces Ns with a corrected base.
The proper way to deal with this might be to make it depend on the hash function. Some hash functions will easily support more characters than others...
This is something we could handle differently by providing different hash functions (#27).
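A rough sketch of why the encoding matters (purely illustrative, not khmer's actual hashing code): with 2 bits per base there are only four codes, so N has nowhere to go, while 3 bits per base leaves room for it.

```python
# Hypothetical illustration only: a 2-bit k-mer encoding has exactly four
# codes, so N cannot be represented; a 3-bit encoding can hold it directly.
TWO_BIT = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
THREE_BIT = {'A': 0b000, 'C': 0b001, 'G': 0b010, 'T': 0b011, 'N': 0b100}

def encode(kmer, table, width):
    """Pack a k-mer into an integer, using `width` bits per base."""
    value = 0
    for base in kmer.upper():
        if base not in table:
            raise ValueError(f"cannot encode {base!r} with this table")
        value = (value << width) | table[base]
    return value

print(encode("ACGT", TWO_BIT, 2))     # fine with 2 bits per base
print(encode("ACGNT", THREE_BIT, 3))  # N only fits with 3 bits per base
# encode("ACGNT", TWO_BIT, 2) would raise ValueError
```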
Based on my most recent testing, it looks to me like khmer is ignoring reads containing Ns. This was discussed quite a bit in the various relevant threads, but there didn't seem to be consensus on whether this was the best approach. Maybe I missed something.
It's moot though if we switch to an encoding/hashfunction that supports ambiguous nucleotides.
How did you test this? The code in 'normalize-by-median' replaces Ns with As, so the behavior may well be specific to that script (which is suboptimal, yes).
Scripts may override this somehow, but the hashtable->consume_fasta function does indeed discard any sequence with an N in it. I first encountered this with my de novo simulation code, for which I linked directly against the C++ API for performance reasons. But I just created a stripped-down example using the Python API to demonstrate the same behavior at the Python level. See https://gist.github.com/standage/474fc27f2a5607141717c180b9019852.
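The gist itself isn't reproduced here, but a minimal sketch of the kind of check involved might look like this (assuming khmer's Python API with `Countgraph.consume()` and `.get()`; exact behavior on Ns differs between khmer versions, hence the try/except):

```python
import khmer

K = 21
cg = khmer.Countgraph(K, int(1e6), 4)

# one N in the middle of an otherwise clean read
read_with_n = "ACGT" * 8 + "N" + "ACGT" * 8

try:
    n_kmers = cg.consume(read_with_n)
    print("k-mers consumed from N-containing read:", n_kmers)
    # query a k-mer taken from after the N; 0 would mean the read was skipped
    print("count of a clean k-mer from that read:", cg.get(read_with_n[-K:]))
except ValueError as err:
    print("consume() rejected the read outright:", err)
```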
Indeed it does.
There are a few possible solutions. If there is some reason to stick with the current 2-bit representation, we can simply consume all k-mers that contain only ACGT. If we go with a 3+ bit representation, we can actually consume and handle Ns, and maybe other IUPAC ambiguity symbols.
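A sketch of the first option under the current 2-bit representation (the `consume` callable here is just a stand-in for whatever does the counting, e.g. a Countgraph method):

```python
import re

NON_ACGT = re.compile(r"[^ACGT]+")

def consume_acgt_kmers(seq, k, consume):
    """Split `seq` at non-ACGT characters and consume each clean stretch.

    No k-mer spanning an N (or other ambiguity code) is ever counted, but
    the flanking sequence still contributes. Returns the number of k-mers
    dropped because they would have contained a non-ACGT character.
    """
    seq = seq.upper()
    total_kmers = max(0, len(seq) - k + 1)
    consumed = 0
    for stretch in NON_ACGT.split(seq):
        if len(stretch) >= k:
            consume(stretch)
            consumed += len(stretch) - k + 1
    return total_kmers - consumed

# e.g. with a khmer Countgraph: consume_acgt_kmers(read, 21, cg.consume)
```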
Relevant to #1441.
Order of magnitude, how many other letters/symbols are there?
Going forward it definitely seems like a good thing to keep around other letters instead of discarding them (silently).
There are on the order of 16 symbols (see here). In fact, there are exactly 16 if we ignore the Z/zero symbol, which 1) I've never heard of before or seen being used and 2) doesn't make sense in the context of DNA sequencing.
In all honesty, everything but {A, C, G, T, N} is quite rare, but some assemblers (such as AllPaths-LG) report them at a non-trivial rate.
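For reference, these are the standard IUPAC ambiguity codes beyond A/C/G/T (a plain mapping, nothing khmer-specific):

```python
# Standard IUPAC nucleotide ambiguity codes (beyond A, C, G, T/U).
IUPAC_AMBIGUITY = {
    "R": "AG",  "Y": "CT",  "S": "CG",  "W": "AT",
    "K": "GT",  "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
```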
There is no sense in paying any attention to anything other than ACGT, I think; there is no sensible and general way to load the ambiguity codes into a DBG. They really mean "wildcard here", which doesn't fit with anything we want to do.
However, I think we should refrain from eliminating or modifying the original sequence wherever we can avoid it. So, for example, in trimming, we can make trimming decisions based on an uppercased copy with all weird characters converted to (for example) G, but we should not change the case or the characters in what we output (sketched below). Same goes for diginorm, partitioning, and much of our current functionality.
We should try to make it clear(er) when we are ignoring or transforming sequence, though.
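A minimal sketch of that "decide on a cleaned copy, emit the original" pattern (the record layout and helper names here are assumptions, not khmer's actual trimming code):

```python
import re

NON_ACGT = re.compile(r"[^ACGT]")

def cleaned_for_decisions(sequence):
    """Uppercase and replace anything outside ACGT with G; used only for
    making keep/trim decisions, never written to output."""
    return NON_ACGT.sub("G", sequence.upper())

def maybe_keep(record, keep_decision, output):
    """`keep_decision` sees the cleaned copy; `output` sees the original."""
    if keep_decision(cleaned_for_decisions(record["sequence"])):
        output(record)  # case and characters preserved exactly as read
```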
On Tue, Sep 13, 2016 at 03:59:43AM -0700, Tim Head wrote:
Going forward it definitely seems like a good thing to keep around other letters instead of discarding them (silently).
I think ignoring them is fine, and discarding them silently within the Python and C++ code is necessary. But we need to provide info via functions, report aggregate stats to the user, and streamline our input handling code so that we're using the same function(s) for preprocessing.
Side note: there are various classes in trim-low-abund that are quite handy for read processing, and we could/should extract them (see #1262) and make them available for scripts more generally. In trim-low-abund I'm thinking specifically of ReadBundle and clean_up_reads.
This code could then be a single point of change to do things "correctly", is my real point (a rough sketch of the idea follows).
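ReadBundle and clean_up_reads live in the actual trim-low-abund code and aren't reproduced here; as a purely hypothetical sketch, a shared entry point with aggregate stats might look roughly like this:

```python
class CleaningStats:
    """Hypothetical aggregate counters a shared preprocessing helper could
    expose; not the actual ReadBundle / clean_up_reads interface."""
    def __init__(self):
        self.reads_seen = 0
        self.reads_with_n = 0

def clean_reads(records, stats):
    """Single point of change for input preprocessing: track how many reads
    contain Ns, but yield each record through unchanged."""
    for record in records:
        stats.reads_seen += 1
        if "N" in record["sequence"].upper():
            stats.reads_with_n += 1
        yield record

# After processing:
#   print(f"{stats.reads_with_n} of {stats.reads_seen} reads contained Ns")
```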
see #1001 for inspiration.
Currently, we turn Ns into As (explicitly) or Gs (by default).
@drtamermansour suggests that we ignore reads containing Ns (see https://github.com/dib-lab/khmer/pull/1001#issuecomment-104034103)
We could also ignore k-mers containing Ns but count the rest of the read.
Whatever we do, I think we must output the number of skipped k-mers and/or reads; otherwise this behavior will be opaque to the user.
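Pulling the options above together, a hypothetical dispatcher (function and policy names invented for illustration) that also feeds the skipped-read/k-mer counters:

```python
def handle_ns(read, k, consume, policy, stats):
    """Hypothetical dispatcher over the options above. `consume` is whatever
    counts k-mers; `stats` collects the skipped counts that should be
    reported back to the user."""
    seq = read.upper()
    if "N" not in seq:
        consume(seq)
    elif policy == "replace-with-A":
        consume(seq.replace("N", "A"))
    elif policy == "skip-read":
        stats["reads_skipped"] += 1
    elif policy == "skip-kmers":
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:
                stats["kmers_skipped"] += 1
            else:
                consume(kmer)

stats = {"reads_skipped": 0, "kmers_skipped": 0}
# e.g. handle_ns(read, 21, countgraph.consume, "skip-kmers", stats)
```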