how should we handle 'N's in DNA, in our scripts and our codebase?

ctb commented 9 years ago

see #1001 for inspiration.

Currently, we turn them into As (explicitly) or Gs (by default).

@drtamermansour suggests that we ignore reads containing Ns (see https://github.com/dib-lab/khmer/pull/1001#issuecomment-104034103)

We could also ignore k-mers containing Ns but count the rest of the read.

Whatever we do, I think we must output number of skipped k-mers and/or reads; otherwise this behavior will be opaque to the user.

drtamermansour commented 9 years ago

I would like to highlight the importance of this problem. The current exome dataset I am working with for variant calling has 22% of its reads containing Ns. Different aligners and variant calling software would behave differently with such ambiguous reads.

ctb commented 9 years ago

My thinking has been that, going forward, we should pass Ns on unchanged as much as possible; it doesn't seem like it should be our job to "fix" them. @drtamermansour do you have a specific recommendation for what to do about Ns in the output, as opposed to counting them/not counting them internally?

drtamermansour commented 9 years ago

Totally agree. This is not the job of diginorm. Subsequent trimming or error correction modules should handle this.

On Tue, Jun 2, 2015 at 6:46 AM, C. Titus Brown notifications@github.com wrote:

My thinking has been that, going forward, we should pass Ns on unchanged as much as possible; it doesn't seem like it should be our job to "fix" them. @drtamermansour https://github.com/drtamermansour do you have a specific recommendation for what to do about Ns in the output, as opposed to counting them/not counting them internally?

— Reply to this email directly or view it on GitHub https://github.com/dib-lab/khmer/issues/1036#issuecomment-107925917.

ctb commented 9 years ago

On Tue, Jun 02, 2015 at 04:50:41AM -0700, Tamer Mansour wrote:

Totally agree. This is not the job of diginorm. Subsequent trimming or error correction modules should handle this.

I'm skeptical that ANYTHING should be done automatically in the case of filtering before variant calling!

drtamermansour commented 9 years ago

Yes. This is what I mean. I am saying that khmer as a package could provide other solutions for further pre-processing.

On Tue, Jun 2, 2015 at 6:56 AM, C. Titus Brown notifications@github.com wrote:

On Tue, Jun 02, 2015 at 04:50:41AM -0700, Tamer Mansour wrote:

Totally agree. This is not the job of diginorm. Subsequent trimming or error correction modules should handle this.

I'm skeptical that ANYTHING should be done automatically in the case of filtering before variant calling!

— Reply to this email directly or view it on GitHub https://github.com/dib-lab/khmer/issues/1036#issuecomment-107927531.

ctb commented 9 years ago

Just to follow through on this a bit more: there are three issues.

How do we handle reads containing Ns internally?
How do we transform reads containing Ns by default?
What options do we provide for transforming Ns in reads for those who want to do so?

The answers so far seem to be:

(1) is what we're discussing in this issue;

(2) is "we do as little as possible";

(3) is open to discussion; so far the main idea is to provide error correction of some sort that replaces Ns with a corrected base.

ctb commented 8 years ago

The proper way to deal with this might be to make it depend on the hash function. Some hash functions will easily support more characters than others...

ctb commented 8 years ago

This is something we could handle differently by providing different hash functions (#27).

standage commented 8 years ago

Based on my most recent testing, it looks to me like khmer is ignoring reads containing Ns. This was discussed quite a bit in the various relevant threads, but there didn't seem to be consensus on whether this was the best approach. Maybe I missed something.

It's moot though if we switch to an encoding/hashfunction that supports ambiguous nucleotides.

ctb commented 8 years ago

On Tue, Sep 06, 2016 at 09:48:21PM -0700, Daniel Standage wrote:

Based on my most recent testing, it looks to me like khmer is ignoring reads containing Ns. This was discussed quite a bit in the various relevant threads, but there didn't seem to be consensus on whether this was the best approach. Maybe I missed something.

It's moot though if we switch to an encoding/hashfunction that supports ambiguous nucleotides.

How did you test this? The code in 'normalize-by-median' replaces Ns with As, so it may well be specific to script (which is suboptimal, yes).

standage commented 8 years ago

Scripts may override this somehow, but the hashtable->consume_fasta function does indeed discard any sequence with an N in it. I first encountered this with my de novo simulation code, for which I linked directly against the C++ API for performance reasons. But I just created a stripped-down example using the Python API to demonstrate the same behavior at the Python level. See https://gist.github.com/standage/474fc27f2a5607141717c180b9019852.

ctb commented 8 years ago

Indeed it does.

standage commented 8 years ago

There are a few possible solutions. It there is some reason to stick with the current 2-bit representation, we can simply consume all k-mers that contain only ACGT. If we go with a 3+ bit representation, we can actually consume and handle Ns, and maybe other IUPAC ambiguity symbols.

Relevant to #1441.

betatim commented 8 years ago

Order of magnitude, how many other letters/symbols are there?

Going forward it definitely seems like a good thing to keep around other letters instead of discarding them (silently).

standage commented 8 years ago

There are on the order of 16 symbols (see here). In fact, there are exactly 16 if we ignore the Z/zero symbol, which 1) I've never heard of before or seen being used and 2) doesn't make sense in the context of DNA sequencing.

In all honesty, everything but {A, C, G, T, N} is quite rare, but some assemblers (such as AllPaths-LG) report them at a non-trivial rate.

ctb commented 8 years ago

There is no sense in paying any attention to anything other than ACGT, I think; there is no sensible and general way to load them into a DBG. They really mean "wildcard here" which doesn't fit with anything we want to do.

However, I think we should refrain from eliminating or modifying the original sequence in any way when we can do so. So, for example, in trimming, we can make trimming decisions based on uppercasing and converting all weird characters to (for example) G, but we should not change the case or the characters in what we output. Same for diginorm, and partitioning, and much of our current functionality.

We should try to make it clear(er) when we are ignoring or transforming sequence, though.

ctb commented 8 years ago

On Tue, Sep 13, 2016 at 03:59:43AM -0700, Tim Head wrote:

Going forward it definitely seems like a good thing to keep around other letters instead of discarding them (silently).

I think ignoring them is fine, and discarding them silently within the Python and C++ code is necessary. But we need to provide info via functions, alert the user to aggregate stats, and streamline our input handling code so that we're using the same function(s) to preprocess stuff.

ctb commented 8 years ago

Side note: there are various classes in trim-low-abund that are quite handy for read processing, and we could/should extract them (see #1262) and make them available for scripts more generally.

In trim-low-abund I'm thinking specifically of ReadBundle and clean_up_reads.

This code could then be a single point of change to do things "correctly", is my real point.

dib-lab / khmer

how should we handle 'N's in DNA, in our scripts and our codebase? #1036