elegant-scipy / elegant-scipy-submissions

Submissions of code snippets for the book Elegant SciPy

Identifying errors in shotgun sequencing using word frequencies #11

Open ctb opened 9 years ago

ctb commented 9 years ago

The diginorm and streaming error trimming code snippets are both nice and clean and easy to understand -- see

https://github.com/ctb/2015-experimental-graphalign/blob/master/khmer_api.py

functions 'diginorm' and 'streamtrim' for pure Python implementations. I can produce much simpler versions that skip one biological detail (paired-end reads).
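For a rough idea of what a simplified, single-end version of digital normalization might look like, here is a minimal sketch. It is not the code from the linked khmer_api.py: it assumes an exact in-memory `Counter` for k-mer counts, and the names `kmers`, `k`, and `cutoff` are illustrative choices. A read is kept only if the median count of its k-mers is still below the cutoff.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def diginorm(reads, k=4, cutoff=2):
    """Yield only the reads whose median k-mer count is below cutoff.

    Assumption: an exact Counter holds the k-mer counts, which is only
    practical for small inputs.
    """
    counts = Counter()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k, nothing to count
        if median(counts[w] for w in words) < cutoff:
            counts.update(words)  # only kept reads contribute to the counts
            yield read


reads = ["ACGTTGCAAG"] * 5 + ["TTTTGGGGCC"]
kept = list(diginorm(reads, k=4, cutoff=2))
```

With this toy input, the redundant copies of the first read are dropped once its k-mers have been seen `cutoff` times, while the novel read passes through.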

The latter is a nice way to talk about identifying errors in high-coverage data, and is the subject of this preprint:

https://peerj.com/preprints/890/

I can put both in a less biological context pretty easily; let me know what you think.

hdashnow commented 9 years ago

Thanks @ctb

jni commented 9 years ago

@ctb very nice, thanks! Yes, I'd say paired-ends complicate things a bit too much for our readers. I also have no idea what the bool flags being yielded mean! But I should probably read the paper before I try to grok this.

To someone who is quite naive about real-world genomics, how feasible do you think it is to feed the stream into networkx to build a De Bruijn graph and do assembly with that? As with the paired reads, I'm not interested in handling the edge cases, but rather in nice, concise code that will pretty much work.

ctb commented 9 years ago

On Tue, Apr 28, 2015 at 04:59:16PM -0700, Juan Nunez-Iglesias wrote:

> @ctb very nice, thanks! Yes, I'd say paired-ends complicate things a bit too much for our readers. I also have no idea what the bool flags being yielded mean! But I should probably read the paper before I try to grok this.
>
> To someone who is quite naive about real-world genomics, how feasible do you think it is to feed the stream into networkx to build a De Bruijn graph and do assembly with that? As with the paired reads, I'm not interested in handling the edge cases, but rather in nice, concise code that will pretty much work.

I don't know much about networkx, but at least for linear sequences doing assembly is pretty straightforward.
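To make the linear case concrete, here is a minimal sketch of De Bruijn assembly using plain dictionaries rather than networkx, so it runs without extra dependencies. The same (k-1)-mer edges could presumably be added to a `networkx.DiGraph` instead. It assumes error-free reads, a single linear sequence, and no repeated (k-1)-mers; the function names are hypothetical.

```python
def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    edges = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.setdefault(kmer[:-1], set()).add(kmer[1:])
    return edges


def assemble_linear(reads, k):
    """Walk the unique unbranched path through the graph and spell out the contig.

    Assumption: the graph is one simple path (no errors, repeats, or branches).
    """
    edges = de_bruijn_edges(reads, k)
    targets = {t for successors in edges.values() for t in successors}
    starts = [node for node in edges if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node for a linear sequence")
    node = starts[0]
    contig = node
    seen = {node}
    # Follow edges while the path is unambiguous and has not looped back.
    while len(edges.get(node, ())) == 1:
        (nxt,) = edges[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


genome = "ACGTTGCAAGTC"
reads = [genome[0:7], genome[3:10], genome[5:12]]  # overlapping error-free reads
contig = assemble_linear(reads, k=4)
```

Here the three overlapping reads cover every 4-mer of the toy genome, so the walk recovers the full sequence. Real data would need handling for branches, errors, and reverse complements, which this sketch deliberately ignores.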

The question I would ask is, what is a use case for streaming that's reasonably consonant with real-world practice? Assembly itself is still an offline problem, although we're working on it. The per-position error stuff in the streaming paper may be something that's of interest (if errors occur with higher frequency towards the end of the sequence, we can detect that with a streaming sublinear time/memory algorithm). Trimming is the other real-world app.
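As a rough illustration of the trimming and per-position error ideas, here is a minimal single-pass sketch. It is an assumption-laden simplification and not the algorithm from the preprint: it uses an exact `Counter`, and `coverage` and `min_count` are hypothetical parameter names. Once a read's median k-mer count reaches `coverage`, the read is cut at the first k-mer seen fewer than `min_count` times, and the estimated error position is reported alongside it.

```python
from collections import Counter
from statistics import median


def streamtrim(reads, k=4, coverage=3, min_count=2):
    """Yield (read, error_position) pairs in a single pass over reads.

    error_position is None when the read is passed through untrimmed.
    Assumption: exact k-mer counts kept in memory, single-end reads only.
    """
    counts = Counter()
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not words:
            continue  # read shorter than k
        counts.update(words)
        if median(counts[w] for w in words) < coverage:
            yield read, None  # coverage too low to judge this read yet
            continue
        for i, word in enumerate(words):
            if counts[word] < min_count:
                # The first rare k-mer starts at i; its last base is the likely error.
                position = i + k - 1
                yield read[:position], position
                break
        else:
            yield read, None  # high coverage and no rare k-mers


reads = ["ACGTTGCAAG"] * 4 + ["ACGTTGCATG"]  # last read has a substitution at index 8
trimmed = list(streamtrim(reads))
```

Collecting the reported positions over many reads would give a crude per-position error profile, for example to check whether errors pile up toward the ends of reads.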