Open ctb opened 9 years ago
Thanks @ctb
@ctb very nice, thanks! Yes, I'd say paired-ends complicate things a bit too much for our readers. I also have no idea what the bool flags being yielded mean! But I should probably read the paper before I try to grok this.
To someone who is quite naive about real-world genomics, how feasible do you think it is to feed the stream into networkx to build a De Bruijn graph and do assembly with that? As with the paired reads, I'm not interested in handling the edge cases, but rather in nice, concise code that will pretty much work.
On Tue, Apr 28, 2015 at 04:59:16PM -0700, Juan Nunez-Iglesias wrote:
> @ctb very nice, thanks! Yes, I'd say paired-ends complicate things a bit too much for our readers. I also have no idea what the bool flags being yielded mean! But I should probably read the paper before I try to grok this.
> To someone who is quite naive about real-world genomics, how feasible do you think it is to feed the stream into networkx to build a De Bruijn graph and do assembly with that? As with the paired reads, I'm not interested in handling the edge cases, but rather in nice, concise code that will pretty much work.
I don't know much about networkx, but at least for linear sequences, doing assembly is pretty straightforward.
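As a rough illustration of the linear case, here is a minimal sketch that builds a De Bruijn graph from error-free reads and walks its single unbranched path. It uses a plain dict rather than networkx so that it runs with the standard library alone (a `networkx.DiGraph` could presumably hold the same edges). The function names, the toy reads, and `k=4` are assumptions made for the example, and it ignores reverse complements, sequencing errors, and repeats.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_linear(graph):
    """Walk the single unbranched path; raise on anything non-linear."""
    successors = set().union(*graph.values())
    starts = [node for node in graph if node not in successors]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    contig = node
    seen = {node}
    while node in graph:
        nexts = graph[node]
        if len(nexts) != 1:
            raise ValueError("graph branches at " + node)
        (node,) = nexts
        if node in seen:
            raise ValueError("graph contains a cycle at " + node)
        seen.add(node)
        contig += node[-1]  # each step extends the contig by one base
    return contig


# Toy, error-free reads that overlap along one made-up sequence.
reads = ["ATGGCGT", "GGCGTGCA", "GTGCAAT"]
contig = assemble_linear(build_graph(reads, k=4))
```

The sketch deliberately raises on branches and cycles instead of resolving them, since those are exactly the edge cases that make real assembly hard.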
The question I would ask is, what is a use case for streaming that's reasonably consonant with real-world practice? Assembly itself is still an offline problem, although we're working on it. The per-position error stuff in the streaming paper may be something that's of interest (if errors occur with higher frequency towards the end of the sequence, we can detect that with a streaming sublinear time/memory algorithm). Trimming is the other real-world app.
The diginorm and streaming error trimming code snippets are both nice and clean and easy to understand -- see
https://github.com/ctb/2015-experimental-graphalign/blob/master/khmer_api.py
functions 'diginorm' and 'streamtrim' for pure Python implementations. I can produce much simpler versions that don't handle one biological detail (paired-end reads).
The latter (streaming error trimming) is kind of a nice way to talk about identifying errors in high-coverage data, and is the subject of this preprint:
https://peerj.com/preprints/890/
I can put both in a less biological context pretty easily; let me know what you think.
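To give a feel for what a simplified, single-ended diginorm could look like as a streaming generator, here is a minimal sketch. It is not the code from `khmer_api.py`: the function name `diginorm_sketch`, the exact `Counter` (standing in for whatever memory-efficient counting structure a real implementation would use), and the values of `k` and `cutoff` are all assumptions chosen for illustration. The idea is to keep a read only while the median count of its k-mers seen so far is below the cutoff.

```python
from collections import Counter
from statistics import median


def diginorm_sketch(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer count so far is below `cutoff`.

    Hypothetical simplification: exact counts, single-ended reads,
    no reverse complements. Default values are illustrative only.
    """
    counts = Counter()
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read is shorter than k; nothing to count
        # Looking up a missing key in a Counter returns 0 without storing it.
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads add to coverage
            yield read


# Five copies of one toy read plus one novel read; small k for the demo.
stream = ["ACGTTGCA"] * 5 + ["GGATCCAA"]
kept = list(diginorm_sketch(stream, k=4, cutoff=2))
```

Because the function is a generator, it makes one pass over the stream and never holds more than one read at a time, although the exact counter here still grows with the number of distinct k-mers.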