dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Introduce consume function(s) with a mask #1637

Open standage opened 7 years ago

standage commented 7 years ago

It would be helpful to have a consume_seqfile function (and maybe a consume_string function) that accepts a mask as an argument and does not consume a sequence if it has a perfect match in the mask. Underlying implementation could be naive (using std::string::find to search against chromosome sequences stored as a collection of std::string objects) or more clever (using something like a suffix tree).
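A rough sketch of the naive variant, written in plain Python rather than against khmer itself. The function names, the `Counter` standing in for a count table, and the `mask_seqs` argument are all hypothetical. Python's `in` operator on strings plays the role of `std::string::find`, and reverse complements are ignored for brevity.

```python
from collections import Counter


def in_mask(sequence, mask_seqs):
    """Return True if `sequence` occurs verbatim in any mask sequence.

    Naive linear scan over every mask string (hypothetical helper).
    """
    return any(sequence in chrom for chrom in mask_seqs)


def consume_string_with_mask(sequence, counts, k, mask_seqs):
    """Count the k-mers of `sequence` unless it has a perfect match in the mask.

    `counts` is a Counter used as a stand-in for a khmer count table.
    Returns the number of k-mers consumed.
    """
    if in_mask(sequence, mask_seqs):
        return 0  # masked: skip the whole sequence
    n_consumed = 0
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
        n_consumed += 1
    return n_consumed


mask = ["ACGTACGTTT"]           # e.g. chromosome sequences
table = Counter()
masked = consume_string_with_mask("GTACG", table, 3, mask)    # substring of the mask
unmasked = consume_string_with_mask("GGGCC", table, 3, mask)  # not in the mask
```

A suffix tree or similar index would replace only `in_mask`; the consume logic would stay the same.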

standage commented 7 years ago

One consideration about architecture (since @ctb loves software architecting): you could think of this feature as a filter on the consume function. The banding stuff is currently implemented as a separate function, but could also be implemented as a filter. And unless we want an exponential explosion of consume functions (with or without banding, with or without a mask, with or without some other filter that becomes super critical a month from now) we should probably refactor the main consume functions to accept one or more filters somehow.
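To make the "filter on the consume function" idea concrete, here is a minimal sketch in plain Python, not khmer's actual API. It assumes a filter is just a predicate on a k-mer that returns True to keep it; `consume_string`, `no_ambiguous_bases`, and `not_in_mask` are hypothetical names, and a `Counter` stands in for the count table.

```python
from collections import Counter


def consume_string(sequence, counts, k, filters=()):
    """Count each k-mer of `sequence` that every filter accepts.

    `filters` is any iterable of predicates taking a k-mer string and
    returning True to keep it (an assumed convention, not khmer's).
    Returns the number of k-mers consumed.
    """
    n_consumed = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(accept(kmer) for accept in filters):
            counts[kmer] += 1
            n_consumed += 1
    return n_consumed


# Two independent, hypothetical filters.
def no_ambiguous_bases(kmer):
    return "N" not in kmer


masked_kmers = {"ACG"}


def not_in_mask(kmer):
    return kmer not in masked_kmers


table = Counter()
n_all = consume_string("ACGNT", Counter(), 3)                      # no filters
n_both = consume_string("ACGTT", table, 3,
                        filters=[no_ambiguous_bases, not_in_mask])  # both filters
```

With this shape, adding a new filter means writing one more predicate instead of one more consume variant for every combination.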

standage commented 7 years ago

The counter-argument would be the need to get something working and demonstrate that it's useful with as little friction as possible, and worry about code organization later.

betatim commented 7 years ago

How about creating functions in C++ that you call to configure and that return a filter, which you then pass in like consume_seqfile(..., filter=khmer.filter.select_band(8, 3))? From a software POV I like it... how many more of these cases can we think of? If there are only three, it may not be worth building this, but it would be if there is potentially a long list that people find useful.

Would be best to experiment a bit with the interface and how to keep things efficient. Certainly easier if we have cython, so I'd wait till that has landed.
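A sketch of what such a configurable filter factory might look like, in plain Python for illustration. `select_band` here is a hypothetical stand-in for the proposed `khmer.filter.select_band`. It assumes bands are 0-indexed and assigned by hash value modulo the number of bands, and it uses `zlib.crc32` as a placeholder hash rather than khmer's own hash function.

```python
import zlib


def select_band(num_bands, band):
    """Return a predicate that accepts only k-mers falling in one band.

    Assumptions (not taken from khmer): bands are numbered from 0, and a
    k-mer's band is crc32(kmer) modulo num_bands.
    """
    if not 0 <= band < num_bands:
        raise ValueError("band must satisfy 0 <= band < num_bands")

    def accept(kmer):
        return zlib.crc32(kmer.encode("ascii")) % num_bands == band

    return accept


# Configure once, then hand the resulting callable to a consume function.
band_filter = select_band(8, 3)
kmers = ["ACG", "CGT", "GTT", "TTA", "TAC"]
kept = [kmer for kmer in kmers if band_filter(kmer)]
```

Because the factory returns an ordinary callable, the same mechanism could carry a mask filter or any other predicate without changing the consume signature.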

standage commented 7 years ago

Yeah, something like that could work, we would just need to make sure it could accept multiple filters. I certainly want to be able to toggle the banding independently of toggling the mask.

As far as timing, agreed. This isn't going anywhere near master until the cython branch is merged.

ctb commented 7 years ago

Additional thought: a gentle approach to this would be to incrementally refactor the C++ internals to allow this, much like we are doing with the k > 32/multiple hash functions.

camillescott commented 7 years ago

Take a look at the base NodeGatherer and kmer_filters.hh for an example of how to implement this sort of thing. One (as of yet untested...) benefit of this is you should be able to write function pointers in pure cython which can be cast to std::function.

