Open standage opened 7 years ago
One consideration about architecture (since @ctb loves software architecting): you could think of this feature as a filter on the consume function. The banding stuff is currently implemented as a separate function, but could also be implemented as a filter. And unless we want an exponential explosion of consume functions (with or without banding, with or without a mask, with or without some other filter that becomes super critical a month from now) we should probably refactor the main consume functions to accept one or more filters somehow.
The counter-argument would be the need to get something working and demonstrate that it's useful with as little friction as possible, and worry about code organization later.
How about creating functions in C++ that you can call to configure and that return a filter, which you then pass in: `consume_seqfile(..., filter=khmer.filter.select_band(8, 3))`? From a software POV I like it... how many more of these cases can we think of? If there are only three, it's maybe not worth building this, but if there is potentially a long list that people find useful, it is.
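To make the factory idea concrete, here is a rough pure-Python sketch. Everything in it is hypothetical: `select_band`, the toy `kmer_hash` (CRC32), and this stand-in `consume_string` are illustrations, not khmer's actual API. It also assumes that "banding" means splitting the hash space into `num_bands` equal ranges and keeping only k-mers whose hash lands in the chosen band.

```python
import zlib
from collections import Counter

HASH_SPACE = 2 ** 32  # range of zlib.crc32; stand-in for the real hash space


def kmer_hash(kmer):
    """Toy hash for illustration only (assumption: not khmer's hash function)."""
    return zlib.crc32(kmer.encode("ascii"))


def select_band(num_bands, band):
    """Hypothetical factory: configure once, get back a k-mer predicate."""
    if not 0 <= band < num_bands:
        raise ValueError("band must be in [0, num_bands)")

    def keep(kmer):
        # Map the hash onto one of num_bands equal slices of the hash space.
        return kmer_hash(kmer) * num_bands // HASH_SPACE == band

    return keep


def consume_string(counts, seq, k, filter=None):
    """Count the k-mers of seq that pass the optional filter.

    The keyword is named `filter` only to mirror the call proposed above.
    Returns the number of k-mers consumed.
    """
    consumed = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if filter is None or filter(kmer):
            counts[kmer] += 1
            consumed += 1
    return consumed
```

Usage would look like `consume_string(Counter(), "ACGTACGTTAGC", 4, filter=select_band(8, 3))`. Since every k-mer falls in exactly one band, the per-band counts add up to the unfiltered count.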
Would be best to experiment a bit with the interface and how to keep things efficient. Certainly easier if we have cython, so I'd wait till that has landed.
Yeah, something like that could work, we would just need to make sure it could accept multiple filters. I certainly want to be able to toggle the banding independently of toggling the mask.
As far as timing, agreed. This isn't going anywhere near master until the cython branch is merged.
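One way to picture "multiple filters" is a consume function that takes a sequence of predicates and only counts a k-mer when all of them accept it. The sketch below is an assumption-laden illustration, not khmer code: `in_band` and `not_in_mask` are hypothetical names, the hash is a toy CRC32, and the mask is simplified to a set of k-mers (the mask requested in this issue works on whole sequences).

```python
import zlib
from collections import Counter


def in_band(num_bands, band):
    """Hypothetical banding filter built on a toy CRC32 hash (not khmer's)."""
    def keep(kmer):
        return zlib.crc32(kmer.encode("ascii")) * num_bands // 2 ** 32 == band
    return keep


def not_in_mask(mask):
    """Hypothetical mask filter: reject k-mers present in the set `mask`."""
    def keep(kmer):
        return kmer not in mask
    return keep


def consume_string(counts, seq, k, filters=()):
    """Count each k-mer of seq that every filter accepts."""
    consumed = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(f(kmer) for f in filters):
            counts[kmer] += 1
            consumed += 1
    return consumed
```

With this shape, banding and masking toggle independently: pass `filters=[in_band(4, 1)]`, `filters=[not_in_mask(mask)]`, both, or neither, without adding a new consume variant for each combination.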
Additional thought: a gentle approach to this would be to incrementally refactor the C++ internals to allow this, much like we are doing with the k > 32/multiple hash functions.
Take a look at the base NodeGatherer and kmer_filters.hh for an example of how to implement this sort of thing. One (as of yet untested...) benefit of this is you should be able to write function pointers in pure cython which can be cast to std::function.
-- Camille Scott
Graduate Group for Computer Science Lab for Data Intensive Biology University of California, Davis
camille.scott.w@gmail.com
It would be helpful to have a `consume_seqfile` function (and maybe a `consume_string` function) that accepts a mask as an argument and does not consume a sequence if it has a perfect match in the mask. Underlying implementation could be naive (using `std::string::find` to search against chromosome sequences stored as a collection of `std::string` objects) or more clever (using something like a suffix tree).
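For reference, the naive variant could look roughly like the Python sketch below, where the `in` operator on strings plays the role of `std::string::find`. The names `make_mask_filter` and `consume_unmasked` are hypothetical, and the sketch assumes forward-strand matching only (no reverse complement) and returns the surviving sequences rather than updating a real count table.

```python
def make_mask_filter(mask_seqs):
    """Return a predicate that is True when a sequence should be consumed,
    i.e. when it has no perfect match inside any mask sequence."""
    mask_seqs = list(mask_seqs)

    def should_consume(seq):
        # Naive linear scan: substring search against every mask sequence.
        return not any(seq in chrom for chrom in mask_seqs)

    return should_consume


def consume_unmasked(seqs, mask_seqs):
    """Stand-in for consumption: keep only sequences that pass the mask."""
    keep = make_mask_filter(mask_seqs)
    return [s for s in seqs if keep(s)]
```

The cost is one substring scan per read per mask sequence, which is why a suffix tree or similar index might pay off for large masks.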