dib-lab / khmer

In-memory nucleotide sequence k-mer counting, filtering, graph traversal and more
http://khmer.readthedocs.io/

Python API #776

Open alexjironkin opened 9 years ago

alexjironkin commented 9 years ago

Hi, I think it would be awesome to have khmer provide an API for other Python code to use. This is an extension of the conversation from this post and discussion.

Here is an example of what I mean. Let's take scripts/filter-abund.py. It would be useful to do something like this from external code:

from khmer import filter_abund

filtered_data = filter_abund(input_table, input_filename, force=True, cut_off=2)
# Do something with the result.

For this you will need filter_abund to look something like this:

def filter_abund(input_table, input_filename, **kwargs):
    # Code from the main() in filter-abund.py that does the work
    return name, trim_seq

or

def filter_abund(**kwargs):
    # This allows you to pass the args dictionary directly. As long as the
    # function's parameters are documented, I think it would work.
    ...

Which also means filter-abund.py becomes:

from khmer import filter_abund

def main():
    args = get_parser().parse_args()
    # ...
    result = filter_abund(args.input_table, args.input_filename, force=args.force)
    # or unpack the parsed arguments into keyword arguments
    result = filter_abund(**vars(args))
    # Print result/output to the file

Currently, there is a lot of code in khmer/__init__.py, which is where the above code would need to go to be importable directly from khmer. So maybe it can go into khmer.api instead, and the import becomes from khmer.api import filter_abund. Personally, I think that's a little ugly, but it could work. Perhaps an additional parameter could control whether to write output to a file, so that it is only written when the parameter is set to True?
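To make the optional file-output idea concrete, here is a minimal sketch under some loud assumptions. CountTable is a plain-Python stand-in for a counting table, not khmer's real class, and the trimming rule (cut each read at the first k-mer seen fewer than cutoff times) is a simplification of what filter-abund.py does. The output parameter is hypothetical; here it is a file object rather than a True/False flag, so the caller decides where the records go.

```python
from collections import Counter


class CountTable:
    """Hypothetical stand-in for a khmer counting table (NOT the real API)."""

    def __init__(self, ksize):
        self.ksize = ksize
        self.counts = Counter()

    def kmers(self, seq):
        # All overlapping k-mers of seq, in order.
        return [seq[i:i + self.ksize] for i in range(len(seq) - self.ksize + 1)]

    def consume(self, seq):
        self.counts.update(self.kmers(seq))

    def get(self, kmer):
        return self.counts[kmer]


def filter_abund(table, reads, cutoff=2, output=None):
    """Trim each (name, seq) read at its first k-mer seen fewer than `cutoff` times.

    Returns a list of (name, trimmed_seq); reads trimmed shorter than k are dropped.
    If `output` is a writable file object, FASTA records are also written to it.
    """
    results = []
    for name, seq in reads:
        trimmed = seq
        for i, kmer in enumerate(table.kmers(seq)):
            if table.get(kmer) < cutoff:
                # Keep everything before the last base of the first low-count k-mer.
                trimmed = seq[:i + table.ksize - 1]
                break
        if len(trimmed) >= table.ksize:
            results.append((name, trimmed))
            if output is not None:
                output.write(">{}\n{}\n".format(name, trimmed))
    return results
```

With something like this, the script's main() would only parse arguments, open the output file, and call the function, while external code could call it with output=None and work with the returned list.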

Alex

ctb commented 9 years ago

Hey @alexjironkin, thanks for coming by!

We're definitely moving towards that kind of API, aided by our general interest in streaming (see #393 in particular). The oxli API should look something like that (khmer 4.0, basically - see http://khmer.readthedocs.org/en/latest/roadmap.html).

A few additional thoughts --

Hmm, but this gets rather complicated. So maybe chunk it out:

  1. turn as many of our scripts into functions as possible.
  2. build read input and read output objects (@camillescott is doing some of this)
  3. make the functions composable
  4. start tracking metadata and filtering details for input sources as reads pass through
  5. allow specification of constraints on processing
  6. start building an optimizer that can build the right pipeline for a specific type of processing

Obviously this is pretty long term, but the first 4 are pretty do-able.
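As a rough illustration of steps 2 to 4, here is a small sketch of composable steps over a read stream that also records which steps have run. Everything in it is hypothetical: ReadStream, min_length, drop_ambiguous and compose are names made up for this example and are not part of khmer or oxli.

```python
from functools import reduce


class ReadStream:
    """Hypothetical read-input object: yields (name, seq) pairs and
    remembers which processing steps have been applied to it."""

    def __init__(self, reads, history=()):
        self.reads = reads
        self.history = tuple(history)

    def __iter__(self):
        return iter(self.reads)

    def then(self, step_name, reads):
        # Wrap the filtered reads in a new stream with an extended history.
        return ReadStream(reads, self.history + (step_name,))


def min_length(n):
    """Build a step that keeps only reads with at least n bases."""
    def step(stream):
        kept = (r for r in stream if len(r[1]) >= n)
        return stream.then("min_length({})".format(n), kept)
    return step


def drop_ambiguous(stream):
    """Step that drops reads containing an N."""
    kept = (r for r in stream if "N" not in r[1])
    return stream.then("drop_ambiguous", kept)


def compose(*steps):
    """Chain steps left to right into a single stream -> stream function."""
    return lambda stream: reduce(lambda s, step: step(s), steps, stream)


pipeline = compose(drop_ambiguous, min_length(5))
result = pipeline(ReadStream([("a", "ACGTACGT"), ("b", "ACNT"), ("c", "ACG")]))
```

Because each step takes a stream and returns a stream, the reads are filtered lazily as they are consumed, and result.history lists the steps in the order they were applied.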

kdm9 commented 9 years ago

While we're in a long term mood, it would be fantastic if we could build the khmer/oxli core C++ library as an independent library, and package it as such. This would allow novel projects, especially those tangential to the goals of khmer, to link to a great Count-Min sketch backed kmer counting library, including all the associated functionality that is in khmer's C++ core.

Basically I'm thinking of a khmer/oxli version of what jellyfish exposed with libjellyfish-dev and friends.

Then I could do

g++ -o cool_stuff cool_stuff.cc -loxli

without ever needing to interfere with the development of oxli/khmer, and without needing to go through python (which, when we start doing some of the things we're trying to do, would be helpful).

From what I understand, it "should" be as simple as getting:

cd lib
make

to give us liboxli.so, but it may be more complex. Thoughts?

If this is something you feel is valuable, I'm happy to set up a CMake-based config that allows one to build the core library in a portable way, independent from (and without breaking) python setup.py x.

Cheers, Kevin

alexjironkin commented 9 years ago

+1 for a C++ library with Python bindings to it. Essentially, all computation is done in C/C++ and Python exposes it through an API. That said, it's more involved than just exposing the current functionality through an API.

A thought on chaining: Depending on the inputs for each function, you can define a list of steps, retrieve function defs with the same name (from API ;) ) into a list, then give this list together with the original input data to each function in order. I think that would allow chaining of arbitrary functions in Python, as defined by user?!
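A minimal sketch of that chaining idea, assuming each step takes the previous step's output as its input. The api namespace below is a stand-in for a future khmer.api module; neither that module nor the two step functions exist in khmer, they are made up for this example.

```python
import types

# Hypothetical stand-in for a future `khmer.api` module (does not exist).
api = types.SimpleNamespace(
    uppercase=lambda reads: [(name, seq.upper()) for name, seq in reads],
    drop_short=lambda reads, min_len=4: [(n, s) for n, s in reads if len(s) >= min_len],
)


def run_chain(namespace, step_names, data):
    """Look up each step by name, then feed each step's output to the next."""
    # getattr raises AttributeError for a misspelled or missing step name,
    # before any step has run.
    steps = [getattr(namespace, name) for name in step_names]
    for step in steps:
        data = step(data)
    return data


chained = run_chain(api, ["uppercase", "drop_short"], [("a", "acgtac"), ("b", "ac")])
```

The user-defined part is just the list of names, so the same run_chain could drive any ordering of steps as long as each step accepts what the previous one returns.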

kdm9 commented 9 years ago

@ctb @mr-c is anyone actively working on function-ising the load-into-counting.py script? If not, would you like me to have a crack at it during this sprint #751?

And/or possibly a generic (probably class-based) approach to #676 (metadata logging)?

Cheers, Kevin

mr-c commented 9 years ago

@kdmurray91 Not specifically. @bocajnotnef is trying to functionize a couple scripts over in #690

So, sure, it'll be interesting to see what you come up with!

mr-c commented 9 years ago

@kdmurray91 re: C++ library: sure, but with no versioning guarantees. We'd rather see lib/Makefile updated than introduce a dependency on CMake.

kdm9 commented 9 years ago

RE ./lib, I agree that CMake would be a bit heavy for what is involved.

In fact, I've got a branch somewhere where I've added a libkhmer.so (and .a) target and cleaned up the Makefile a bit.

Will draft a PR now.

mr-c commented 9 years ago

Great, thanks!

mr-c commented 9 years ago

Re: C++ library. Now that #788 is merged I am happy to accept PRs that add Doxygen comments and/or move code out of the header files.