GATB / gatb-core

Core library of the Genome Analysis Toolbox with de-Bruijn graph
https://gatb.inria.fr/software/gatb-core/
63 stars 26 forks source link

Python bindings ? #12

Open Piezoid opened 7 years ago

Piezoid commented 7 years ago

Is there someone working on this ?

The only reference I found is on the GATB Global Architecture page :

Wrappers for other languages will be available in the future

I needed some basic scripting capabilities for exploring GATB's graphs so I wrote a simple Cython wrapper around gatb::core::debruijn::impl::Graph. Here is a demo. Code will be available soon.

Suggestions or questions are welcomed.

rchikhi commented 7 years ago

very nice!! noone else was working on it, yet we were often mentioning the possibility during our meetings. Thanks for creating this wrapper. Please commit it to gatb-core along with the demo.

Piezoid commented 7 years ago

I did this for the needs of my internship, mentored by Pierre Peterlongo. The wrapper covers a minimal part of Graph, supporting only few read-only operations. My usage showed that this is enough for graph inspection and prototyping extension algorithms.

Whith time, the interface gets more dogfooding and I'll probably release it next week. @pgdurand wants to talk about it before. I've not yet decided if I should release in a standalone repository or within gatb-core. I use this cmake module for compiling the Cython module. Some remarks about integration in gatb-core :

Anyway, the scope of the python module should be redefined for other uses. Here a short list of what might be wanted :

I won't have the time to implement those, but the scope of the interface might define how the wrapper is integrated in the project and it's build process.

Piezoid commented 7 years ago

I've just published the wrapper : here is the repository.

rchikhi commented 7 years ago

do you mean header-only library?

Piezoid commented 7 years ago

Yes. I must say, I don't quite like the fact that it's not possible to opt-out the visitor based interface in GraphTemplate. I would like to visit the GraphDataVariant once and run my (specialized template) algorithms from there. Instead the graph API "pattern match" the variant for each nucleotide traversed. Callgrind show that's about 1/3 of the cycles are spent visiting variants and copying GraphVectors.

To my understanding boost::variants are tagged union types. Consequently, sizeof(Integer) is the size of the largest kmer + a tag. The API could hand LargeInt<MAX_SIZE> to the user for the same effect.

Maybe I'm missing something ? The GraphTemplate<Node<Integer>,...> is said to be legacy API. But what are the alternatives ? It seems I can't find any gatb project using it. Projects like MindTheGap use span-specialized templates but still use the GraphTemplate<Node<Integer>,...> interface.

The Graph.cpp contains lots of duplicated code. It could leverage more of kmer::model iterators such that the code would be small enough to be header-only. This would enable RVO/Copy-elision for most cases.

GraphVector could be replaced with an iterator interface. Specialized iterators/cursors could be made for some applications : for example, when drawing simple paths, the iterator could remember the previous node and spare one contains() call when counting predecessors.

Sorry I'm a bit demanding on zero-cost abstractions since I learnt Rust :)

rizkg commented 7 years ago

zero-cost abstractions would be great :) We should discuss together your ideas on how to replace boost::variant

rchikhi commented 7 years ago

thanks for the extensive response.

Yes, you were missing something regarding the GraphTemplate instantiated with variants. The alternative is instantiation without the variant type, but instead with a single kmer size, i.e. directly passing a single templated integer to the GraphUnitigsTemplate class. The two projects that implement this alternative are bcalm: https://github.com/GATB/bcalm/blob/master/src/bcalm_1.cpp#L52 minia: https://github.com/GATB/minia/blob/master/src/Minia.cpp#L116 For your project, you can use the NodeFast/EdgeFast/GraphDataVariantFast alternative, as in https://github.com/GATB/minia/blob/master/src/Minia.cpp#L115. Indeed, for performance, it doesn't make sense to use boost::variant's anymore. These were introduced initially for convenience (and also because we didn't know it had performance drawbacks).

rchikhi commented 7 years ago

We'd be happy to have a header-only library, and in fact, refactor gatb-core. Would you be interested in being involved?

Piezoid commented 7 years ago

Thanks for the answer. I didn't know that GraphUnitigs had a better interface. The trio NodeFast/EdgeFast/GraphDataVariantFast still uses variants, right ? Maybe single alternative variants get optimized away ?

We'd be happy to have a header-only library, and in fact, refactor gatb-core. Would you be interested in being involved? Yes I'd like to, but I don't think I will have time to do so during my internship. As always, tons of ideas but not enough time to implement them...

I'm thinking about a cursor-like API. For example a base abstract class can be derived for each span-specialization. The key point is avoiding handing kmer back and forth to user code. Since kmers are of variable size, it complexifies the API. Kmers are often treated as black boxes in client code. Instead, users should get nucleotides, simple-paths, unitigs and be able to navigate in the graph.

Maybe a more efficient alternative would be specializing user template code for each span with a visitor pattern (but the whole user's algorithm is run inside one visit).

About the cursor API, it could allow some optimizations:

Ideally the user should be allowed to compose templates for choosing features (like the STL algorithms).

On a smaller scale, I rewrote GraphVector taking advantage of the recent addition of std=c++11. The ensued changes in GraphTemplate is still a bit messy...

rchikhi commented 7 years ago

GraphUnitigs has the same interface as GraphTemplate, actually.

NodeFast/EdgeFast/GraphDataVariantFast don't use boost::variants, they are just a regular data type. The trick is that the program needs to instantiate them based on the k-mer size given at command-line parameter. Thus, instead of checking that k-mer size each time a graph function is called, it is only checked once, in the main(). A typical program doesn't use variable k-mer sizes that differ by more than +-1. Thus it is okay to instantiate once and for all a graph template that supports one desired maximal k-mer size. This is a helpful constraint that would simplify the design of the library.

I don't know what cursor-like APIs are, would you have any good pointer to some documentation?

By the way nice job with the GraphVector rewrite! I'd be happy to take a pull request of all your changes at some point.

Piezoid commented 7 years ago

You're right NodeFast and EdgeFast wrap plain LargeInt. But GraphTemplate<NodeFast,EdgeFast,GraphDataVariantFast> still use the visitor pattern and in src/gatb/debruijn/impl/Graph.hpp we have:

using GraphDataVariantFast = boost::variant<GraphData >; Of course branching in the visitor should be optimized by the compiler.

Say I have a algorithm :

template<size_t span> class MyAlgo<span> {
public: void operator() (GraphUnitig<span>&);
};

How can i run it on an h5 input file that can be configured for any supported kmer size ? I need some sort of visitor pattern to specialize my code for each possible span, right ?

What I mean by cursor-like is an iterator really. Only there might be multiples choices of successors when calling next(). The general idea is small composable components, enclosing a minimalist mutable state that represent a position in a bigger data structure.

rchikhi commented 7 years ago

Ah right yes, I guess in order to keep compatibility with the rest of the code in Graph.cpp, GraphDataVariantFast had to be kept as a boost::variant.

Regarding your toy example,yes it seems that the only way to run this code is to have it compiled for as many span's values as needed, and the relevant one will be selectd at runtime given a user-specified k-mer size. I.e. as it is right now in GATB.

Sorry, I still didn't understand the iterator/cursor-like design. What successors are you talking about?