broadinstitute / pyfrost

Python bindings for Bifrost with a NetworkX compatible API
BSD 3-Clause "New" or "Revised" License
27 stars 1 forks source link

Networkx API design #1

Closed winni2k closed 3 years ago

winni2k commented 4 years ago

Hi! I'm opening this issue to discuss the networkx-compatible API design.

I've done something like this for Cortex and McCortex files over at https://github.com/winni2k/cortexpy, and I am interested in which approach you were hoping to take. So far I see two options: 1. load bifrost data into RAM and turn it into a dict-of-dict-of-dicts to be wrapped by nx.Graph, or 2. write a nx.Graph or nx.DiGraph duck type that loads data from bifrost on disk on demand.

  1. is probably faster to implement, but very RAM intensive. 2. is a bit trickier but probably the right thing to do? I had a go at 2. in cortexpy but I never quite got it right. It's unfortunate that there isn't an abstract base class for nx.Graph and its compatriots.
lrvdijk commented 4 years ago

I'm currently working on option 2. See tests/test_graph.py for the currently implemented API. I think it should be doable to get it in a state such that most NetworkX algorithms work. For now, I'm focusing on colored compacted De Bruijn graphs only, and ignore the non-color variant.

Things I still need to think about:

winni2k commented 4 years ago

Thanks for getting back to me!

My understanding of networkx is that it really wants both succ and pred methods to be available. I notice that pred does not seem to be tested in https://github.com/broadinstitute/pyfrost/blob/master/tests/test_graph.py. Is pred non-trivial to implement from a bifrost data structure?

lrvdijk commented 4 years ago

Ha no, pred is working for the most part, but one of my tests was failing, and it seems that my understanding of the data structure is still incomplete (see https://github.com/pmelsted/bifrost/issues/17), so I have some refactoring to do :)

As mentioned, this project is in a very very early stage, and still in active development.

winni2k commented 4 years ago

Did you manage to resolve the issue with pred, @lrvdijk?

I was going to try and apply some networkx algorithms such as all_simple_paths to pyfrost with an eye to creating a tutorial section on graph travelsal. I think I can borrow many tests for graph traversal from cortexpy. Should I write them up and make a PR? Do I understand correctly that you would expect algorithms to work out of the box if both succ and pred are implemented?

winni2k commented 4 years ago

A side question: Do you think that networkx.subgraph_view would work with a pyfrost graph?

lrvdijk commented 4 years ago

Yes pred works, as do many graph functions like neighbors(), predecessors(), successors() etc.

Here in the tests folder I use networkx's breadth first search and depth first search on the graph: https://github.com/broadinstitute/pyfrost/blob/master/tests/test_dfs_bfs.py

I haven't extensively tested other networkx algorithms yet, the coming weeks I will use this library in my own research projects and see what missing features are the most important, and prioritize based on that what I will develop next. But in general, I believe most NetworkX algorithms are written in such a way that is pretty agnostic to the underlying data structure, as long as the algorithms can obtain node metadata and its neighbors.

The builtin subgraph_view is not going to work with the current pyfrost implementation, as it assigns the private _succ and _pred dicts of a normal networkx.DiGraph to filtered dicts, which doesn't work for pyfrost because it doesn't have dicts as base data structure. I'll have to come up with a custom implementation, which is definitely high on my list, but need to flesh it out a bit more.

dschult commented 4 years ago

If you are using a dict-like structure for an adjacency_list approach to storing your graph data, then check out the "dict_factory" features of NetworkX's Graph and DiGraph classes. You should be able to "just" drop in your backend in place of the NetworkX Python dict-of-dict-of-dict and have all algorithms work.

If your graph structure is not adjacency oriented, then you'll have more work to make a NetworkX interface. :}

winni2k commented 4 years ago

Thanks @dschult!

lrvdijk commented 3 years ago

I've now landed a refactor of the code that supports subgraph_view. I emulate the _succ, _pred and _node dicts in C++, but other than that it's just a proper NetworkX DiGraph. See the latest beta release!