I wrote a small script to test the memory performance of `master` against this branch:
```python
import numpy as np
import networkx as nx
from kda import calculations, diagrams, graph_utils
import time


# `@profile` is provided by memory_profiler (e.g. when the script is run
# with `python -m memory_profiler`)
@profile
def main(n_states):
    G = nx.MultiDiGraph()
    K = 10 * (np.random.random(size=(n_states, n_states)) - 0.5)
    K = 10 ** K
    K = np.reshape(K, newshape=(n_states, n_states))
    np.fill_diagonal(K, val=0)
    graph_utils.generate_edges(G, K, names=None)
    start = time.time()
    # for `master`
    # dirpar_edges = diagrams.generate_directional_partial_diagrams(G)
    # for `nawtrey_update_diagram_generation`
    dirpar_edges = diagrams.generate_directional_partial_diagrams(G, return_edges=True)
    kda_probs = calculations.calc_state_probs_from_diags(G, dirpar_edges, key="val")
    end = time.time()
    elapsed = end - start
    print(f"Elapsed time: {elapsed}")


if __name__ == "__main__":
    main(n_states=7)
```
I also added the `memory_profiler` `@profile` decorator to `diagrams.generate_partial_diagrams()`, `diagrams.generate_directional_partial_diagrams()`, and `calculations.calc_state_probs_from_diags()`.
The `memory_profiler` results running on `master`:

The `memory_profiler` results running on `nawtrey_update_diagram_generation`:
This was run for a maximally-connected 7-state diagram, which contains 21 unique connections. To calculate the state probabilities we have to generate 16,807 partial diagrams and 117,649 directional partial diagrams, which is a sufficient number to meaningfully compare the changes here.
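As a quick sanity check on those counts (this is the standard spanning-tree counting argument, not a kda API call):

```python
from math import comb

# Sanity check on the counts quoted above: a maximally-connected 7-state
# model is the complete graph K_7, so it has C(7, 2) unique connections,
# 7**(7 - 2) spanning trees (partial diagrams) by Cayley's formula, and
# one directional partial diagram per spanning tree per target state.
n_states = 7
print(comb(n_states, 2))             # 21 unique connections
print(n_states ** (n_states - 2))    # 16807 partial diagrams
print(n_states ** (n_states - 1))    # 117649 directional partial diagrams
```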
There are two primary takeaways: 1) the peak memory usage dropped from ~1500 MiB to ~350 MiB, to roughly 23% of its original value, and 2) the run time dropped from 95.7 s to 76.7 s, to roughly 80% of its original value.
I was able to get even better performance, primarily through some changes to `diagrams._get_directional_connections()` and by removing a bottleneck in `diagrams.generate_partial_diagrams()` that was collecting all simple cycles in each partial diagram, counting them, and checking that the count was correct. That check was replaced with `networkx.is_tree()`, which produces identical results without all of the overhead. I also set `generate_partial_diagrams()` to return a NumPy array, which results in a noticeable drop in memory consumption when the function completes. Altogether this results in the same code running to completion in ~34% of the time and using ~7% of the memory.
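For context, the replacement check can be sketched roughly like this (an illustrative snippet, not the exact kda code; the helper name is mine):

```python
import networkx as nx

# Rough sketch of the tree check described above. A valid partial diagram
# of an N-state model keeps all N states connected with exactly N - 1
# edges and no cycles, which is exactly the definition of a tree, so a
# single `networkx.is_tree()` call replaces enumerating and counting all
# simple cycles in each candidate subgraph.
def looks_like_partial_diagram(subgraph, n_states):
    return subgraph.number_of_nodes() == n_states and nx.is_tree(subgraph)
```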
The current bottleneck for `diagrams.generate_directional_partial_diagrams()` is iterating over every edge in every diagram. It would be nice to just collect the rates from each edge in each diagram all at once, but everything I've seen indicates this is not currently possible: there isn't a way to feed a NetworkX function a list of edges and a diagram and get back the rates for those edges. That would likely be easier to do with the rate matrix, since the edges are just indices into it, but that would require another large refactor since the rate matrix isn't one of the parameters.

The current version runs `generate_directional_partial_diagrams()` in 30.17 seconds versus 82.71 seconds on `master`.
In response to my comment above, I was able to get around this bottleneck by reconstructing the rate matrix from the diagram `G`, then using the edge list to get the indices for each value I need for each diagram, and taking their product. This operates much faster, so for users who just want their state probabilities calculated in a hurry, this should be pretty beneficial.
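A rough sketch of that idea (my own illustration, not the exact kda implementation; it assumes states are integer node labels `0..N-1`, that edge rates are stored under the `"val"` attribute as in the script above, and that each directional diagram is stored as an array of `(i, j)` edge tuples):

```python
import numpy as np

# Illustrative sketch of the vectorized product described above: rebuild
# the rate matrix once from the graph's edge attributes, then use the
# edge tuples of every directional partial diagram as indices into that
# matrix so the per-diagram products are computed in a single numpy call.
def diagram_products(G, dirpar_edges, key="val"):
    n = G.number_of_nodes()
    K = np.zeros((n, n))
    for i, j, attrs in G.edges(data=True):
        K[i, j] = attrs[key]

    # `dirpar_edges` is assumed to have shape (n_diagrams, n_edges, 2),
    # holding the (source, target) node indices of each diagram's edges.
    dirpar_edges = np.asarray(dirpar_edges)
    rates = K[dirpar_edges[..., 0], dirpar_edges[..., 1]]
    return rates.prod(axis=-1)
```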
I've been looking at these same changes for weeks, have added several tests, and have improved the original tests substantially. I have also run many cases with `verification.py` (probably 500+ diagrams) and none of them have shown me any issues, so it seems that this code is still operating correctly with significant performance improvements.
Of course, I would love to add more tests, but for the time being I think the majority of the changes here are covered quite well between the test suite and `verification.py`.
All tests are passing, merging.
- Updates `diagrams.generate_partial_diagrams()` to only store edges if `return_edges=True`
- Updates `diagrams.generate_directional_partial_diagrams()` to only store edges if `return_edges=True`
- Changes various `kda.calculations` functions to use edges to lower the memory footprint of those functions
- Optimizes code by replacing many Python for-loops with NumPy array operations for better performance
- Addresses the edge tuple portion of issue #2
Status