LLNL / hatchet

Graph-indexed Pandas DataFrames for analyzing hierarchical performance data
https://llnl-hatchet.readthedocs.io
MIT License

Design of subgraph_sum and subtree_sum leads to very suboptimal performance #145

Open ilumsden opened 3 months ago

ilumsden commented 3 months ago

Rule number 1 of any dataframe library is "don't do operations by iterating over rows." However, this is exactly what we do in subgraph_sum and subtree_sum. We need to refactor this to use a better mechanism (e.g., DataFrame.apply).

To get a sense of the performance impact, I can anecdotally say that subgraph_sum is 3-4x slower than the query language, even though the query language is solving a version of subgraph isomorphism, an NP-hard problem.
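To illustrate the kind of pattern I mean, here is a minimal sketch on a toy tree. It is not Hatchet's actual code: the `children` dict, the `time` column, and the function names `subtree_sum_rowwise` and `subtree_sum_bulk` are all made up for illustration. The first function does one scalar `.loc` lookup per (node, descendant) pair. The second reads the column out of the DataFrame once, accumulates totals in a single post-order pass, and builds the result in one step. Note that the bulk version is swapped in here as one possible alternative, not as a proposal for the final design.

```python
import pandas as pd

# Hypothetical toy tree: parent -> children (not Hatchet's Graph API).
children = {
    "main": ["solve", "io"],
    "solve": ["mpi", "kernel"],
    "io": [],
    "mpi": [],
    "kernel": [],
}

# Exclusive time per node, indexed by node name (made-up values).
df = pd.DataFrame(
    {"time": [1.0, 2.0, 0.5, 3.0, 4.0]},
    index=["main", "solve", "io", "mpi", "kernel"],
)


def descendants(node):
    """Return the node followed by all of its descendants (pre-order)."""
    out = [node]
    for child in children[node]:
        out.extend(descendants(child))
    return out


def subtree_sum_rowwise(df, column):
    """Slow pattern: one scalar DataFrame lookup per (node, descendant) pair."""
    result = {}
    for node in df.index:
        total = 0.0
        for d in descendants(node):
            total += df.loc[d, column]
        result[node] = total
    return pd.Series(result, name=column + " (inc)")


def subtree_sum_bulk(df, column, root):
    """Read the column once, accumulate in post-order, build the result once."""
    values = df[column].to_dict()  # single bulk read from the DataFrame
    totals = {}

    def visit(node):
        total = values[node]
        for child in children[node]:
            total += visit(child)
        totals[node] = total
        return total

    visit(root)
    # Reorder to match the DataFrame's index.
    return pd.Series(totals, name=column + " (inc)").reindex(df.index)


slow = subtree_sum_rowwise(df, "time")
fast = subtree_sum_bulk(df, "time", "main")
print(pd.DataFrame({"rowwise": slow, "bulk": fast}))
```

The single-pass accumulation is only valid for trees, where each node has exactly one parent. For a general graph (the subgraph_sum case), shared descendants would have to be counted once per subgraph, so that case needs a different approach.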