cnellington / Contextualized

An SKLearn-style toolbox for estimating and analyzing models, distributions, and functions with context-specific parameters.
http://contextualized.ml/
GNU General Public License v3.0
64 stars 7 forks source link

Project-to-DAG runtime #205

Open cnellington opened 1 year ago

cnellington commented 1 year ago

Currently we do DAG projection by binary search O(logn) over sorted weights O(nlogn), checking if the graph where weights > threshold is a DAG via igraph is_dag O(n*n). The igraph implementation gets a spanning tree to check for dagness. This is prohibitively slow for large DAGs.

Can we speed this up? Maybe by using spanning trees to construct a new graph.

cnellington commented 1 year ago

Toy example to greedily build a topological order by maximizing the sum of outgoing edge weight magnitudes at each level. Projects 1e4 random fully-connected graphs with 50 nodes each in about 1 min.

import numpy as np
n = 10
mat = np.random.normal(0, 1, (n, n))
print(np.sum(np.triu(np.abs(mat))))
nodes_to_order = set(range(n))
order = []
for i in range(n):
    max_score = 0
    max_node = -1
    for v in nodes_to_order:
        score = np.sum(np.abs(mat[v, i:]))
        if score > max_score:
            max_score = score
            max_node = v 
    nodes_to_order.remove(max_node)
    order.append(max_node)
print(order)
print(np.sum(np.triu(np.abs(mat[order]))))
cnellington commented 1 year ago

While this maximizes the sum of child edges as each level, this algorithm would also place a root node with outdegree 1 last in the topological order when it would be optimal to put this node first. Instead we could greedily select root-like nodes that have the smallest sum of incoming edges, since we know there would be another node that would add more to the total edge weight sum as a child of other nodes.

import numpy as np
n = 10
mat = np.random.normal(0, 1, (n, n))
nodes_to_order = set(range(n))
order = []
for i in range(n):
    min_score = 0
    min_node = -1
    for v in nodes_to_order:
        score = np.sum(np.abs(mat[i:, v]))  # reversed to look at incoming edge sum, how much score do we lose by selecting v
        if score < min_score:
            min_score = score
            min_node = v 
    nodes_to_order.remove(min_node)
    order.append(min_node)
print(np.sum(np.triu(np.abs(mat))))
print(order)
print(np.sum(np.triu(np.abs(mat[order]))))
szhorvat commented 1 year ago

I came across this by accident, and I'm likely misunderstanding what you are saying ...

The igraph implementation gets a spanning tree to check for dagness. This is prohibitively slow for large DAGs.

But I wanted to point out that the linear-time algorithm igraph uses to test if a graph is a DAG does not involve the construction of any spanning trees

cnellington commented 1 year ago

Thanks for clarifying @szhorvat!

blengerich commented 1 year ago

Thanks @szhorvat; btw how did you find this project?

szhorvat commented 1 year ago

Can you clarify if there is an issue with the performance of igraph's DAG testing?

The current implementation is based on removing source (i.e. zero in-degree) vertices until none are left. If the entire graph was removed, then it was a DAG.

An alternative is performing graph traversal (DFS or BFS) to look for cycles. This is not implemented at the moment.

Both are linear-time in the worst case, i.e. when the graph is acyclic, they will both explore all edges. It's unclear which one would be faster in practice. It might depend on the structure of the graph (does it have cycles?). More likely, the performance difference depends on hardware-related non-algorithmic factors, so this needs benchmarking.

At some point I might implement a DFS-based function that doesn't just check if the graph is acyclic, but provides proof, i.e. returns a cycle if there is one.

We always welcome feedback about how people use igraph and what their needs are. Feel free to contact us on the forum https://igraph.discourse.group/

cnellington commented 1 year ago

Based on what you've said, it sounds like the issue is not igraph's runtime but I'll continue to update the issue this week.