cnellington / Contextualized

An SKLearn-style toolbox for estimating and analyzing models, distributions, and functions with context-specific parameters.
http://contextualized.ml/
GNU General Public License v3.0

new project_to_dag algorithm #206

Open blengerich opened 1 year ago

blengerich commented 1 year ago

Prompted by #205, here is a simplified project_to_dag algorithm. Instead of searching for an edge-weight threshold to filter out edges, it does a topological sort of the nodes and then returns only the forward edges. The topological sort is just a simple ordering by sum of outgoing edge-weight magnitudes (more and stronger edges => higher priority to keep that node's edges).

A quick test of performance is included in the Jupyter notebook. The topological sort function could (possibly should?) be improved, but the sort-based projection appears to get similar performance to the search-based algorithm:
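The sort-based projection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; the convention `W[i, j]` = weight of edge `i -> j` is an assumption:

```python
import numpy as np

def project_to_dag_sorted(W):
    """Project a weighted adjacency matrix onto a DAG by ranking nodes
    by total outgoing edge magnitude and keeping only forward edges."""
    n = W.shape[0]
    # Score each node by the sum of its outgoing |edge weights|.
    scores = np.abs(W).sum(axis=1)
    # Higher score => earlier in the node ordering.
    order = np.argsort(-scores)
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)
    # Keep edge i -> j only if node i precedes node j in the ordering.
    mask = rank[:, None] < rank[None, :]
    return W * mask

# X -5-> Y, X -3-> Z, Y -2-> Z: the edges already agree with the
# magnitude ordering, so every edge survives the projection.
W = np.array([[0.0, 5.0, 3.0],
              [0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
projected = project_to_dag_sorted(W)
```

Because the ordering is fixed up front, the projection is a single vectorized masking step rather than a search, which is where the runtime savings come from.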

[image: accuracy comparison of the sort-based vs. search-based projection]

with much reduced runtime:

[image: runtime comparison of the sort-based vs. search-based projection]

review-notebook-app[bot] commented 1 year ago

Check out this pull request on  ReviewNB


cnellington commented 1 year ago

The speedup is definitely worth the cost here, but there are a few edge cases to consider. Here's one failure case for the topological sort.

```
X -2-> Y
Y -4-> Z
```

Here the graph is already a DAG, but a topological sort by outgoing edge magnitudes ranks Y above X and removes the first edge. Both algorithms often ignore root nodes with few/small outgoing edges. Just to emphasize that neither is perfect, both the sort-based and search-based algorithms fail here:

```
X -2-> Y
Y -4-> Z
Z -3-> Y
```
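Both failure modes above can be checked numerically with a quick sketch of the magnitude-based ordering. The helpers below are hypothetical, not the library's code, and assume `W[i, j]` = weight of edge `i -> j`:

```python
import numpy as np

def keep_forward_edges(W):
    """Rank nodes by total outgoing |edge weight| (higher = earlier),
    then keep only edges that point forward in that ordering."""
    n = W.shape[0]
    order = np.argsort(-np.abs(W).sum(axis=1))
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)
    return W * (rank[:, None] < rank[None, :])

# Case 1: X -2-> Y, Y -4-> Z is already a DAG, but Y outranks X
# (4 > 2), so the X -> Y edge is dropped anyway.
W1 = np.array([[0.0, 2.0, 0.0],
               [0.0, 0.0, 4.0],
               [0.0, 0.0, 0.0]])

# Case 2: add Z -3-> Y, creating a Y/Z cycle; the root node X
# still loses its only outgoing edge.
W2 = np.array([[0.0, 2.0, 0.0],
               [0.0, 0.0, 4.0],
               [0.0, 3.0, 0.0]])

print(keep_forward_edges(W1))  # X -> Y edge (weight 2) is gone
print(keep_forward_edges(W2))  # only Y -> Z survives
```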

I wonder if it's even possible to come up with an edge-removal function $f$ such that for any $G \in \mathbb{R}^{n \times n}$ we solve $\max_f \sum_{i,j} |f(G)_{i,j}|$ s.t. $f(G)$ is a DAG?

cnellington commented 1 year ago

Re: https://github.com/cnellington/Contextualized/issues/205#issuecomment-1636695133 Is it possible that the binary search implementation is causing the runtime issues? Python is very slow at recursion, so the bottleneck may not be the is_dag check at all; binary search itself shouldn't incur serious runtime cost. It might be worthwhile reimplementing the binary search as a while loop to see if we can get the search-based algorithm to run quickly.
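A non-recursive version of the threshold search might look like the sketch below. Names and details are assumptions, not the repo's actual code; in particular, `is_dag` here uses Kahn-style root removal rather than whatever acyclicity check the library implements:

```python
import numpy as np

def is_dag(W, tol=1e-8):
    """Kahn's algorithm: a graph is a DAG iff we can repeatedly
    remove nodes with no incoming edges until none remain."""
    A = (np.abs(W) > tol).astype(float)
    remaining = list(range(A.shape[0]))
    while remaining:
        sub = A[np.ix_(remaining, remaining)]
        roots = [remaining[i] for i in range(len(remaining))
                 if sub[:, i].sum() == 0]
        if not roots:
            return False  # every remaining node has an incoming edge
        remaining = [v for v in remaining if v not in roots]
    return True

def project_to_dag_search(W, steps=30):
    """Iterative (loop-based, not recursive) binary search for the
    smallest magnitude threshold that makes W acyclic."""
    lo, hi = 0.0, float(np.abs(W).max())
    for _ in range(steps):
        mid = (lo + hi) / 2
        if is_dag(W * (np.abs(W) > mid)):
            hi = mid  # threshold works; try a smaller one
        else:
            lo = mid  # still cyclic; raise the threshold
    return W * (np.abs(W) > hi)
```

Since the empty graph is trivially a DAG, `hi` always holds a valid threshold, so the loop converges on the smallest one without any recursion depth concerns.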