Open Wilfred opened 2 years ago
maybe some interesting reading/ideas: https://github.com/anvaka/ngraph.path
Tested with https://docs.rs/priority-queue/latest/priority_queue/priority_queue/struct.PriorityQueue.html but it was a net perf loss, even with FxHash.
An explicit visited set is a perf loss too, even with FxHash and the same size hint.
Another option would be an 'iterative Dijkstra'. Search N nodes, and if we've not reached the end, take the closest node (according to an A* heuristic) and restart the search from there. This would let us keep a bound on the size of predecessors
.
HashMap::get() is really hot too. I wonder if there's a way of tracking vertex frontier such that we can tell that the vertex has never been visited before (e.g. highest LHS/RHS node ID visited so far).
I'm curious whether you've investigated RWS-Diff and whether any of the ideas may apply.
While the paper describes a complete algorithm for generating an edit script from two trees, the fundamental insight (in my opinion) is the use of K-means clustering and nearest neighbor search to approximate subtree similarity. This could be used, for instance, to chunk LHS/RHS into smaller subtree pairs that can be searched separately and then combined.
Thanks for the reference! I've not seen that paper, but it looks very relevant.
before, I've only seen shingling applied to strings and a variation of LSH. It didn't occur to me that one could "shingle" trees to find similarities.
Same here! I found another paper titled Metric Learning for Ordered Labeled Trees with pq-grams that goes a bit deeper into how pq-grams are constructed.
Bidirectional dijkstra might help, but it's really awkward to implement.
It'd be good to explore an explicit visited set for Dijkstra, alongside a priority queue that supports changing priorities. Currently I'm just inserting duplicates in the heap.
It'd also be good to revisit A* search to see if we can get a speedup there.