bblfsh / libuast

Apache License 2.0
20 stars 15 forks source link

Feature request: node hashes #78

Open vmarkovtsev opened 6 years ago

vmarkovtsev commented 6 years ago

I have already seen several times that it is often required to quickly compare two sets of UAST nodes. The naive approach is to compare each to each which is O(n^2). If we had a standard way to hash UAST nodes, we would solve this problem in O(n).

There are two ways of hashing nodes: a single tiny change rewrites the hash (1) and locality sensitive hashing (2). (1) is easy to implement by hashing the serialization of a node. (2) requires some research, and I've got a sufficiently working implementation in https://github.com/src-d/treediff/blob/master/treediff.py#L33 which recursively uses hashes of the sorted children.

juanjux commented 6 years ago

You could also use the node positions in the tree relative to the root as "hash" (key, really). We'll almost surely be adding hashes or keys as part of the "UAST 2" since they'll be needed to implement some of the approved features.

vmarkovtsev commented 6 years ago

Node positions are not influenced by the token values, are they.

juanjux commented 6 years ago

Sorry, I wasn't clear, I didn't mean the token positions in the source code but the node positions in the tree, take a look at the ascii "art" at the start of this (superceed) draft:

https://gist.github.com/juanjux/5c905c8bfbf2091b6da43eb76dd822e9

dennwc commented 6 years ago

@vmarkovtsev Do you need the hashes specifically or (a hash-based) Equal operation will work for you as well?

vmarkovtsev commented 6 years ago

Yep tree positions matter.

I need the explicit hashes, Equal does not help with going from O(n^2) to O(n)

dennwc commented 6 years ago

I assumed that this method can compare node sets and hashes are cached by libuast automatically.

But yeah, explicit hashes are easier to implement.