Feature request: UAST diffing

r0mainK commented 5 years ago

Hey guys !

So for starters just wanted to point out I am not sure if this is the proper repo to ask this, or if it should be in the SDK one. I also want to say I am not taking this lightly, I know its not a priority, and I do not expect resources to be allocated to it in the near future. That being said, before I go to the question, here is some context:

Currently, we do not do too much analysis or learning that would require this feature. In fact, when working on time series of files, we have up until know either:

not required the structural information provided only by Babelfish, and relied directly on git functionalities (blame, log, ...), for instance as implemented in src-d/hercules
parsed each blob of interest, extracted features from each UAST, then done manual diffing either for data compressiom purposes or in some cases for better features, for instance as implemented in src-d/tm-experiments

Obviously, both of those methods are flawed: the first can not really be used for deep analysis, while the second does not scale well, and involves reprocessing the same data multiple times - most notably, if diffing two subsequent commits with only a couples lines added or removed.

What I would like to be implemented is essentially a way to parse multiple versions of a files at once which would avoid this manual work and time loss, as well as enable diffing. The way I see it, given a series of versions of file, I would like to be returned something like an augmented UAST, where each node would have an additional attribute indicating the versions in which the node appears. That way, either through implemented methods or manual parsing, we would be able to get the version of the UAST corresponding to each version of the file through that single augmented UAST, as well as diffing from any versions, and hopefully with way less compute time.

My questions are:

Is this a dumb idea with no value, and we should keep parsing each file separately ?
is this a dumb idea which would result in huge UASTs or longer process times then parsing each file separately ? 2 if that is the case, have you already thought of alternative ways of getting the same kind of result, and could you detail them ?
if that is not the case, how hard/time-consuming would it be to implement, and would you be ready to do it at some point - be it even with restrictions on the number of versions, file size, etc ?

dennwc commented 5 years ago

Is this a dumb idea with no value, and we should keep parsing each file separately ?

In terms of drivers, yes, we will parse files separately anyway. The parsers that we use don't have any way of incremental processing, so for us each parsing request will be separate. Diff will be executed on a resulting UASTs one way or another.

But from an API standpoint, it may be useful to have a parse requests that accepts multiple versions of the file, parses them and returns a merged UAST for all versions. This will require some changes on our end (e.g. diffing may produce a DAG instead of a tree).

is this a dumb idea which would result in huge UASTs or longer process times then parsing each file separately ?

I don't think so. Sending multiple files will allow us to use compression that is built into the UAST binary format, so the resulting tree from parsing multiple versions will be smaller than multiple separate trees. Even without the diff the compression may lead to a minor increase in the performance.

The diff itself will take time, of course, no way around it.

if that is the case, have you already thought of alternative ways of getting the same kind of result, and could you detail them ?

I think the idea is valid and I mentioned a few reasons why it may be a good idea from the efficiency standpoint.

if that is not the case, how hard/time-consuming would it be to implement, and would you be ready to do it at some point - be it even with restrictions on the number of versions, file size, etc ?

We actually have a diff implementation already. But we haven't tested it's output apart from the fact that applying the diff will result in the correct output. And we also haven't decided how the API might look like.

So your question may start this discussion. Specifically, you mentioned a good alternative to the file diff: instead of producing a diff itself, annotate nodes with versions in some way. We may decide to use this approach, or go back to returning diffs with node IDs. We now have much more resources so we can start evaluation those approaches depending the priority of this feature.

creachadair commented 5 years ago

This is definitely a good idea, and one that has come up before, see for example https://github.com/src-d/feature-idea/issues/2, https://github.com/src-d/devrel/issues/115. Tree diff is a tricky algorithm, but manageable. The more interesting questions, in my view, are how it fits into the API and how the results are surfaced to the client. So as @dennwc says, this is a conversation we need to have if we want tree diff to happen.

r0mainK commented 5 years ago

Okay I see, thanks for the swift answer :) I was not aware of the different issues already opened on the subject, or the fact there already was an implementation already. What I don't understand is when you say:

In terms of drivers, yes, we will parse files separately anyway.

Why is that the case ? Is that due to the fact that from a theoretical point of view it is impossible, or simply due to the way Babelfish is constructed ? I might be understanding ASTs wrong as I'm not the most knowledgeable in that regard, so I'll expand a bit on what I had in mind, with a super simplified way this could be done.

Given two versions of a file, instead of parsing both files, wouldn't one be able to apply a diffing algorithm at the line level to create a concatenated file, keep an index tracking which line appears in which version, parse the resulting file with Babelfish as is, and then use the index to deduce the diff nodes, as well as each version's UAST (granted, with some work as not all nodes have positional information) ?

what I mean:

v1: line a > line b1 > line c 
v2: line a > line b2 > line c
concat: line a > line b1 >  line b2 > line c
index = {a: [v1, v2] ; b1: [v1] ; b2:[v2] ; c: [v1, v2]}
uast: nodes(a: v1, v2), nodes(b1: v1), nodes(b2: v2) nodes(c: v1, v2),

Unless I'm mistaken, this could already be done entirely on the client side for simple diffing (in order to get nodes which have positional information, like identifiers), but wouldn't it be possible to apply it to the entire UAST in the same fashion, or are there blockers I don't see ?

What I was thinking was essentially to apply a diffing alg before sending the data to the drivers, then when receiving it augment the UAST by annotating each node with the index, then send the augmented UAST to the user, and provide through the API the diffing (which indeed would probably not always result in a UAST) as well as version selection.

dennwc commented 5 years ago

@r0mainK If I understand the idea correctly, this won't work for the generic case.

Imagine the language construct that spans multiple lines. So adding/removing/changing lines nearby may break the syntax of this language construct. So instead of getting a nice merged UAST with both version we will get a syntax error from the driver.

Systems like TreeSitter will handle it much better because they support incremental parsing. In case of Babelfish we usually run parsers that are a part of the compiler, thus they are not suitable for incremental parsing on per-line level. Instead they parse the file as a whole, usually.

Having said that, the diff algorithm can definitely take positions into account. The implementation that we have right now doesn't, however.

r0mainK commented 5 years ago

@dennwc yes I kind of had that in mind, this example was more to convey the idea them describe exactly how it would work. To avoid syntax errors, I was thinking of a simplistic parsing per language, which would use knowledge of the language to compute the level of abstraction at which each line is situated:

class foo:   # 0
     def __call__(self): # 1
           return "bar" # 2
     def bar(self, l): # 1
           l = [   # 2
                 e for e in l  # 3
           ]  # 2
           return l    # 2

Then use that to merge versions - something which not require extensive parsing, and especially:

Could be used in pair with a compressed way of passing versions, ie line diffs outputed by blame for all versions but the first (something like a list of line_index, line content, bool_add, bool_removed) instead of passing all versions at once.
Due to the point above, may result in a perf increase, especially when handling 100s or 1000s of versions, as parsing in this simple way, creating a merged file, then parsing it once "for real" with the driver without Syntax Errors.
Could be done quasi-simultaneously with the actual parsing.

However as you pointed out, since Babelfish does not do incorporate incremental parsing by default, this would have to be done completely separately from the actual parsing, instead of simultaneously, which would clearly add overhead, and would entail adding some kind of wrapper per language on each driver to create a concatenated file and then the UAST, and after to augment it with version info.

Anyway thanks for the input, just tried adding my grain of salt although I did expect there would be some issues with this approach u.u"

However I do think an API receiving versions this way, and returning a single augmented UAST, would be a pretty cool design:

it would leverage functionalities that are often included upstream - such as git blame
easier object handling and manipulation post parsing
it would enable dynamic diffing between large number of versions without later parsing, in cases where you don't yet have knowledge of which diff is of interest, ie if you just want to dump the augmented UAST nodes in a DB
it would be much less data to handle for greedy clients, wishing to use UASTs for each versions, as well as diffs between each pairs of versions.

bzz commented 5 years ago

The more interesting questions, in my view, are how it fits into the API and how the results are surfaced to the client.

@creachadair I did not dig deep into this, but this seems to be the de-facto standard of API for treeDiff that people in academia use nowadays https://github.com/GumTreeDiff/gumtree/wiki/GumTree-API#getting-the-mappings-between-two-trees and https://www.monperrus.net/martin/tree-differencing has some examples/algorithms listed

bblfsh / sdk

Feature request: UAST diffing #425