bblfsh / sdk

Babelfish driver SDK
GNU General Public License v3.0
23 stars 27 forks source link

Proposal: Allow DAGs in UAST #339

Open dennwc opened 5 years ago

dennwc commented 5 years ago

As discussed in multiple LA meetings, I'd like to enable DAG support in the SDK.

Although, there might be an ambiguity on what it actually entails. I showed multiple examples where DAGs were used with a single UAST (tree-like), multiple roots (token stream, native AST, UAST) and more graph-like structures (all mentioned UAST roots, schema nodes, root-less annotations pointing to UAST nodes, etc).

For now, I only propose to enable the simplest use case: a single UAST root, where a node can be referenced multiple times (without loops).

The goal of this issue is to start a discussion. Later it will be split into separate issues covering specific topics of the conversion. It will also serve as a first step toward full DAG support.

Use cases:

Required changes:

SDK will need to be updated to handle these kind of nodes properly. The change requires to track visited nodes and should be easy to implement.

Thanks to opaque binary format, there are no changes to the protocol.

The binary format, as it is right now, supports DAGs out of the box. However, it tries to automatically compress leaf subtrees by referencing the same node ID multiple times. This will no longer work with DAGs, since decoder will assume that its the same node, regardless if a node was generated by the compressor or not. Thus, we will need to add a new field to protobuf definition of the node. This field can be defined as bool clone and will control whatever this node was made for compression reasons (true), or it's exactly the same node from semantic point of view (false).

YAML encoder will probably need to be changed to print implicit node IDs. We could use YAML built-in node ID references, but turned out that there is no YAML library in Go that supports this feature. Thus, the best way is to encode an additional @ref pseudo-property, and deduplicate nodes based on this property when decoding. Same can be done for JSON.

Babelfish Web can probably ignore this change for now. The proxy can simply take the in-memory DAG and serialize it to JSON without using our encoder. This operation is safe and will only lead to learger JSON output for the Babelfish Web. Later, it can be updated to use our JSON encode and support @ref for a more compact encoding.

Libuast's UastLoad method will need to be updated to track visited nodes. It will then automatically work for all clients.

CC @bblfsh/maintainers

juanjux commented 5 years ago

https://github.com/src-d/backlog/issues/1321#issuecomment-446912588