JohnlNguyen / semantic_code_search

Semantic Code Search building on a fork from tensor2tensor
Apache License 2.0

Using AST to learn code grammar #2

Open JohnlNguyen opened 5 years ago

JohnlNguyen commented 5 years ago

@VHellendoorn In Python, there is the ast module, which parses a code snippet into a Python AST. It works like this:

import ast

expr = """
def foo():
    print("hello world")
"""
p = ast.parse(expr)

and we can compile it back to source using astor:

astor.code_gen.to_source(p)
"def foo():\n    print('hello world')\n"

My question is: how are we going to leverage this information to improve our model? Are we going to serialize the tree and use that as the output, so the task becomes NL -> serialized AST?

VHellendoorn commented 5 years ago

Yeah, that seems like a good place to start. p can probably be traversed, in which case you will run into both non-terminals and terminals (a.k.a. leaves, or tokens); the tokens are the ones you can print to reconstruct the original code. We can envision printing a simple AST of x = 0 like this: (VarDecl (Identifier x ) = 0) (or something similar); teaching it to generate this should all but ensure syntactically valid code.
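Such a traversal can be sketched with the stdlib ast module; the s-expression format below is purely illustrative, not a fixed design:

```python
import ast

def linearize(node):
    """Serialize a Python AST into an s-expression-like string."""
    if isinstance(node, ast.AST):
        parts = [type(node).__name__]
        for _, value in ast.iter_fields(node):
            parts.append(linearize(value))
        return "( " + " ".join(p for p in parts if p) + " )"
    elif isinstance(node, list):
        return " ".join(linearize(v) for v in node)
    elif node is None:
        return ""
    else:
        return repr(node)  # terminal: a literal token

tree = ast.parse("x = 0")
print(linearize(tree))
```

Balanced parentheses make it easy for a downstream step to check (or enforce) well-formedness of generated sequences.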

Taking a step back real quickly though, one reason we were interested in this was that the generated code appeared to often be syntactically invalid. However, since you said that the benchmark does actually contain punctuation, it may be good to look into other reasons why we are not generating it; it might just be the library we are using (for NLP translation I can imagine just dropping punctuation is fine and helps performance).

VHellendoorn commented 5 years ago

Haven't been able to pinpoint where the punctuation gets dropped yet. Since you have the code set up to run, could you maybe dump the data at a couple of places, especially right before it gets fed to the model? I assume it doesn't get trained with punctuation. It's possible that this happens because the code relies on some NLP tokenizer (like here); it's just hard to see from static inspection.
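For illustration (this is not the tokenizer actually used in the repo), a typical word-level NLP tokenizer silently drops punctuation, which alone would explain syntactically invalid output:

```python
import re

def word_tokenize(text):
    # Keeps runs of word characters and discards everything else,
    # including parentheses, colons, and quotes.
    return re.findall(r"\w+", text)

print(word_tokenize("def foo(): print('hello world')"))
# → ['def', 'foo', 'print', 'hello', 'world']
```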

JohnlNguyen commented 5 years ago

I trained the model using intent → linearized AST. The result is pretty good. I got a 90% parse rate. From manual inspection, the AST looks very close to what the source code should be.
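A parse rate like that could be measured roughly as follows (the names here are illustrative; `generated` stands in for the model outputs after conversion back to code):

```python
import ast

def parses(code):
    # True if the snippet is syntactically valid Python.
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def parse_rate(snippets):
    # Fraction of snippets that parse successfully.
    return sum(parses(s) for s in snippets) / len(snippets)

generated = ["x = 0", "def f(:"]  # illustrative model outputs
print(parse_rate(generated))  # → 0.5
```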

Here are some sample AST outputs

[Screenshot: sample AST outputs]
VHellendoorn commented 5 years ago

Awesome, very interesting. Is it possible to enrich the output file with the original syntax and the syntax generated from "walking" the AST? The latter might be a bit tricky because it's not necessarily a valid AST; not sure if the AST parser can read back its own parsed output (maybe this)?
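For trees that are valid, the round trip itself is easy to check; on Python 3.9+ the stdlib ast.unparse can stand in for astor.code_gen.to_source (a sketch, assuming a well-formed AST):

```python
import ast

src = "def foo():\n    print('hello world')"
tree = ast.parse(src)

# Regenerate source from the tree, then re-parse and compare dumps to
# confirm the round trip preserved the AST.
regenerated = ast.unparse(tree)
assert ast.dump(ast.parse(regenerated)) == ast.dump(tree)
```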

Alternatively, a heuristic version could work, based on just traversing the leaves and in-lining some non-terminals. I might have time to write that later. Some rules that I noticed:
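One way such a heuristic might start (the node handling here is illustrative and covers only a few cases):

```python
import ast

def to_tokens(node):
    # Emit source-like tokens for a few known node types, and fall back
    # to recursing over children for everything else.
    if isinstance(node, ast.Name):
        return [node.id]
    if isinstance(node, ast.Constant):
        return [repr(node.value)]
    if isinstance(node, ast.Assign):
        return to_tokens(node.targets[0]) + ["="] + to_tokens(node.value)
    tokens = []
    for child in ast.iter_child_nodes(node):
        tokens.extend(to_tokens(child))
    return tokens

print(" ".join(to_tokens(ast.parse("x = 0"))))  # → x = 0
```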

JohnlNguyen commented 5 years ago
VHellendoorn commented 5 years ago

No worries about the second point; I misread the output file as consisting of ground-truth and produced ASTs. I'll think about writing a crawler for converting the format back to syntax; if I have time, that might be good to have for a submission to the challenge.