JohnlNguyen opened 5 years ago
Yeah, that seems like a good place to start. `p` can probably be traversed, in which case you will run into both non-terminals and terminals (a.k.a. leaves, or tokens); the tokens are the ones you can print to reconstruct the original code. We can envision printing a simple AST of `x = 0` like this:
`(VarDecl (Identifier x ) = 0)`
(or something similar); teaching it to generate this should all but ensure syntactically valid code.
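As a minimal sketch of such a linearization, using Python's own `ast` module (so the node names below are Python's, e.g. `Assign` rather than the hypothetical `VarDecl` above):

```python
import ast

def linearize(node):
    """Recursively print an AST node as an s-expression.

    Identifiers and constants become plain tokens; every other node
    becomes a parenthesized group labeled with its node-type name.
    """
    if isinstance(node, ast.Name):
        return f"(Identifier {node.id})"
    if isinstance(node, ast.Constant):
        return str(node.value)
    children = " ".join(linearize(c) for c in ast.iter_child_nodes(node))
    return f"({type(node).__name__} {children})"

tree = ast.parse("x = 0").body[0]  # the Assign node for `x = 0`
print(linearize(tree))  # → (Assign (Identifier x) 0)
```

A real version would need to decide which non-terminals to keep and where to emit operator tokens like `=`, but this is the basic shape of the traversal.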
Taking a step back for a moment, though: one reason we were interested in this was that the code appeared to often be syntactically invalid. However, since you said that the benchmark does actually contain punctuation, it may be good to look into other reasons why we are not generating it; it might just be the library we are using (for NLP translation I can imagine just dropping punctuation is fine and even helps performance).
I haven't been able to pinpoint where the punctuation gets dropped yet. Since you have the code set up to run, could you maybe dump the data at a couple of places, especially right before it gets fed to the model? I assume it doesn't get trained with punctuation. It's possible that this happens because the code relies on some NLP tokenizer (like here); it's just hard to see from static inspection.
I trained the model using intent → linearized AST. The result is pretty good. I got a 90% parse rate. From manual inspection, the AST looks very close to what the source code should be.
Here are some sample AST outputs:
Awesome, very interesting. Is it possible to enrich the output file with the original syntax and the syntax generated from "walking" the AST? The latter might be a bit tricky because it's not necessarily a valid AST; not sure if the AST parser can read its own parsed output (maybe this)?
Alternatively, a heuristic version based on just traversing the leaves and in-lining some non-terminals. I might have time to write that later. Some rules that I noticed:
No worries about the second point; I misread the output file as consisting of ground-truth and produced ASTs. I'll think about writing a crawler to convert the linearized format back to syntax; if I have time, that might be good to have for a submission to the challenge.
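Such a leaf-traversal heuristic could start with something like the sketch below, which assumes the s-expression format from earlier and treats capitalized alphabetic tokens as non-terminal labels to drop (that assumption won't hold for every grammar, e.g. capitalized identifiers):

```python
def leaves(sexpr):
    """Heuristically recover the terminal tokens from a linearized AST.

    Strips parentheses and drops capitalized alphabetic tokens
    (assumed to be non-terminal labels), keeping everything else.
    """
    tokens = sexpr.replace("(", " ").replace(")", " ").split()
    return [t for t in tokens if not (t[0].isupper() and t.isalpha())]

print(" ".join(leaves("(VarDecl (Identifier x ) = 0)")))  # → x = 0
```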
@VHellendoorn In Python, there is the `ast` module, which parses a code snippet into a Python AST. It works like this:
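For example (the exact `ast.dump` output varies slightly across Python versions):

```python
import ast

# Parse a one-line snippet into a Python AST and show its structure.
tree = ast.parse("x = 0")
print(ast.dump(tree))
# e.g. Module(body=[Assign(targets=[Name(id='x', ctx=Store())],
#                          value=Constant(value=0))], ...)
```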
and we can compile it back to source using astor:
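A round-trip sketch: astor's call for this is `astor.to_source(tree)`, but on Python 3.9+ the standard-library `ast.unparse` does the same job, so the example below uses that to stay dependency-free:

```python
import ast

# Parse source into an AST, then compile the AST back to source.
tree = ast.parse("x = 0")
source = ast.unparse(tree)  # with astor: astor.to_source(tree)
print(source)  # → x = 0
```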
My question is how we are going to leverage this information to improve our model. Are we going to serialize the tree and use that as the output, so the task becomes NL → serialized AST?