Enhancement - Antlr5 - roadmap / explorations / machine translation using neural nets

antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.

http://antlr.org

BSD 3-Clause "New" or "Revised" License


Enhancement - Antlr5 - roadmap / explorations / machine translation using neural nets #1866

Open johndpope opened 7 years ago

johndpope commented 7 years ago

I've been reviewing TensorFlow recently and wanted to share this idea. It is most likely not suitable for this repo, but I'm not sure of a better place to log it.

At its essence, think of one day translating code from one language to another as simply as using Google Translate. This is actually very doable today using TensorFlow + SyntaxNet.

What follows could be one approach, although more recent developments in DRAGNN could probably be factored in. DRAGNN at its core builds its own grammar files. (I haven't fully got my head around how this could be seeded.)

Consider the following steps:
Given a grammar file, somehow programmatically generate valid code in multiple languages (or use pre-canned examples from Wikipedia).
Parse the AST in a super language (Swift).
Train the net on these AST representation(s): given this AST in this language, the corresponding AST in this other language is ...
Use a DCGAN to forge valid, compilable code.
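The "train the net on this AST representation" step needs trees turned into sequences a model can consume. Here is a minimal sketch of one way to do that, assuming Python's built-in `ast` module stands in for an ANTLR parse tree. The function name `linearize` and the bracketed pre-order format are illustrative assumptions, not something proposed in this thread.

```python
import ast
from typing import List


def linearize(source: str) -> List[str]:
    """Flatten a Python AST into a bracketed pre-order token sequence."""
    tokens: List[str] = []

    def visit(node: ast.AST) -> None:
        # Open a bracket labelled with the node type, visit children, close it.
        tokens.append("(" + type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)
        tokens.append(")")

    visit(ast.parse(source))
    return tokens


# Hypothetical usage: one training example would pair two such sequences,
# one per language, for the same program.
sequence = linearize("x = 1")
print(" ".join(sequence))
```

A real pipeline would also need identifiers and literals in the sequence; this sketch keeps only node types to stay short.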

Consider that this trained model would be thrown away each month and replaced by competing models.

KvanTTT commented 7 years ago

What do you miss in ANTLR 4 for implementing such a translator?

johndpope commented 7 years ago

In all honesty, it's an unknown. Tackling this problem alone is not my intention, but consider having the support of industry to help. Perhaps a competition to translate code would be appropriate. That would flesh out the problem and reveal what's missing. It could be sponsored by IBM / Google / Microsoft / NVIDIA / Intel.

The training data is important, but being able to programmatically generate code may be critical to feed back into training.

linonetwo commented 7 years ago

Is there currently a tool that lets you write a CFG for some language X and then generates a code generator that can translate an AST produced by ANTLR 4 into language X?

Or can the code generation part only be written by hand?

KvanTTT commented 7 years ago

We (@PositiveTechnologies) use a unified AST (UST) in our open source Pattern Matching Engine PT.PM. The UST is obtained by converting an ANTLR parse tree, which in turn comes from the parser and thus from the grammar.

We are also developing a new proprietary engine for analyzing data flows on the unified AST. For this, the UST is converted to a CFG, to a PDG, and to a combined representation, the CPG (UST + CFG + PDG).

So, you can use the first project as a base for unified CFG.

johndpope commented 6 years ago

updates in this space https://github.com/src-d/code2vec - fyi @zurk

zurk commented 6 years ago

Hello, guys. Yes, we are implementing this article: https://arxiv.org/pdf/1803.09473.pdf using our own tooling like https://github.com/src-d/ml (for machine learning on the source code) and https://github.com/bblfsh/bblfshd/ to get Universal AST to be able to work with all languages in the same way.

As for a Google Translate for code: it is a really cool idea and may be possible in some cases, but it has a lot of hidden pitfalls. For example, many languages have a sort function, and in some of them NaN values end up at the beginning of the list, in others at the end. That can change your program's behavior completely. OK, have fun :)
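To make the NaN pitfall concrete, here is a small sketch of the two placements a translator would have to choose between. The helper names `sort_nan_first` and `sort_nan_last` are made up for illustration. Each one pins the NaN position explicitly with a sort key instead of relying on the language's default comparison.

```python
import math


def sort_nan_first(values):
    """Sort floats ascending with every NaN placed at the start."""
    # NaN gets key (False, 0.0); ordinary numbers get (True, value).
    return sorted(values, key=lambda v: (not math.isnan(v), 0.0 if math.isnan(v) else v))


def sort_nan_last(values):
    """Sort floats ascending with every NaN placed at the end."""
    # Ordinary numbers get (False, value); NaN gets key (True, 0.0).
    return sorted(values, key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v))


data = [3.0, float("nan"), 1.0, 2.0]
nan_first = sort_nan_first(data)
nan_last = sort_nan_last(data)
print(nan_first)
print(nan_last)
```

A translated program that calls the target language's plain sort would silently pick whichever of these two behaviors (or neither) that language implements.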

P.S.: You guys have a really good tool!

johndpope commented 5 years ago

Some work by Pengcheng Yin @pcyin & Graham Neubig

A Syntactic Neural Model for General-Purpose Code Generation
https://arxiv.org/abs/1704.01696
We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.

This paper proposes a syntax-driven neural code generation approach that generates an abstract syntax tree by sequentially applying actions from a grammar model.
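As a rough illustration of "generating an AST by sequentially applying actions from a grammar model", the sketch below replays a fixed action list against a toy grammar. Everything in it is an assumption made for illustration: the grammar in `RULES`, the `Node` class, and the hard-coded action list. In the paper a neural network chooses each action; that part is omitted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Toy grammar (hypothetical): rule name -> (head type, child types).
RULES = {
    "expr->BinOp": ("expr", ["expr", "op", "expr"]),
    "expr->Num": ("expr", ["NUM"]),
    "op->Add": ("op", []),
}
TERMINALS = {"NUM"}


@dataclass
class Node:
    type: str
    label: str = ""
    children: List["Node"] = field(default_factory=list)
    expanded: bool = False


def next_frontier(node: Node) -> Optional[Node]:
    """Return the leftmost unexpanded node in depth-first order, if any."""
    if not node.expanded:
        return node
    for child in node.children:
        found = next_frontier(child)
        if found is not None:
            return found
    return None


def apply_actions(actions) -> Node:
    """Build a tree by applying each action to the current frontier node."""
    root = Node("expr")
    for kind, arg in actions:
        target = next_frontier(root)
        if target is None:
            raise ValueError("tree is already complete")
        if kind == "rule":
            head, child_types = RULES[arg]
            if head != target.type:
                raise ValueError(f"{arg} cannot expand a {target.type} node")
            target.children = [Node(t) for t in child_types]
        elif kind != "token" or target.type not in TERMINALS:
            raise ValueError(f"invalid action {kind!r} for {target.type}")
        target.label = arg
        target.expanded = True
    if next_frontier(root) is not None:
        raise ValueError("action sequence left the tree incomplete")
    return root


def render(node: Node) -> str:
    """Turn the finished toy tree back into source text."""
    if node.type in TERMINALS:
        return node.label
    if node.label == "expr->BinOp":
        left, op, right = (render(c) for c in node.children)
        return f"({left} {op} {right})"
    if node.label == "expr->Num":
        return render(node.children[0])
    return "+"  # the only remaining rule in this toy grammar is op->Add


actions = [("rule", "expr->BinOp"), ("rule", "expr->Num"), ("token", "1"),
           ("rule", "op->Add"), ("rule", "expr->Num"), ("token", "2")]
result = render(apply_actions(actions))
print(result)  # prints (1 + 2)
```

Because every rule action is checked against the type of the node it expands, any action sequence that completes yields a tree that conforms to the grammar, which is the property the syntax-driven approach relies on.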

https://github.com/pcyin/NL2code

@RaphaelOlivier - also noteworthy. https://github.com/RaphaelOlivier/sempar-codgen

johndpope commented 5 years ago

contributions by @sriniiyer code + paper - Summarizing Source Code using a Neural Attention Model - CODENN https://github.com/sriniiyer/codenn https://github.com/sriniiyer/codenn/blob/master/summarizing_source_code.pdf

bitnom commented 5 years ago

I've also been investigating this possibility. Just came across this repo: https://github.com/pcyin/tranX

johndpope commented 5 years ago

related - natural language to executable code https://github.com/pcyin/NL2code https://arxiv.org/abs/1704.01696

fyi @pcyin / @neubig

UPDATE - https://github.com/github/CodeSearchNet

inshua commented 2 years ago

backward access
reference subtyped rule (# marked rule)
rule as set, support UNION, EXCLUDE and other operators

KvanTTT commented 2 years ago

@inshua Could you elaborate on these?

inshua commented 2 years ago

@inshua Could you elaborate on these?

backward access

reference subtyped rule (# marked rule)

For example, VB supports this syntax:

For i = 1 To 10
    For j = 1 To 10
        ...
Next j, i  ' close both j and i

I have solved it, but it would be better if ANTLR supported backward references.

nextStmt:
NEXT    # OnlyNext
| NEXT identifier   #NextId
| nextStmt#NextId(-1) identifier  # NextIdMore   // backward1 is NextId
| nextStmt#NextIdMore(-1) identifier # NextIdMore2
;

Here I also show references to subtyped rules: nextStmt#NextId and nextStmt#NextIdMore.
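The thread does not show how the `Next j, i` case was actually solved. One possible workaround, sketched here purely as an assumption, is to keep the grammar simple and check loop closing after parsing with a stack of open loop variables. The function `close_loops` and its arguments are hypothetical names.

```python
def close_loops(open_loops, next_vars):
    """Close loops for one Next statement and return the loops still open.

    open_loops: loop variables of the enclosing For statements, innermost last.
    next_vars:  identifiers listed after Next (empty list for a bare Next).
    """
    remaining = list(open_loops)
    # A bare Next closes exactly one loop without naming it.
    targets = next_vars if next_vars else [None]
    for name in targets:
        if not remaining:
            raise ValueError("Next without a matching For")
        innermost = remaining.pop()
        # VB identifiers are case-insensitive, so compare in lower case.
        if name is not None and name.lower() != innermost.lower():
            raise ValueError(f"Next {name} does not match For {innermost}")
    return remaining


# 'Next j, i' closes the inner loop j first, then the outer loop i.
after_both = close_loops(["i", "j"], ["j", "i"])
after_bare = close_loops(["i", "j"], [])
print(after_both, after_bare)
```

Such a check could run in a listener or visitor over the parse tree; the backward-reference syntax proposed above would instead move it into the grammar itself.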

rule as set, support UNION, EXCLUDE

We can write a rule as

rule : (rule1 | rule2);

That's good, but if we treat a rule as a set, this is just rule1 UNION rule2. We should support rule1 SUBTRACT rule2 and rule1 AND rule2 too, e.g.

// wrong rules,  just for a presentation
multi: '*' | '/';
op : '+' | '-' | multi;
multiExpr: expr multi expr;
expr: multiExpr (op - multi) multiExpr;   // `op - multi` got '+' '-'

And rule1 AND rule2 is very useful too.
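The intended semantics of `op - multi` can be modelled with ordinary sets when each rule only matches single token literals. This is a sketch of the idea under that assumption, not of any ANTLR feature; rules that match sequences would need a more general notion of language difference.

```python
# Each rule is modelled as the set of token literals it can match.
multi = {"*", "/"}
op = {"+", "-"} | multi          # op : '+' | '-' | multi;  (UNION)

additive = op - multi            # op SUBTRACT multi
shared = op & multi              # op AND multi

print(sorted(additive))  # prints ['+', '-']
print(sorted(shared))    # prints ['*', '/']
```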

KvanTTT commented 2 years ago

rule as set, support UNION, EXCLUDE

Not sure such functionality should be integrated into ANTLR. It's out of EBNF and it's hard to imagine cases where it's required.

inshua commented 2 years ago

rule as set, support UNION, EXCLUDE

Not sure such functionality should be integrated into ANTLR. It's out of EBNF and it's hard to imagine cases where it's required.

Yes, it's very useful, like subtyped rules. I'll post more cases when I come across new ones.