Closed Anurag-Swarnim-Yadav closed 3 years ago
Attached is a perl script I use for my math computation graphs. It takes as input expressions written in ‘readable’ preorder notation like these 3:
X ( +s ( -s ( -s ( +s ( s e d ) ( s d a ) ) ( /s 1 ( -s 0 b ) ) ) ( +s ( -s d b ) ( +s ( s a ( /s a d ) ) ( s d 1 ) ) ) ) ( /s e ( s e ( s c d ) ) ) ) Y ( +s ( -s ( -s ( +s ( s e d ) ( s d a ) ) ( /s 1 ( -s 0 b ) ) ) ( +s ( +s ( -s d b ) ( s ( /s a d ) a ) ) d ) ) ( /s e ( s e ( s c d ) ) ) ) Z left right right left Commute X ( +v z ( v ( +v v ( -v ( v ( s e c ) v ) ( v x b ) ) ) 1 ) ) Y ( +v z ( v ( +v v ( -v ( v v ( s e c ) ) ( v b x ) ) ) 1 ) ) Z right left right right Commute X ( v ( +s ( is a ) ( ns ( +s ( s d ( -s b c ) ) ( s ( /s c a ) c ) ) ) ) ( nv y ) ) Y ( v ( +s ( is a ) ( ns ( +s ( s d ( -s b c ) ) ( *s c ( /s c a ) ) ) ) ) ( nv y ) ) Z Distribleft
And adds the format used by the GGNN to the front of each line like this:
+s +s -s -s -s -s +s +s s s e e d d s s d d a a /s /s 1 1 -s -s 0 0 b b +s +s -s +s d -s d b b s /s a d a +s d s a /s a d s d 1 /s /s e e s s e e s s c c d d
You can see in the perl script how the features and edges are tracked.
Take a look and I’m happy to help convert your problem and use it to help improve the OpenNMT documentation. What is your format?
If your format is a human language, I’ll ask you to find or create code that can parse it into a tree, my problem parses more directly thanks to the clear math syntax.
We could also set up a time for a phone meeting.
Regards, Steve
From: Anurag Swarnim Yadav Sent: Friday, June 19, 2020 11:28 AM To: SteveKommrusch/OpenNMT-py-ggnn-example Cc: Subscribed Subject: [SteveKommrusch/OpenNMT-py-ggnn-example] How to convert regularsource code file into ggnn (#1)
Hello, Could you please guide me on how we can convert the regular source code file into ggnn format like yours that we can feed directly to OpenNMT. Thanks Anurag — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.
# use strict; use warnings;
if ($ARGV[0]) { print "Usage: pre2graph.pl\n"; print " Transform src sequence to graph input structure.\n"; exit(1); }
my @xform; my $nodes;
sub edges { my $l = $[0]; my $s = $[1]; my $xform = $[2]; my $nodes = $[3]; my $Lin = $[4]; my $Rin = $[5]; my $LOin = $[6]; my $LLin = $[7]; my $RLin = $[8]; my $LRin = $[9]; my $RRin = $[10];
my $s2 = int($s/2);
if (${$xform}[$l+$s] ne "Null") {
${$L_in}[${$nodes}[$l]] = ${$nodes}[$l+2]." ";
${$R_in}[${$nodes}[$l]] = ${$nodes}[$l+$s]." ";
} else {
${$LO_in}[${$nodes}[$l]] = ${$nodes}[$l+2]." ";
}
if (${$xform}[$l+4] ne "Null") {
${$LL_in}[${$nodes}[$l]] = ${$nodes}[$l+4]." ";
}
if (${$xform}[$l+$s+2] ne "Null") {
${$RL_in}[${$nodes}[$l]] = ${$nodes}[$l+$s+2]." ";
}
if (${$xform}[$l+$s2+2] ne "Null") {
${$LR_in}[${$nodes}[$l]] = ${$nodes}[$l+$s2+2]." ";
}
if (${$xform}[$l+$s+$s2] ne "Null") {
${$RR_in}[${$nodes}[$l]] = ${$nodes}[$l+$s+$s2]." ";
}
}
sub place { my $prog = $[0]; my $pos = $[1]; my $step = $_[2];
if ($xform[$pos] ne "Null") {
$xform[$pos] = "FAIL";
return;
}
if ($prog =~s/^\( (..) //) {
my $op = $1;
my $left;
my $right = "";
my $in;
if ($prog =~s/^(\([^()]*)//) {
$in=1;
$left = $1;
while ($in >0) {
if ($prog =~s/^([^()]*)//) {
$left .= $1;
}
if ($prog =~s/^(\([^()]*)//) {
$in+=1;
$left .= $1;
}
if ($prog =~s/^(\)\s+)//) {
$in-=1;
$left .= $1;
}
}
} else {
$prog =~s/(.) //;
$left = $1;
}
if ($prog =~s/^(\([^()]*)//) {
$in=1;
$right = $1;
while ($in >0) {
if ($prog =~s/^([^()]*)//) {
$right .= $1;
}
if ($prog =~s/^(\([^()]*)//) {
$in+=1;
$right .= $1;
}
if ($prog =~s/^(\)\s+)//) {
$in-=1;
$right .= $1;
}
}
} else {
$prog =~s/(.) // ;
if ($1 ne ")") {
$right = $1;
}
}
if ($step < 4) {
$xform[$pos] = "FAIL"
} else {
$xform[$pos] = $op;
$nodes++;
place($left,$pos+2,int($step/2));
if ($right) {
place($right,$pos+$step,int($step/2));
}
}
} else {
# No tree below this node
$prog =~s/ *$//;
$xform[$pos] = $prog;
$nodes++;
}
return;
}
while (<>) { @xform = ("Null") x 256; $nodes=0; /X (. )Y (. )(Z .*)$/ || die "Bad syntax on input $"; my $progA = $1; my $progB = $2; my $transform = $3; place($progA,0,128); place($progB,1,128); my $merged = join(" ",@xform); die "Nodes > 100" if $nodes > 100; die "Programs too deep: X $progA Y $progB $transform with $merged" if $merged =~/FAIL/; $ = $merged;
@xform = ("Null") x 256;
my @nodes = ("Null") x 256;
my $i = 0;
my $node=0;
while (s/^(\S+)\s*//){
if ($1 ne "Null") {
$xform[$i] = $1;
$nodes[$i] = $node;
print "$1 ";
$node++;
}
$i++;
}
print "<EOT> ";
my @features = ("<unk> ") x ($node+1);
my @L_in = ("") x ($node+1);
my @R_in = ("") x ($node+1);
my @LL_in = ("") x ($node+1);
my @LR_in = ("") x ($node+1);
my @RL_in = ("") x ($node+1);
my @RR_in = ("") x ($node+1);
my @LO_in = ("") x ($node+1);
my @Strt_in = ("") x ($node+1);
my @End_in = ("") x ($node+1);
$features[$node] = "14 ";
foreach my $lvl0 ( 0 , 1 ) {
$features[$nodes[$lvl0]] = $lvl0." ";
if ($xform[$lvl0+2] ne "Null") {
edges($lvl0,128,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
}
foreach my $lvl1 ( $lvl0+2 , $lvl0+128 ) {
if ($xform[$lvl1] eq "Null") {
next;
}
$features[$nodes[$lvl1]] = ($lvl0+2)." ";
if ($xform[$lvl1+2] ne "Null") {
edges($lvl1,64,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
}
foreach my $lvl2 ( $lvl1+2 , $lvl1+64 ) {
if ($xform[$lvl2] eq "Null") {
next;
}
$features[$nodes[$lvl2]] = ($lvl0+4)." ";
if ($xform[$lvl2+2] ne "Null") {
edges($lvl2,32,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
}
foreach my $lvl3 ( $lvl2+2 , $lvl2+32 ) {
if ($xform[$lvl3] eq "Null") {
next;
}
$features[$nodes[$lvl3]] = ($lvl0+6)." ";
if ($xform[$lvl3+2] ne "Null") {
edges($lvl3,16,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
}
foreach my $lvl4 ( $lvl3+2 , $lvl3+16 ) {
if ($xform[$lvl4] eq "Null") {
next;
}
$features[$nodes[$lvl4]] = ($lvl0+8)." ";
if ($xform[$lvl4+2] ne "Null") {
edges($lvl4,8,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
}
foreach my $lvl5 ( $lvl4+2 , $lvl4+8 ) {
if ($xform[$lvl5] eq "Null") {
next;
}
$features[$nodes[$lvl5]] = ($lvl0+10)." ";
if ($xform[$lvl5+2] ne "Null") {
$features[$nodes[$lvl5+2]] = ($lvl0+12)." ";
if ($xform[$lvl5+4] ne "Null") {
$features[$nodes[$lvl5+4]] = ($lvl0+12)." ";
$L_in[$nodes[$lvl5]] = $nodes[$lvl5+2]." ";
$R_in[$nodes[$lvl5]] = $nodes[$lvl5+4]." ";
} else {
$LO_in[$nodes[$lvl5]] = $nodes[$lvl5+2]." ";
}
}
}
}
}
}
}
}
print join("",@features);
print "<EOT> ";
for (my $i = 0; $i <= $node; $i++) {
if ($L_in[$i] ne "") {
print ("$i $L_in[$i]");
}
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
if ($R_in[$i] ne "") {
print ("$i $R_in[$i]");
}
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
if ($LL_in[$i] ne "") {
print ("$i $LL_in[$i]");
}
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
if ($LR_in[$i] ne "") {
print ("$i $LR_in[$i]");
}
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
if ($RL_in[$i] ne "") {
print ("$i $RL_in[$i]");
}
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
if ($RR_in[$i] ne "") {
print ("$i $RR_in[$i]");
}
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
if ($LO_in[$i] ne "") {
print ("$i $LO_in[$i]");
}
}
print ", ";
$Strt_in[$node]="0 ";
for (my $i = 0; $i <= $node; $i++) {
if ($Strt_in[$i] ne "") {
print ("$i $Strt_in[$i]");
}
}
print ", ";
$End_in[$node]="1 ";
for (my $i = 0; $i <= $node; $i++) {
if ($End_in[$i] ne "") {
print ("$i $End_in[$i]");
}
}
print "X ",$progA,"Y ",$progB, $transform, "\n";
}
Hey @SteveKommrusch
Thanks a lot for the prompt reply. My input is a text file with a bunch of c++ programs or functions in it. I was wondering how I can convert my input in a format that I can feed to the ggnn.
Thanks Anurag
Anurag,
Parsing CPP sounds really interesting, I’m happy to help you make progress.
You could take a look at ANTLR: https://www.antlr.org/ I’ve used it for C source code parsing (but not C++). Here is the C++ grammar file: https://github.com/antlr/grammars-v4/blob/master/cpp/CPP14.g4
After downloading ANTLR, you can use it to create a tree output using these commands: alias grun='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.gui.TestRig' alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool' antlr4 CPP14.g4 javac C*.java grun C compilationUnit -tree yourfile.cpp
BTW, I’m on vacation this week, but will be able to help more starting 6/29.
Regards, Steve
From: Anurag Swarnim Yadav Sent: Monday, June 22, 2020 12:45 PM To: SteveKommrusch/OpenNMT-py-ggnn-example Cc: Steve Kommrusch; Mention Subject: Re: [SteveKommrusch/OpenNMT-py-ggnn-example] How to convert regularsource code file into ggnn (#1)
Hey @SteveKommrusch Thanks a lot for the prompt reply. My input is a text file with a bunch of c++ programs or functions in it. I was wondering how I can convert my input in a format that I can feed to the ggnn. Thanks Anurag — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Hey @SteveKommrusch First of all, thank you very much for pointing me in the right direction. I will go ahead and look into the tool and see If I can parse the C program. Also, I will update you with the results. Apart from this tool, we have cppminer and astminer that can do our job too. Let me know if you want to discuss on this.
Please let me know when you come back.
Enjoy your vacation
Thanks Anurag
Hey @SteveKommrusch,
I got ANTLR working. Please let me know your availability whenever possible. So that we can convert it into the ggnn input.
Thanks Anurag
To close out this issue; Anurag and I continued working through email and I have created a documented example (recently updated to use the GGNN embedding layer as of Jan 2021). The example is documented in the primary README.md for this repo.
Hi, Thank you for this amazing work. Can we use src/raw2graph.pl to generate input for Java files. I've used ANTLR via [Astminer] to parse java files. The output ASTs are in the following form:
{"label":"1542.java","ast":[{"token":"
Should I give this as input to src/raw2graph.pl to generate input suitable for ggnn. I've dataset of Java files and I want to train this model on that. Thank you.
Hello. The README.md file describes the example setup and use of the environment. If you look in src/setupgraph2seq.sh you can see how raw2graph.pl is used. For this example, the ANTLR files are in data/antlr_files and the AST is given with a parenthetical syntax. setupgraph2seq.sh will process this output and add the target for the model to predict - in this case the target is the filename of the antlr file (the filenames are the method names to predict). Note that this example doesn't have enough samples to train well, it is just an example of how to set up the system with ANTLR.
I used this ggnn logic for this paper about program equivalence: https://arxiv.org/pdf/2106.02452.pdf, but that did not leverage ANTLR files. I think you could use ANTLR for java with this ggnn so long as you convert to a parenthesis format. (Note, it is not good to have very deep AST trees, so processing the ANTLR in some way to remove unecessary layers is useful).
I found that for the problem of program equivalence I had more success with a transformer model as detailed in this paper: https://arxiv.org/abs/2109.10476
Hello,
Could you please guide me on how we can convert the regular source code file into ggnn format like yours that we can feed directly to OpenNMT.
Thanks Anurag