SteveKommrusch / OpenNMT-py-ggnn-example

Training, validation, and test files for GGNN encoder option in OpenNMT-py.
10 stars 4 forks source link

How to convert regular source code file into ggnn #1

Closed Anurag-Swarnim-Yadav closed 3 years ago

Anurag-Swarnim-Yadav commented 4 years ago

Hello,

Could you please guide me on how we can convert the regular source code file into ggnn format like yours that we can feed directly to OpenNMT.

Thanks Anurag

SteveKommrusch commented 4 years ago

Attached is a perl script I use for my math computation graphs. It takes as input expressions written in ‘readable’ preorder notation like these 3:

X ( +s ( -s ( -s ( +s ( s e d ) ( s d a ) ) ( /s 1 ( -s 0 b ) ) ) ( +s ( -s d b ) ( +s ( s a ( /s a d ) ) ( s d 1 ) ) ) ) ( /s e ( s e ( s c d ) ) ) ) Y ( +s ( -s ( -s ( +s ( s e d ) ( s d a ) ) ( /s 1 ( -s 0 b ) ) ) ( +s ( +s ( -s d b ) ( s ( /s a d ) a ) ) d ) ) ( /s e ( s e ( s c d ) ) ) ) Z left right right left Commute X ( +v z ( v ( +v v ( -v ( v ( s e c ) v ) ( v x b ) ) ) 1 ) ) Y ( +v z ( v ( +v v ( -v ( v v ( s e c ) ) ( v b x ) ) ) 1 ) ) Z right left right right Commute X ( v ( +s ( is a ) ( ns ( +s ( s d ( -s b c ) ) ( s ( /s c a ) c ) ) ) ) ( nv y ) ) Y ( v ( +s ( is a ) ( ns ( +s ( s d ( -s b c ) ) ( *s c ( /s c a ) ) ) ) ) ( nv y ) ) Z Distribleft

And adds the format used by the GGNN to the front of each line like this:

+s +s -s -s -s -s +s +s s s e e d d s s d d a a /s /s 1 1 -s -s 0 0 b b +s +s -s +s d -s d b b s /s a d a +s d s a /s a d s d 1 /s /s e e s s e e s s c c d d 0 1 2 3 4 5 6 7 8 9 10 11 10 11 8 9 10 11 10 11 6 7 8 9 8 9 10 11 10 11 4 5 6 7 8 9 11 11 8 9 11 13 13 11 6 7 8 10 10 12 12 8 10 10 2 3 4 5 4 5 6 7 6 7 8 9 8 9 14 0 2 1 3 2 4 3 5 4 6 5 7 6 8 7 9 8 10 9 11 14 16 15 17 20 22 21 23 24 26 25 27 30 32 31 33 32 34 33 35 35 36 39 40 40 41 44 46 46 47 48 49 51 52 54 56 55 57 58 60 59 61 62 64 63 65 , 0 54 1 55 2 30 3 31 4 20 5 21 6 14 7 15 8 12 9 13 14 18 15 19 20 24 21 25 24 28 25 29 30 44 31 45 32 38 33 39 35 37 39 43 40 42 44 51 46 48 48 50 51 53 54 58 55 59 58 62 59 63 62 66 63 67 , 0 4 1 5 2 6 3 7 4 8 5 9 6 10 7 11 30 34 31 35 33 36 39 41 44 47 , 0 30 1 31 2 20 3 21 4 14 5 15 6 12 7 13 30 38 31 39 33 37 39 42 44 48 , 0 56 1 57 2 32 3 33 4 22 5 23 6 16 7 17 20 26 21 27 30 46 33 40 44 52 46 49 54 60 55 61 58 64 59 65 , 0 58 1 59 2 44 3 45 4 24 5 25 6 18 7 19 20 28 21 29 30 51 33 43 44 53 46 50 54 62 55 63 58 66 59 67 , , 68 0 , 68 1 X ( +s ( -s ( -s ( +s ( s e d ) ( s d a ) ) ( /s 1 ( -s 0 b ) ) ) ( +s ( -s d b ) ( +s ( s a ( /s a d ) ) ( s d 1 ) ) ) ) ( /s e ( s e ( s c d ) ) ) ) Y ( +s ( -s ( -s ( +s ( s e d ) ( s d a ) ) ( /s 1 ( -s 0 b ) ) ) ( +s ( +s ( -s d b ) ( s ( /s a d ) a ) ) d ) ) ( /s e ( s e ( s c d ) ) ) ) Z left right right left Commute +v +v z z v v +v +v v v -v -v v v s v e c v s e c v v x b b x 1 1 0 1 2 3 2 3 4 5 6 7 6 7 8 9 10 11 12 12 10 11 13 13 8 9 10 11 10 11 4 5 14 0 2 1 3 4 6 5 7 6 8 7 9 10 12 11 13 12 14 13 15 14 16 19 20 22 24 23 25 , 0 4 1 5 4 28 5 29 6 10 7 11 10 22 11 23 12 18 13 19 14 17 19 21 22 26 23 27 , 4 8 5 9 10 14 11 15 12 16 , 4 10 5 11 10 18 11 19 12 17 , 0 6 1 7 6 12 7 13 10 24 11 25 13 20 , 0 28 1 29 6 22 7 23 10 26 11 27 13 21 , , 30 0 , 30 1 X ( +v z ( v ( +v v ( -v ( v ( s e c ) v ) ( v x b ) ) ) 1 ) ) Y ( +v z ( v ( +v v ( -v ( v v ( s e c ) ) ( v b x ) ) ) 1 ) ) Z right left right right Commute v v +s +s is is a a ns ns +s +s s s d d -s -s b b c c s s /s c c a c /s c a nv nv y y 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 10 11 12 13 12 13 8 9 10 11 12 12 10 11 13 13 2 3 4 5 14 0 2 1 3 2 4 3 5 10 12 11 13 12 14 13 15 16 18 17 19 22 24 23 25 24 26 29 30 , 0 32 1 33 2 8 3 9 10 22 11 23 12 16 13 17 16 20 17 21 22 28 23 29 24 27 29 31 , 0 4 1 5 2 6 3 7 8 12 9 13 10 14 11 15 22 26 , 0 8 1 9 8 22 9 23 10 16 11 17 22 27 , 0 34 1 35 2 10 3 11 10 24 11 25 12 18 13 19 23 30 , 10 28 11 29 12 20 13 21 23 31 , 4 6 5 7 8 10 9 11 32 34 33 35 , 36 0 , 36 1 X ( v ( +s ( is a ) ( ns ( +s ( s d ( -s b c ) ) ( s ( /s c a ) c ) ) ) ) ( nv y ) ) Y ( v ( +s ( is a ) ( ns ( +s ( s d ( -s b c ) ) ( s c ( /s c a ) ) ) ) ) ( nv y ) ) Z Distribleft

You can see in the perl script how the features and edges are tracked.

Take a look and I’m happy to help convert your problem and use it to help improve the OpenNMT documentation. What is your format?

If your format is a human language, I’ll ask you to find or create code that can parse it into a tree, my problem parses more directly thanks to the clear math syntax.

We could also set up a time for a phone meeting.

Regards, Steve

From: Anurag Swarnim Yadav Sent: Friday, June 19, 2020 11:28 AM To: SteveKommrusch/OpenNMT-py-ggnn-example Cc: Subscribed Subject: [SteveKommrusch/OpenNMT-py-ggnn-example] How to convert regularsource code file into ggnn (#1)

Hello, Could you please guide me on how we can convert the regular source code file into ggnn format like yours that we can feed directly to OpenNMT. Thanks Anurag — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

!/usr/bin/perl

# use strict; use warnings;

if ($ARGV[0]) { print "Usage: pre2graph.pl\n"; print " Transform src sequence to graph input structure.\n"; exit(1); }

my @xform; my $nodes;

sub edges { my $l = $[0]; my $s = $[1]; my $xform = $[2]; my $nodes = $[3]; my $Lin = $[4]; my $Rin = $[5]; my $LOin = $[6]; my $LLin = $[7]; my $RLin = $[8]; my $LRin = $[9]; my $RRin = $[10];

my $s2 = int($s/2);

if (${$xform}[$l+$s] ne "Null") {
  ${$L_in}[${$nodes}[$l]] = ${$nodes}[$l+2]." ";
  ${$R_in}[${$nodes}[$l]] = ${$nodes}[$l+$s]." ";
} else {
  ${$LO_in}[${$nodes}[$l]] = ${$nodes}[$l+2]." ";
}
if (${$xform}[$l+4] ne "Null") {
  ${$LL_in}[${$nodes}[$l]] = ${$nodes}[$l+4]." ";
}
if (${$xform}[$l+$s+2] ne "Null") {
  ${$RL_in}[${$nodes}[$l]] = ${$nodes}[$l+$s+2]." ";
}
if (${$xform}[$l+$s2+2] ne "Null") {
  ${$LR_in}[${$nodes}[$l]] = ${$nodes}[$l+$s2+2]." ";
}
if (${$xform}[$l+$s+$s2] ne "Null") {
  ${$RR_in}[${$nodes}[$l]] = ${$nodes}[$l+$s+$s2]." ";
}

}

sub place { my $prog = $[0]; my $pos = $[1]; my $step = $_[2];

if ($xform[$pos] ne "Null") {
    $xform[$pos] = "FAIL";
    return;
}

if ($prog =~s/^\( (..) //) {
    my $op = $1;
    my $left;
    my $right = "";
    my $in;

    if ($prog =~s/^(\([^()]*)//) {
        $in=1;
        $left = $1;
        while ($in >0) {
            if ($prog =~s/^([^()]*)//) {
                $left .= $1;
            }
            if ($prog =~s/^(\([^()]*)//) {
                $in+=1;
                $left .= $1;
            }
            if ($prog =~s/^(\)\s+)//) {
                $in-=1;
                $left .= $1;
            }
        }
    } else {
        $prog =~s/(.) //;
        $left = $1;
    }

    if ($prog =~s/^(\([^()]*)//) {
        $in=1;
        $right = $1;
        while ($in >0) {
            if ($prog =~s/^([^()]*)//) {
                $right .= $1;
            }
            if ($prog =~s/^(\([^()]*)//) {
                $in+=1;
                $right .= $1;
            }
            if ($prog =~s/^(\)\s+)//) {
                $in-=1;
                $right .= $1;
            }
        }
    } else {
        $prog =~s/(.) // ;
        if ($1 ne ")") {
            $right = $1;
        }
    }

    if ($step < 4) {
        $xform[$pos] = "FAIL"
    } else {
        $xform[$pos] = $op;
        $nodes++;
        place($left,$pos+2,int($step/2));
        if ($right) {
            place($right,$pos+$step,int($step/2));
        }
    }
} else {
    # No tree below this node
    $prog =~s/ *$//;
    $xform[$pos] = $prog;
    $nodes++;
} 

return;

}

while (<>) { @xform = ("Null") x 256; $nodes=0; /X (. )Y (. )(Z .*)$/ || die "Bad syntax on input $"; my $progA = $1; my $progB = $2; my $transform = $3; place($progA,0,128); place($progB,1,128); my $merged = join(" ",@xform); die "Nodes > 100" if $nodes > 100; die "Programs too deep: X $progA Y $progB $transform with $merged" if $merged =~/FAIL/; $ = $merged;

@xform = ("Null") x 256;
my @nodes = ("Null") x 256;

my $i = 0;
my $node=0;
while (s/^(\S+)\s*//){
  if ($1 ne "Null") {
    $xform[$i] = $1;
    $nodes[$i] = $node;
    print "$1 ";
    $node++;
  }
  $i++;
}
print "<EOT> ";
my @features = ("<unk> ") x ($node+1);
my @L_in = ("") x ($node+1);
my @R_in = ("") x ($node+1);
my @LL_in = ("") x ($node+1);
my @LR_in = ("") x ($node+1);
my @RL_in = ("") x ($node+1);
my @RR_in = ("") x ($node+1);
my @LO_in = ("") x ($node+1);
my @Strt_in = ("") x ($node+1);
my @End_in = ("") x ($node+1);
$features[$node] = "14 ";
foreach my $lvl0 ( 0 , 1 ) {
  $features[$nodes[$lvl0]] = $lvl0." ";
  if ($xform[$lvl0+2] ne "Null") {
    edges($lvl0,128,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
  }
  foreach my $lvl1 ( $lvl0+2 , $lvl0+128 ) {
    if ($xform[$lvl1] eq "Null") {
      next;
    }
    $features[$nodes[$lvl1]] = ($lvl0+2)." ";
    if ($xform[$lvl1+2] ne "Null") {
      edges($lvl1,64,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
    }
    foreach my $lvl2 ( $lvl1+2 , $lvl1+64 ) {
      if ($xform[$lvl2] eq "Null") {
        next;
      }
      $features[$nodes[$lvl2]] = ($lvl0+4)." ";
      if ($xform[$lvl2+2] ne "Null") {
        edges($lvl2,32,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
      }
      foreach my $lvl3 ( $lvl2+2 , $lvl2+32 ) {
        if ($xform[$lvl3] eq "Null") {
          next;
        }
        $features[$nodes[$lvl3]] = ($lvl0+6)." ";
        if ($xform[$lvl3+2] ne "Null") {
          edges($lvl3,16,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
        }
        foreach my $lvl4 ( $lvl3+2 , $lvl3+16 ) {
          if ($xform[$lvl4] eq "Null") {
            next;
          }
          $features[$nodes[$lvl4]] = ($lvl0+8)." ";
          if ($xform[$lvl4+2] ne "Null") {
            edges($lvl4,8,\@xform,\@nodes,\@L_in,\@R_in,\@LO_in,\@LL_in,\@RL_in,\@LR_in,\@RR_in);
          }
          foreach my $lvl5 ( $lvl4+2 , $lvl4+8 ) {
            if ($xform[$lvl5] eq "Null") {
              next;
            }
            $features[$nodes[$lvl5]] = ($lvl0+10)." ";
            if ($xform[$lvl5+2] ne "Null") {
              $features[$nodes[$lvl5+2]] = ($lvl0+12)." ";
              if ($xform[$lvl5+4] ne "Null") {
                $features[$nodes[$lvl5+4]] = ($lvl0+12)." ";
                $L_in[$nodes[$lvl5]] = $nodes[$lvl5+2]." ";
                $R_in[$nodes[$lvl5]] = $nodes[$lvl5+4]." ";
              } else {
                $LO_in[$nodes[$lvl5]] = $nodes[$lvl5+2]." ";
              }
            }
          }
        }
      }
    }
  }
}
print join("",@features);
print "<EOT> ";
for (my $i = 0; $i <= $node; $i++) {
  if ($L_in[$i] ne "") {
    print ("$i $L_in[$i]");
  }
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
  if ($R_in[$i] ne "") {
    print ("$i $R_in[$i]");
  }
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
  if ($LL_in[$i] ne "") {
    print ("$i $LL_in[$i]");
  }
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
  if ($LR_in[$i] ne "") {
    print ("$i $LR_in[$i]");
  }
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
  if ($RL_in[$i] ne "") {
    print ("$i $RL_in[$i]");
  }
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
  if ($RR_in[$i] ne "") {
    print ("$i $RR_in[$i]");
  }
}
print ", ";
for (my $i = 0; $i <= $node; $i++) {
  if ($LO_in[$i] ne "") {
    print ("$i $LO_in[$i]");
  }
}
print ", ";
$Strt_in[$node]="0 ";
for (my $i = 0; $i <= $node; $i++) {
  if ($Strt_in[$i] ne "") {
    print ("$i $Strt_in[$i]");
  }
}
print ", ";
$End_in[$node]="1 ";
for (my $i = 0; $i <= $node; $i++) {
  if ($End_in[$i] ne "") {
    print ("$i $End_in[$i]");
  }
}
print "X ",$progA,"Y ",$progB, $transform, "\n";

}

Anurag-Swarnim-Yadav commented 4 years ago

Hey @SteveKommrusch

Thanks a lot for the prompt reply. My input is a text file with a bunch of c++ programs or functions in it. I was wondering how I can convert my input in a format that I can feed to the ggnn.

Thanks Anurag

SteveKommrusch commented 4 years ago

Anurag,

Parsing CPP sounds really interesting, I’m happy to help you make progress.

You could take a look at ANTLR: https://www.antlr.org/ I’ve used it for C source code parsing (but not C++). Here is the C++ grammar file: https://github.com/antlr/grammars-v4/blob/master/cpp/CPP14.g4

After downloading ANTLR, you can use it to create a tree output using these commands: alias grun='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.gui.TestRig' alias antlr4='java -Xmx500M -cp "/usr/local/lib/antlr-4.7.1-complete.jar:$CLASSPATH" org.antlr.v4.Tool' antlr4 CPP14.g4 javac C*.java grun C compilationUnit -tree yourfile.cpp

BTW, I’m on vacation this week, but will be able to help more starting 6/29.

Regards, Steve

From: Anurag Swarnim Yadav Sent: Monday, June 22, 2020 12:45 PM To: SteveKommrusch/OpenNMT-py-ggnn-example Cc: Steve Kommrusch; Mention Subject: Re: [SteveKommrusch/OpenNMT-py-ggnn-example] How to convert regularsource code file into ggnn (#1)

Hey @SteveKommrusch Thanks a lot for the prompt reply. My input is a text file with a bunch of c++ programs or functions in it. I was wondering how I can convert my input in a format that I can feed to the ggnn. Thanks Anurag — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

Anurag-Swarnim-Yadav commented 4 years ago

Hey @SteveKommrusch First of all, thank you very much for pointing me in the right direction. I will go ahead and look into the tool and see If I can parse the C program. Also, I will update you with the results. Apart from this tool, we have cppminer and astminer that can do our job too. Let me know if you want to discuss on this.

Please let me know when you come back.

Enjoy your vacation

Thanks Anurag

Anurag-Swarnim-Yadav commented 4 years ago

Hey @SteveKommrusch,

I got ANTLR working. Please let me know your availability whenever possible. So that we can convert it into the ggnn input.

Thanks Anurag

SteveKommrusch commented 3 years ago

To close out this issue; Anurag and I continued working through email and I have created a documented example (recently updated to use the GGNN embedding layer as of Jan 2021). The example is documented in the primary README.md for this repo.

SamraMehboob commented 2 years ago

Hi, Thank you for this amazing work. Can we use src/raw2graph.pl to generate input for Java files. I've used ANTLR via [Astminer] to parse java files. The output ASTs are in the following form:

{"label":"1542.java","ast":[{"token":"","typeLabel":"compilationUnit","children":[1,2]},{"token":"date|time|parser","typeLabel":"IDENTIFIER","children":[]},{"token":"parser","typeLabel":"IDENTIFIER","children":[]}]}

Should I give this as input to src/raw2graph.pl to generate input suitable for ggnn. I've dataset of Java files and I want to train this model on that. Thank you.

SteveKommrusch commented 2 years ago

Hello. The README.md file describes the example setup and use of the environment. If you look in src/setupgraph2seq.sh you can see how raw2graph.pl is used. For this example, the ANTLR files are in data/antlr_files and the AST is given with a parenthetical syntax. setupgraph2seq.sh will process this output and add the target for the model to predict - in this case the target is the filename of the antlr file (the filenames are the method names to predict). Note that this example doesn't have enough samples to train well, it is just an example of how to set up the system with ANTLR.

I used this ggnn logic for this paper about program equivalence: https://arxiv.org/pdf/2106.02452.pdf, but that did not leverage ANTLR files. I think you could use ANTLR for java with this ggnn so long as you convert to a parenthesis format. (Note, it is not good to have very deep AST trees, so processing the ANTLR in some way to remove unecessary layers is useful).

I found that for the problem of program equivalence I had more success with a transformer model as detailed in this paper: https://arxiv.org/abs/2109.10476