dolik-rce / pegof

PEG grammar optimizer and formatter
Other
9 stars 2 forks source link

The actual sources need -std=c++17 to compile #1

Closed mingodad closed 2 years ago

mingodad commented 2 years ago

I'm on Ubuntu 18.04 and using gcc9 and to be able to build I needed to add this to CMakefile.txt:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index a1ba459..4ccced3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -22,3 +22,5 @@ add_custom_command(
 add_custom_target(test tests/run.sh DEPENDS pegof WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})

 target_include_directories(pegof BEFORE PRIVATE ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR})
+
+target_compile_features(pegof PUBLIC cxx_std_17)
mingodad commented 2 years ago

Also when trying to use the grammar shown bellow I'm getting syntax error, it seems that the problem is this rule:

definition      <- {printf("\n\n");} name  {printf(" ::=\n\t");} S (asttag S)? arrow exp

I'm using an action before a pattern/element -> name, packcc does accept it.

Grammar that give the problem:

%prefix "lpeg_re"

%header {
    typedef struct  {
        int doOutput;
    } lpeg_re_st;
}

%auxil "lpeg_re_st *"

pattern         <- exp !. {printf("\n\n");}
exp             <- S (grammar / alternative)

alternative     <- seq (<'/'> {printf("\n\t%s ", $1);} S seq)*
seq             <- prefix*
prefix          <- '&' S prefix / '!' S prefix / suffix
suffix          <- primary S ((<[+*?]> {printf(" %s ", $1);}
                            / '^' [+-]? num
                            / '->' {auxil->doOutput=0;} S (string / '{}' / name / num) {auxil->doOutput=1;}
                            / '=>' {auxil->doOutput=0;} S name) S)* {auxil->doOutput=1;}

primary         <- '(' {printf(" ( ");} exp  ')' {printf(" ) ");} / (string / keyword) / class / defined
                 / '{:' (name ':')? exp ':}'
                 / '=' name
                 / '@' exp
                 / '{}'
                 / '{~' exp '~}'
                 / '{|' exp '|}'   # missing table capture
                 / '{' exp '}'
         / '~?' # Expected match
         / '~>' S ( 'foldleft' / 'foldright' / 'rfoldleft' / 'rfoldright' )
         / '$' (string / name / num) # Arbitrary capture
                 / '.'
                 / name S !(asttag / arrow )
                 / '<' name '>'          ## old-style non terminals
         / '^' name
         / '%{' S name S '}'

grammar         <- definition+
definition      <- {printf("\n\n");} name  {printf(" ::=\n\t");} S (asttag S)? arrow exp

class           <- <'[' '^'? item (!']' item)* ']'> { printf(" %s ", $1);}
item            <- defined / range / .
range           <- . '-' [^\]]

S               <- ([ \t\f\r\n]  /  '--' [^\r\n]*)*  # spaces and comments
name            <- <[A-Za-z_]([A-Za-z0-9_] / '-' !'>' )*> { if(auxil->doOutput) printf(" %s ", $1);}
arrow           <- (  '<--' / '<==' / '<-|'  / '<-' )
num             <- [0-9]+
string          <- <'"' [^"]* '"' / "'" [^']* "'">  { if(auxil->doOutput) printf(" %s ", $1);}
defined         <- '%' name
keyword     <-  '`' <[^`]+> '`' { printf(" '%s' ", $1);}
asttag         <- ':' S name

%%
int main() {
    lpeg_re_st state;
    state.doOutput = 1;
    lpeg_re_context_t *ctx = lpeg_re_create(&state);
    while (lpeg_re_parse(ctx, NULL));
    lpeg_re_destroy(ctx);
    return 0;
}
mingodad commented 2 years ago

Also removing directives (%name ({} / string)) from the parser.peg and using this tool https://www.bottlecaps.de/convert/ to convert it to an EBNF understood by https://www.bottlecaps.de/rr/ui we can get a nice railroad diagram (https://en.wikipedia.org/wiki/Syntax_diagram).

Copy and paste the EBNF shown bellow at https://www.bottlecaps.de/rr/ui on the tab Edit Grammar then click on the tab View Diagram.

/* converted on Wed Apr 20, 2022, 10:59 (UTC+02) by pegjs-to-w3c v0.57 which is Copyright (c) 2011-2022 by Gunther Rademacher <grd@gmx.net> */

grammar  ::= _* ( statement _* )+ code? EOF
statement
         ::= directive
           | rule
           | line_comment
           | suffix_comment
rule     ::= identifier _* '<-' _* alternation_or_sequence_or_primary
directive
         ::= '%' ( 'value' | 'header' )
           | 'auxil'
           | 'prefix' _ string
           | 'source'
           | 'common'
           | 'earlyheader'
           | 'earlysource'
           | 'earlycommon' _ source
alternation_or_sequence_or_primary
         ::= sequence_or_primary ( _* '/' _* sequence_or_primary )*
sequence_or_primary
         ::= primary ( _* primary )*
primary  ::= suffix_comment
           | prefix_op? _* literal _* postfix_op? ( _* ( source | error ) )?
error    ::= '~' _* '{' source_code '}'
source   ::= '{' source_code '}'
char_class
         ::= '[' '^'? ( character ( '-' character )? )+ ']'
ruleref  ::= identifier ( ':' identifier )?
EOF      ::=

<?TOKENS?>

code     ::= '%%' [ #x9#xD#xA]* .*
literal  ::= ruleref
           | string
           | char_class
           | '.'
           | '$' [1-9] [0-9]*
           | '<' _* alternation_or_sequence_or_primary _* '>'
           | '(' _* alternation_or_sequence_or_primary _* ')'
string   ::= '"' ( '\' . | [^"] )* '"'
           | "'" ( '\' . | [^'] )* "'"
source_code
         ::= ( [^{}]* ( '{' source_code '}' )? )*
character
         ::= '\' .
           | [^#x5D]
identifier
         ::= [_a-zA-Z] [_a-zA-Z0-9]*
prefix_op
         ::= [&!]
postfix_op
         ::= [?*+]
line_comment
         ::= #xA [ #x9]* '#' [ #x9]* [^#xA]*
suffix_comment
         ::= _* '#'+ [ #x9]* [^#xA]*
_        ::= [ #x9#xD#xA]+
mingodad commented 2 years ago

With this changes to parser.peg I can use the problematic grammar shown here https://github.com/dolik-rce/pegof/issues/1#issuecomment-1103652933 :

diff --git a/parser.peg b/parser.peg
index 2103c92..44ed942 100644
--- a/parser.peg
+++ b/parser.peg
@@ -90,6 +90,10 @@ sequence
     }

 primary
+    <- <s:source> pb:primary_base { if(s) {$$ = s; append_child($$, pb);} else $$ = pb; }
+    / pb:primary_base { $$ = pb; }
+
+primary_base
     <- c:suffix_comment { $$ = c; }
     / (pr:prefix_op? _* l:literal _* po:postfix_op? (_* <s:source> / _* <e:error>)?) {
         if (!(pr || po || s || e )) {

And here is all my changes so far that also remove compiler warnings pointed by gcc9:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index a1ba459..4ccced3 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -22,3 +22,5 @@ add_custom_command(
 add_custom_target(test tests/run.sh DEPENDS pegof WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})

 target_include_directories(pegof BEFORE PRIVATE ${CMAKE_CURRENT_SOURCE_DIR} ${CMAKE_CURRENT_BINARY_DIR})
+
+target_compile_features(pegof PUBLIC cxx_std_17)
diff --git a/ast.cc b/ast.cc
index 0d3ce59..9f9f5d1 100644
--- a/ast.cc
+++ b/ast.cc
@@ -54,7 +54,7 @@ const char* AstNode::get_type_name() const {
     case AST_GROUP:             return "GROUP";
     case AST_CAPTURE:           return "CAPTURE";
     default:
-        Io::debug("ERROR: unexpected AST node type!\n");
+        Io::debug("%s\n", "ERROR: unexpected AST node type!");
         exit(2);
     }
 }
@@ -138,14 +138,14 @@ void AstNode::replace_child(AstNode* removed, AstNode* added, bool do_delete) {
 }

 void AstNode::debug() {
-    Io::print("### Original AST:\n");
+    Io::print("%s\n", "### Original AST:");
     print_ast();
-    Io::print("### Original formatted:\n");
+    Io::print("%s\n", "### Original formatted:");
     format();
     optimize();
-    Io::print("### Optimized AST:\n");
+    Io::print("%s\n", "### Optimized AST:");
     print_ast();
-    Io::print("### Optimized formatted:\n");
+    Io::print("%s\n", "### Optimized formatted:");
     format();
 }

diff --git a/ast_format.cc b/ast_format.cc
index 9b78fc6..890c3da 100644
--- a/ast_format.cc
+++ b/ast_format.cc
@@ -72,7 +72,7 @@ void AstNode::format() const {
         format_terminal();
         break;
     default:
-        Io::debug("ERROR: unexpected AST node type!\n");
+        Io::debug("%s\n", "ERROR: unexpected AST node type!");
         exit(2);
     }
 }
@@ -125,7 +125,7 @@ void AstNode::format_source() const {
     int is_directive = parent->type == AST_DIRECTIVE;

     if (has_newlines || is_c_directive) {
-        Io::print(" {\n");
+        Io::print("%s\n", " {");
         reindent(text, is_directive ? 4 : 8);
         Io::print("%s}", is_directive ? "" : "    ");
     } else {
@@ -134,20 +134,20 @@ void AstNode::format_source() const {
 }

 void AstNode::format_error() const {
-    Io::print(" ~");
+    Io::print("%s", " ~");
     format_source();
 }

 void AstNode::format_directive() const {
     AstNode* content = children[1];

-    Io::print("%%");
+    Io::print("%s", "%");
     children[0]->format();
     if (content->type == AST_STRING) {
-        Io::print(" ");
+        Io::print("%s", " ");
     }
     content->format();
-    Io::print(parent->children.back() == this ? "\n" : "\n\n");
+    Io::print("%s", parent->children.back() == this ? "\n" : "\n\n");
 }

 void AstNode::format_alternation() const {
@@ -164,7 +164,7 @@ void AstNode::format_alternation() const {
 void AstNode::format_sequence() const {
     for (size_t i = 0; i < children.size(); i++) {
         if (i > 0) {
-            Io::print(" ");
+            Io::print("%s", " ");
         }
         children[i]->format();
     }
@@ -186,7 +186,7 @@ void AstNode::format_ruleref() const {
     children[0]->format();
     if (children.size() == 2) {
         // has variable
-        Io::print(":");
+        Io::print("%s", ":");
         children[1]->format();
     }
 }
@@ -196,7 +196,7 @@ void AstNode::format_rule() const {
     bool hasAlternation = body->type == AST_ALTERNATION && body->children.size() > 1;
     Io::print("%s%s <- ", children[0]->text.c_str(), hasAlternation ? "\n   " : "");
     body->format();
-    Io::print(parent->children.back() == this ? "\n" : "\n\n");
+    Io::print("%s", parent->children.back() == this ? "\n" : "\n\n");
 }

 void AstNode::format_code() const {
diff --git a/ast_opt.cc b/ast_opt.cc
index a78484a..41762e9 100644
--- a/ast_opt.cc
+++ b/ast_opt.cc
@@ -99,7 +99,7 @@ int AstNode::optimize_repeats() {
     }

     if (first_op == '*' && second_op == '*') {
-        Io::debug("  Removing unnecessary repeated node (X* X* -> X*)\n");
+        Io::debug("%s\n", "  Removing unnecessary repeated node (X* X* -> X*)");
         second->parent->remove_child(second);
     } else if (first_op == '*' || second_op == '*') {
         Io::debug("  Removing unnecessary repeated node (X%c X%c -> X+)\n", first_op, second_op);
@@ -145,7 +145,7 @@ int AstNode::optimize_children() {
 }

 int AstNode::optimize_strip_comment() {
-    Io::debug("  Removing comment\n");
+    Io::debug("%s\n", "  Removing comment");
     parent->remove_child(this);
     return 1;
 }
@@ -313,6 +313,6 @@ int AstNode::optimize() {
     case AST_BACKREF:
         return optimize_children();
     }
-    Io::debug("ERROR: unexpected AST node type!\n");
+    Io::debug("%s\n", "ERROR: unexpected AST node type!");
     exit(2);
 }
diff --git a/io.cc b/io.cc
index 55aa4e6..97afb36 100644
--- a/io.cc
+++ b/io.cc
@@ -12,7 +12,7 @@ FILE *Io::output = stdin;
 void Io::open(const std::string& input_path, const std::string& output_path) {
     std::stringstream stream;
     if (input_path.empty()) {
-        Io::debug("Reading grammar from stdin\n");
+        Io::debug("%s\n", "Reading grammar from stdin");
         stream << std::cin.rdbuf();
     } else {
         Io::debug("Reading grammar from %s\n", input_path.c_str());
diff --git a/parser.peg b/parser.peg
index 2103c92..44ed942 100644
--- a/parser.peg
+++ b/parser.peg
@@ -90,6 +90,10 @@ sequence
     }

 primary
+    <- <s:source> pb:primary_base { if(s) {$$ = s; append_child($$, pb);} else $$ = pb; }
+    / pb:primary_base { $$ = pb; }
+
+primary_base
     <- c:suffix_comment { $$ = c; }
     / (pr:prefix_op? _* l:literal _* po:postfix_op? (_* <s:source> / _* <e:error>)?) {
         if (!(pr || po || s || e )) {
mingodad commented 2 years ago

It seems that the tests are broken, running the tests before and after my changes gives the same result 23 tests, 19 failures, 1 skipped .

mingodad commented 2 years ago

When I ask pegof to optimize the the grammar shown above and pass the result to packcc I'm getting this error:

packcc -o lpegrex lpegrex.peg 
packcc: lpegrex.peg:45:1: Illegal rule syntax
packcc: lpegrex.peg:45:21: Invalid directive
packcc: lpegrex.peg:45:22: Illegal rule syntax
packcc: lpegrex.peg:45:53: Invalid directive
packcc: lpegrex.peg:45:54: Illegal rule syntax
packcc: lpegrex.peg:45:96: Invalid directive
packcc: lpegrex.peg:45:99: Illegal rule syntax
packcc: lpegrex.peg:25:7: No definition of rule 'class'

Optimized grammar:

%prefix "lpeg_re"

%header {
    typedef struct  {
    int doOutput;
    } lpeg_re_st;
}

%auxil "lpeg_re_st *"

pattern <- S (definition+ / alternative) !. { printf("\n\n"); }

alternative <- prefix* (<"/"> { printf("\n\t%s ", $1); } S prefix*)*

prefix
    <- "&" S prefix
    / "!" S prefix
    / suffix

suffix <- primary S ((<[*+?]> { printf(" %s ", $1); } / "^" [+-]? [0-9]+ / "->" { auxil->doOutput=0; } S (string / "{}" / name / [0-9]+) { auxil->doOutput=1; } / "=>" { auxil->doOutput=0; } S name) S)* { auxil->doOutput=1; }

primary
    <- "(" { printf(" ( "); } S (definition+ / alternative) ")" { printf(" ) "); }
    / (string / keyword)
    / class
    / "%" name
    / "{:" (name ":")? S (definition+ / alternative) ":}"
    / "=" name
    / "@" S (definition+ / alternative)
    / "{}"
    / "{~" S (definition+ / alternative) "~}"
    / "{|" S (definition+ / alternative) "|}"
    / "{" S (definition+ / alternative) "}"
    / "~?"
    / "~>" S ("foldleft" / "foldright" / "rfoldleft" / "rfoldright")
    / "$" (string / name / [0-9]+)
    / "."
    / name S !(":" S name / ("<--" / "<==" / "<-|" / "<-"))
    / "<" name ">"
    / "^" name
    / "%{" S name S "}"

definition <-  { printf("\n\n"); } S (":" S name S)? ("<--" / "<==" / "<-|" / "<-") S (definition+ / alternative)

class <- <"[" "^"? "%" name / . "-" [^]] / . (!"]" "%" name / . "-" [^]] / .)* "]"> { printf(" %s ", $1); }

S <- ([\t\n\r f] / "--" [^\n\r]*)*

name <- <[A-Z_a-z] ([0-9A-Z_a-z] / "-" !">")*> { if(auxil->doOutput) printf(" %s ", $1); }

string <- <"\"" [^"]* "\"" / "'" [^']* "'"> { if(auxil->doOutput) printf(" %s ", $1); }

keyword <- "`" <[^`]+> "`" { printf(" '%s' ", $1); }

%%
int main() {
    lpeg_re_st state;
    state.doOutput = 1;
    lpeg_re_context_t *ctx = lpeg_re_create(&state);
    while (lpeg_re_parse(ctx, NULL));
    lpeg_re_destroy(ctx);
    return 0;
}
mingodad commented 2 years ago

Using this tool https://www.bottlecaps.de/convert/ (after removing directives/actions) it pointed out the error as shown bellow and looking through the code it seems that pegof is not outputting correctly this [^\]] -> [^]], manually fixing it then packcc accepts the optimized grammar.

lexical analysis failed
while expecting Character
at line 35, column 39:
...]] / . (!"]" "%" name / . "-" [^]] / .)* "]"> 

S <- ([\t\n\r ...
mingodad commented 2 years ago

After debugging a bit I found this fix to the escape problem, that also remove an unused variable and add missing escaped char \f, but the resulting optimized grammar that is accepted by packcc doesn't output the same results ans the original:

diff --git a/char_class.cc b/char_class.cc
index 760d93d..a5f205f 100644
--- a/char_class.cc
+++ b/char_class.cc
@@ -10,6 +10,8 @@ int CharacterClass::get_char(const std::string& s, size_t& pos) {
         case 'n': return '\n';
         case 't': return '\t';
         case 'v': return '\v';
+        case 'f': return '\f';
+        default: --pos; //any other escaped char remain escaped
         }
     }
     return s[pos-1];
@@ -21,6 +23,7 @@ std::string CharacterClass::to_char(int c) {
     case '\n': return "\\n";
     case '\t': return "\\t";
     case '\v': return "\\v";
+    case '\f': return "\\f";
     default: return std::string(1, (char)c);
     }
 }
@@ -42,7 +45,6 @@ CharacterClass::Tokens CharacterClass::tokenize(const std::string& input) {
     Tokens tokens;
     size_t pos = 0;
     while (pos < input.size()) {
-        Token r;
         tokens.push_back(get_range(input, pos));
     }

Example input:

text <- {~ item* ~}
item <- macro / [^()] / '(' item* ')'
arg <- ' '* {~ (!',' item)* ~}
args <- '(' arg (',' arg)* ')'
-- now we define some macros
macro <- ('apply' args) -> '%1(%2)'
    / ('add' args) -> '%1 + %2'
    / ('mul' args) -> '%1 * %2'

Original grammar output:

cat lpeg-re-test.peg | ./lpeg-re

 text  ::=
     item  * 

 item  ::=
     macro 
    /  [^()] 
    /  '('  item  *  ')' 

 arg  ::=
     ' '  *  (  ','  item  )  * 

 args  ::=
     '('  arg  (  ','  arg  )  *  ')' 

 macro  ::=
     (  'apply'  args  ) 
    /  (  'add'  args  ) 
    /  (  'mul'  args  ) 

Optimized grammar output:

cat lpeg-re-test.peg | ./lpegrex 
Syntax error
mingodad commented 2 years ago

Comparing the two grammars I can see that there is a problem with the inlining of item inside class, but even adding parenthesis around it the output doesn't match:

class <- <"[" "^"? ("%" name / . "-" [^\]] / .) (!"]" ("%" name / . "-" [^\]] / .))* "]"> { printf(" %s ", $1); }
mingodad commented 2 years ago

This tool https://www.bottlecaps.de/convert/ reports this on the optimized grammar:

grammar was parsed OK, but conversion failed
failed to eliminate direct left recursion from this production:
  definition
           ::= ( definition+ | alternative ) S ( '<--' | '<==' | '<-|' | '<-' ) ( S name S ':' )? S
result was:
definition
           ::= ( definition+ | alternative ) S ( '<--' | '<==' | '<-|' | '<-' ) ( S name S ':' )? S ( )*
mingodad commented 2 years ago

I noticed that there is no license on/for this repository. Can you add a license ?

dolik-rce commented 2 years ago

Hi @mingodad,

Thank you for testing this tool and for providing the feedback. It's quite a lot of unrelated problems, I'll probably split it into multiple issues for easier tracking and start to fix the problems as soon as I have some spare time.

Most of the stuff seems pretty straightforward, and you even proposed fixes, so it should not take much time.

dolik-rce commented 2 years ago

This tool https://www.bottlecaps.de/convert/ reports this on the optimized grammar:

grammar was parsed OK, but conversion failed
failed to eliminate direct left recursion from this production:
  definition
           ::= ( definition+ | alternative ) S ( '<--' | '<==' | '<-|' | '<-' ) ( S name S ':' )? S
result was:
definition
           ::= ( definition+ | alternative ) S ( '<--' | '<==' | '<-|' | '<-' ) ( S name S ':' )? S ( )*

I'm not familiar with that tool, but I believe that it is not really a problem for pegof. Some parsers (like packcc) handle recursion just fine, while others (like many EBNF based ones) cannot cope with left recursion. Converting left-recursive rules to right-recursive is not trivial, so I guess the tool does not fully support it.

dolik-rce commented 2 years ago

I have enforced the C++17 in cmake in https://github.com/dolik-rce/pegof/commit/e810812fe57ac360a1a3d59bef791183f145b233 which should solve the original issue.

The rest of the problems is in seperate issues and I'll look at them soonish (probably next week).