json to g4 only with "parser" cause some syntax error

0x7Fancy commented 10 months ago

In my experimental environment, I found json to g4 only with "parser" cause some syntax error, syntax parsing errors may lead to the possibility of losing a large amount of mutated data.

I made mincase lex.json:

{
    "<A>": [["<NUMBER>", "<STRING>", "\n"]],
    "<NUMBER>": [["10"], ["99"]],
    "<STRING>": [["(", "<HEXSTRING>", ")"]],
    "<HEXSTRING>": [["<CHAR>", "<HEXSTRING>"], []],
    "<CHAR>": [
            ["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"],
            ["8"], ["9"], ["a"], ["b"], ["c"], ["d"], ["e"], ["f"]
    ]
}

Grammar-Mutator make it, generate Grammar.g4 is:

grammar Grammar;
entry
    : node_A EOF
    ;
node_A
    : node_NUMBER node_STRING '\n'
    ;
node_NUMBER
    : '10'
    | '99'
    ;
node_STRING
    : '(' node_HEXSTRING ')'
    ;
node_HEXSTRING
    : 
    | node_CHAR node_HEXSTRING
    ;
node_CHAR
    : '0'
    | '1'
    | '2'
    | '3'
    | '4'
    | '5'
    | '6'
    | '7'
    | '8'
    | '9'
    | 'a'
    | 'b'
    | 'c'
    | 'd'
    | 'e'
    | 'f'
    ;

we prepared input data seed1 / seed2, and use antlr4-parse to testing:

Screen Shot 2024-01-18 at 17 03 03

why is 10(10) parsed incorrectly? because antlr4 is divided into two stages: lexer and parser. during lexer stage, node_NUMBER:10 will be recognized as TOKEN, and in the parser stage, the result is node_NUMBER (node_NUMBER), so an error occurred.

in the antlr4 grammar, lex rules begin with an uppercase letter, parser rules begin with a lowercase letter, so we should tell antlr4 the lexical rules clearly, patch Grammar_patch.g4:

grammar Grammar_patch;
entry
    : node_A EOF
    ;
node_A
    : node_NUMBER Node_STRING '\n'
    ;
node_NUMBER
    : '10'
    | '99'
    ;
Node_STRING
    : '(' Node_HEXSTRING ')'
    ;
Node_HEXSTRING
    : 
    | Node_CHAR Node_HEXSTRING
    ;
Node_CHAR
    : '0'
    | '1'
    | '2'
    | '3'
    | '4'
    | '5'
    | '6'
    | '7'
    | '8'
    | '9'
    | 'a'
    | 'b'
    | 'c'
    | 'd'
    | 'e'
    | 'f'
    ;

testing again:

Screen Shot 2024-01-18 at 17 18 58

the "warning" prompts us it can match the empty string, this may cause antlr4 parsing backtrace issues, but we can easily mark it with fragment Node_HEXSTRING

maybe we can optimize the json to g4 generation code, to distinguish between lexer and parser?

h1994st commented 10 months ago

If you have time to work on this, that would be great! Feel free to submit a PR and add your test cases.

I can take a look at this later when I am available. Probably not in recent weeks :(

0x7Fancy commented 10 months ago

okay, I'll try my best to provide

0x7Fancy commented 10 months ago

I tried to solve the problem but I found that Grammar-Mutator has to rely on AST to work, if I add lexical rules, this is not transparent to Grammar-Mutator.

In the above example, we focus on the input data of 10(12), respectively using Grammar.g4 and Grammar_patch.g4, we can see that Grammar.g4 only with grammar parser has a complete AST structure (in line with Grammar-Mutator expectations)

I think this is a trade-off in the Grammar-Mutator design. It loses part of the mutation data, but the program design is more concise, clear and direct.

so, currently Grammar-Mutator is perfect

AFLplusplus / Grammar-Mutator

json to g4 only with "parser" cause some syntax error #43