AFLplusplus / Grammar-Mutator

A grammar-based custom mutator for AFL++
Apache License 2.0
229 stars 17 forks source link

json to g4 only with "parser" cause some syntax error #43

Open 0x7Fancy opened 8 months ago

0x7Fancy commented 8 months ago

In my experimental environment, I found json to g4 only with "parser" cause some syntax error, syntax parsing errors may lead to the possibility of losing a large amount of mutated data.

I made mincase lex.json:

{
    "<A>": [["<NUMBER>", "<STRING>", "\n"]],
    "<NUMBER>": [["10"], ["99"]],
    "<STRING>": [["(", "<HEXSTRING>", ")"]],
    "<HEXSTRING>": [["<CHAR>", "<HEXSTRING>"], []],
    "<CHAR>": [
            ["0"], ["1"], ["2"], ["3"], ["4"], ["5"], ["6"], ["7"],
            ["8"], ["9"], ["a"], ["b"], ["c"], ["d"], ["e"], ["f"]
    ]
}

Grammar-Mutator make it, generate Grammar.g4 is:

grammar Grammar;
entry
    : node_A EOF
    ;
node_A
    : node_NUMBER node_STRING '\n'
    ;
node_NUMBER
    : '10'
    | '99'
    ;
node_STRING
    : '(' node_HEXSTRING ')'
    ;
node_HEXSTRING
    : 
    | node_CHAR node_HEXSTRING
    ;
node_CHAR
    : '0'
    | '1'
    | '2'
    | '3'
    | '4'
    | '5'
    | '6'
    | '7'
    | '8'
    | '9'
    | 'a'
    | 'b'
    | 'c'
    | 'd'
    | 'e'
    | 'f'
    ;

we prepared input data seed1 / seed2, and use antlr4-parse to testing:

Screen Shot 2024-01-18 at 17 03 03

why is 10(10) parsed incorrectly? because antlr4 is divided into two stages: lexer and parser. during lexer stage, node_NUMBER:10 will be recognized as TOKEN, and in the parser stage, the result is node_NUMBER (node_NUMBER), so an error occurred.

in the antlr4 grammar, lex rules begin with an uppercase letter, parser rules begin with a lowercase letter, so we should tell antlr4 the lexical rules clearly, patch Grammar_patch.g4:

grammar Grammar_patch;
entry
    : node_A EOF
    ;
node_A
    : node_NUMBER Node_STRING '\n'
    ;
node_NUMBER
    : '10'
    | '99'
    ;
Node_STRING
    : '(' Node_HEXSTRING ')'
    ;
Node_HEXSTRING
    : 
    | Node_CHAR Node_HEXSTRING
    ;
Node_CHAR
    : '0'
    | '1'
    | '2'
    | '3'
    | '4'
    | '5'
    | '6'
    | '7'
    | '8'
    | '9'
    | 'a'
    | 'b'
    | 'c'
    | 'd'
    | 'e'
    | 'f'
    ;

testing again:

Screen Shot 2024-01-18 at 17 18 58

the "warning" prompts us it can match the empty string, this may cause antlr4 parsing backtrace issues, but we can easily mark it with fragment Node_HEXSTRING

maybe we can optimize the json to g4 generation code, to distinguish between lexer and parser?

h1994st commented 8 months ago

If you have time to work on this, that would be great! Feel free to submit a PR and add your test cases.

I can take a look at this later when I am available. Probably not in recent weeks :(

0x7Fancy commented 8 months ago

okay, I'll try my best to provide

0x7Fancy commented 8 months ago

I tried to solve the problem but I found that Grammar-Mutator has to rely on AST to work, if I add lexical rules, this is not transparent to Grammar-Mutator.

In the above example, we focus on the input data of 10(12), respectively using Grammar.g4 and Grammar_patch.g4, we can see that Grammar.g4 only with grammar parser has a complete AST structure (in line with Grammar-Mutator expectations)

I think this is a trade-off in the Grammar-Mutator design. It loses part of the mutation data, but the program design is more concise, clear and direct.

so, currently Grammar-Mutator is perfect