incorrect rule index deduction from ANTLR

trust1995 commented 3 years ago

In the input-parsing shim, nodes are created with:

node = node_create_with_rule_id(non_terminal_node->getRuleIndex(), non_terminal_node->getAltNumber() - 1);

However, in my tests getAltNumber() on an antlr4::ParserRuleContext node returns 0 for the OUTER nodes of a recursive rule. As a result, every node except the innermost one gets an invalid rule_id.

For example, for this G4 grammar:

A: B | A B

B: "MYTOKEN"

entry: A

The input "MYTOKEN MYTOKEN MYTOKEN" will be parsed as:

entry -> A -> A -> A -> B
         |    | -> B
         | -> B

The last A will have rule_id = 0, while the preceding ones have rule_id = MAXUINT. This happens not to break the fuzzer's behavior in this particular case, but it is a major issue when a grammar has several recursive expansions.
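The wraparound described above can be reproduced in isolation. The following is a minimal sketch, assuming the rule id is stored in a 32-bit unsigned integer and that 0 is the "no alternative recorded" value seen on the outer nodes. The names `INVALID_ALT`, `unchecked_rule_id` and `checked_rule_id` are hypothetical and are not part of Grammar-Mutator or ANTLR.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>

// Assumption: 0 means "no alternative recorded", which is the value the
// issue reports from getAltNumber() on the outer recursive nodes.
constexpr std::size_t INVALID_ALT = 0;

// Mirrors the `getAltNumber() - 1` expression from the issue: alternative
// numbers are 1-based, rule ids are 0-based, and nothing checks for 0.
constexpr std::uint32_t unchecked_rule_id(std::size_t alt_number) {
  // Unsigned arithmetic: 0 - 1 wraps around to the largest value.
  return static_cast<std::uint32_t>(alt_number - 1);
}

// Hypothetical guarded variant: reports "unknown" instead of wrapping.
constexpr std::optional<std::uint32_t> checked_rule_id(std::size_t alt_number) {
  if (alt_number == INVALID_ALT) return std::nullopt;
  return static_cast<std::uint32_t>(alt_number - 1);
}

// Innermost A (alternative 1) maps to rule_id 0, as observed in the issue.
static_assert(unchecked_rule_id(1) == 0);
// Outer A nodes (alternative reported as 0) wrap to the maximum value.
static_assert(unchecked_rule_id(0) == std::numeric_limits<std::uint32_t>::max());
// The guarded variant makes the invalid case explicit.
static_assert(!checked_rule_id(0).has_value());
```

A guard like this only makes the failure visible. It does not recover the alternative that was actually taken, so the parsed tree would still be unusable for mutation until the alternative number is correct.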

h1994st commented 3 years ago

Hi @trust1995

Thanks for reporting the issue.

The problem is due to your grammar, as it uses left recursion. It seems ANTLR4 cannot assign the alternative number correctly when left recursion is used. It gives the recursive rule A an invalid alternative number, 0, which results in rule_id = MAXUINT in antlr4_shim. I currently don't know why this happens. You can file an issue with the ANTLR4 community or check its source code :)

If you convert the grammar to right recursion, antlr4_shim works well:

entry: A

A: B | B A

B: "MYTOKEN"
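To see how the two grammars differ in tree shape while accepting the same input, here is a small illustrative sketch that prints the nesting each rule produces for a given number of tokens. The function names are hypothetical, the bracket notation is only for illustration, and the code does not call ANTLR or antlr4_shim. It assumes at least one token.

```cpp
#include <string>

// Tree shape for the left-recursive rule  A: B | A B
// The recursion sits on the left, so the outer A nodes wrap the inner ones.
std::string left_recursive_tree(unsigned n_tokens) {
  std::string tree = "A(B)";  // innermost A uses the first alternative
  for (unsigned i = 1; i < n_tokens; ++i) {
    tree = "A(" + tree + " B)";  // each outer A uses the second alternative
  }
  return tree;
}

// Tree shape for the right-recursive rule  A: B | B A
// Same language (one or more B), but the nesting grows to the right.
std::string right_recursive_tree(unsigned n_tokens) {
  std::string tree = "A(B)";  // innermost A uses the first alternative
  for (unsigned i = 1; i < n_tokens; ++i) {
    tree = "A(B " + tree + ")";  // each outer A uses the second alternative
  }
  return tree;
}
```

For three tokens, `left_recursive_tree(3)` yields `A(A(A(B) B) B)` and `right_recursive_tree(3)` yields `A(B A(B A(B)))`.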

trust1995 commented 3 years ago

Thanks a lot for the reply. Things like that really should be written down somewhere :)

h1994st commented 3 years ago

@trust1995

Yeah, I am updating the README in the grammar directory to remind future users. Thanks!

AFLplusplus / Grammar-Mutator

incorrect rule index deduction from ANTLR #28