antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.11k stars 3.28k forks source link

regex '('([^()]*|'('[^()]*')')*')' causes python3 target to infinitely recurse #2572

Open Lyusternik opened 5 years ago

Lyusternik commented 5 years ago

This is simple regex to match nested parentheses. I have attached my simple test grammar and main to demonstrate the issue. Traceback (most recent call last): File "maintest.py", line 14, in main(sys.argv) File "maintest.py", line 11, in main tree = parser.hold() File "/home/lyusternik/dev/misc/volga/testParser.py", line 71, in hold self.enterRule(localctx, 0, self.RULE_hold) File "/home/lyusternik/dev/misc/volga/antlr4/Parser.py", line 366, in enterRule self._ctx.start = self._input.LT(1) File "/home/lyusternik/dev/misc/volga/antlr4/CommonTokenStream.py", line 61, in LT self.lazyInit() File "/home/lyusternik/dev/misc/volga/antlr4/BufferedTokenStream.py", line 186, in lazyInit self.setup() File "/home/lyusternik/dev/misc/volga/antlr4/BufferedTokenStream.py", line 189, in setup self.sync(0) File "/home/lyusternik/dev/misc/volga/antlr4/BufferedTokenStream.py", line 111, in sync fetched = self.fetch(n) File "/home/lyusternik/dev/misc/volga/antlr4/BufferedTokenStream.py", line 123, in fetch t = self.tokenSource.nextToken() File "/home/lyusternik/dev/misc/volga/antlr4/Lexer.py", line 128, in nextToken ttype = self._interp.match(self._input, self._mode) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 97, in match return self.matchATN(input) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 126, in matchATN predict = self.execATN(input, next) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 169, in execATN target = self.computeTargetState(input, s, t) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 227, in computeTargetState self.getReachableConfigSet(input, s.configs, reach, t) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 276, in getReachableConfigSet if self.closure(input, config, reach, currentAltReachedAcceptState, True, treatEofAsEpsilon): File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 356, in closure currentAltReachedAcceptState = self.closure(input, c, configs, currentAltReachedAcceptState, speculative, treatEofAsEpsilon) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 356, in closure currentAltReachedAcceptState = self.closure(input, c, configs, currentAltReachedAcceptState, speculative, treatEofAsEpsilon) File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 356, in closure currentAltReachedAcceptState = self.closure(input, c, configs, currentAltReachedAcceptState, speculative, treatEofAsEpsilon) [Previous line repeated 973 more times] File "/home/lyusternik/dev/misc/volga/antlr4/atn/LexerATNSimulator.py", line 351, in closure configs.add(config) File "/home/lyusternik/dev/misc/volga/antlr4/atn/ATNConfigSet.py", line 78, in add existing = self.getOrAdd(config) File "/home/lyusternik/dev/misc/volga/antlr4/atn/ATNConfigSet.py", line 100, in getOrAdd r = next((cfg for cfg in l if config.equalsForConfigSet(cfg)), None) File "/home/lyusternik/dev/misc/volga/antlr4/atn/ATNConfigSet.py", line 100, in r = next((cfg for cfg in l if config.equalsForConfigSet(cfg)), None) File "/home/lyusternik/dev/misc/volga/antlr4/atn/ATNConfig.py", line 148, in equalsForConfigSet return self==other File "/home/lyusternik/dev/misc/volga/antlr4/atn/ATNConfig.py", line 138, in eq return super().eq(other) File "/home/lyusternik/dev/misc/volga/antlr4/atn/ATNConfig.py", line 65, in eq elif not isinstance(other, ATNConfig): RecursionError: maximum recursion depth exceeded while calling a Python object` maintest.txt test.txt

grammar.txt

ericvergnaud commented 5 years ago

@parrt this also seems to be problematic in Java I just tested it in the IntelliJ preview and it doesn't behave properly (does not produce any output). I suspect there is something illegal in the proposed regex that the tool does not detect?

parrt commented 5 years ago

I think ([^()]*|'('[^()]*')')* is an optional loop with optional first alt, right? I could swear we detect something like this. maybe only if entire lexer rule is optional.