antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Reduce memory usage when parsing lots of different inputs. #4232

Open Changochen opened 1 year ago

Changochen commented 1 year ago

Hi,

I am using the ANTLR C++ runtime to build a fuzzer, which needs to parse a large number of different inputs for one grammar. However, memory usage is too high (over 50 GB) after parsing millions of inputs. I can confirm there is no memory leak on my side; profiling shows that most of the memory is held by ParserATNSimulator.

I checked the existing issues (https://github.com/antlr/antlr4/issues/2182, https://github.com/antlr/antlr4/issues/499), but they don't solve my problem: memory usage stays high, and clearing the cache sometimes makes the parser really slow (completely stuck for a few minutes).

Every time I parse an input, I create a new parser as below:

#include "antlr4-runtime.h"   // plus the generated XXXLexer.h / XXXParser.h

using namespace antlr4;

ANTLRInputStream input(input_program);  // wrap the current fuzz input
XXXLexer lexer(&input);
CommonTokenStream tokens(&lexer);
tokens.fill();                          // eagerly tokenize the whole input
XXXParser parser(&tokens);

Are there any ways to limit the memory usage? Any suggestions are appreciated!
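
For reference, the cache-clearing workaround from the linked issues that I tried looks roughly like this (a sketch only; the clearDFA() calls and the iteration counter reflect my understanding of the C++ runtime API, so please verify against your runtime version):

// Periodically drop the shared DFA cache to bound memory. This is the
// "hacky" workaround: the next inputs re-derive ATN configurations from
// scratch, which is where the multi-minute stalls come from.
// (Sketch only; verify clearDFA() exists in your runtime version.)
if (++iteration % 10000 == 0) {   // hypothetical counter, arbitrary threshold
    lexer.getInterpreter<atn::LexerATNSimulator>()->clearDFA();
    parser.getInterpreter<atn::ParserATNSimulator>()->clearDFA();
}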

jimidle commented 1 year ago

At first glance I would say:

For Go, I will eventually make the prediction cache a tunable LRU cache for this reason. Maybe the other runtimes can copy that idea once I have done it for Go.

Changochen commented 1 year ago

Hi @jimidle, thanks for the suggestions!

Yes, I tried clearing my cache more aggressively. Memory usage is now bounded, but performance is noticeably worse. The parser, lexer, and token stream are freed automatically, so I don't think they are the cause of the growing memory usage (though I do reuse them to avoid allocation overhead).

It would be great if ANTLR provided a systematic way to limit memory usage instead of relying on hacks. A tunable LRU cache sounds great. Will it also be available for the C++ runtime?

jimidle commented 1 year ago

I can't speak for the authors of the other runtimes, I'm afraid, but I am willing to do the work in Go to see if it is worth it. I'll have to build one in Go, but there might be an existing implementation in other languages such as C++ or Java.


mike-lischke commented 1 year ago

I'm afraid nothing you or a target author can do will significantly improve this situation, as it is by design. For every path through the ATN, the parser creates an ATN configuration and caches it. The size of this cache is not limited, and if you constantly feed previously unseen input, the cache grows endlessly. Introducing an upper limit on its size is almost like clearing the cache: the next unseen input requires ATN configurations to be created again and again, even if that input repeats from then on. Without these configurations the parser and lexer cannot do their job.

The only improvement I can think of is to add a timestamp to the configurations and remove the oldest ones once the cache reaches a certain limit. That will only help, though, if the old input does not occur again.

jimidle commented 1 year ago

Mike,

The point is to clean the cache in small increments. What we do now is cache forever and then have to start again from nothing. An LRU cache gives finer control: a middle ground between constantly recreating the cache and growing memory until the system can't handle it.
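
To sketch the idea in generic terms (this is not any ANTLR runtime's code, just an illustration of the data structure): a hash map for O(1) lookup plus a list ordered by recency, where going over the capacity evicts exactly one least-recently-used entry instead of wiping everything.

#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Generic LRU cache sketch (not ANTLR code).
template <typename K, typename V>
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    V *get(const K &key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        items_.splice(items_.begin(), items_, it->second);  // mark most recent
        return &it->second->second;
    }

    void put(const K &key, V value) {
        auto it = index_.find(key);
        if (it != index_.end()) {          // replace an existing entry
            items_.erase(it->second);
            index_.erase(it);
        }
        items_.emplace_front(key, std::move(value));
        index_[key] = items_.begin();
        if (items_.size() > capacity_) {   // evict one entry, not the world
            index_.erase(items_.back().first);
            items_.pop_back();
        }
    }

private:
    std::size_t capacity_;
    std::list<std::pair<K, V>> items_;     // front = most recently used
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
};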


xuanziranhan commented 1 year ago

I'm using Java. Clearing the cache didn't help much with the memory growth, but clearing the DFA every time did, at a heavy cost to performance. I'd like flexible options for the static structures (the cache and the DFA[]); being able to choose which cache/structure is used for DFA states would be helpful.

MarkBHuawei commented 6 months ago

> I'm afraid nothing you or a target author can do will significantly improve this situation, as it is by design. [...] Without such a configuration the parser and lexer cannot do their job.

By design? Suppose I do my parsing at the beginning of the executable and then know I will never invoke ANTLR again. There is no more unseen input left to consider. What is the justification for keeping the cache around in that case? IMO, this use case is extremely common.

mike-lischke commented 6 months ago

@MarkBHuawei If you only parse once, there's no reason to keep the DFA cache, and you can reset it without any bad consequences. But that's not the normal case. Usually a parser is used to parse input continuously (for example, the input for an SQL server or a code transformer).

Parsing without the DFA cache is much slower (often 10 times slower), but doesn't need any extra memory. ANTLR allows you to specify the range of input symbols that are cached. Unfortunately, the lower and upper bounds are hard-coded in the source code, so changing them requires rebuilding your lexer.
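
The bounds in question are the MIN_DFA_EDGE / MAX_DFA_EDGE constants in LexerATNSimulator. From memory, they look like this in the C++ runtime (verify against your runtime's source):

// In LexerATNSimulator (values quoted from memory; check your runtime
// source). Only input symbols in [MIN_DFA_EDGE, MAX_DFA_EDGE] get cached
// DFA edges; anything outside this range always goes through the ATN.
static const size_t MIN_DFA_EDGE = 0;
static const size_t MAX_DFA_EDGE = 127;  // forces unicode to stay in the ATN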

In the new antlr4ng TypeScript runtime I went a different route, allowing these values to be specified at runtime via lexer options. Specifying values like (-1, -1) effectively disables the DFA cache (at the higher parsing cost mentioned previously).