antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Memory leak [antlr4] - Lot of files #1944

Open rollin-s opened 7 years ago

rollin-s commented 7 years ago

Hello !

I'm currently working with ANTLR 4 to parse PL/SQL and T-SQL files. I have to parse 1K to 10K files, but as I parse them my RAM usage increases at the speed of light and the program freezes once there is no RAM left.

I'm generating the lexer and parser this way:

```java
this.parser = new plsqlParser((TokenStream) null);
this.lexer = new plsqlLexer((CharStream) null);

this.lexer.setInterpreter(new LexerATNSimulator(lexer, lexer.getATN(), lexer.getInterpreter().decisionToDFA, new PredictionContextCache()));
this.parser.setInterpreter(new ParserATNSimulator(parser, parser.getATN(), parser.getInterpreter().decisionToDFA, new PredictionContextCache()));
```

and every time I parse another file, I clean the parser and the parser ATN this way:

```java
lexer.reset();
parser.reset();

ParserATNSimulator parserATN = new ParserATNSimulator(parser, parser.getATN(), parser.getInterpreter().decisionToDFA, new PredictionContextCache());

parserATN.clearDFA();
lexer.setInterpreter(new LexerATNSimulator(lexer, lexer.getATN(), lexer.getInterpreter().decisionToDFA, new PredictionContextCache()));
parser.setInterpreter(new ParserATNSimulator(parser, parser.getATN(), parser.getInterpreter().decisionToDFA, new PredictionContextCache()));
```

But this doesn't fix my problem at all; my RAM usage is increasing VERY fast (about 100 MB every 10 seconds).

Is there something I didn't get right?

Thank you for your answer, Sacha

sharwell commented 7 years ago

The amount of memory required to use ANTLR 4 effectively is heavily dependent on the grammar and input. Long lookahead sequences with branching grammars tend to produce the highest memory usage. Some grammars written for readability over speed are known to require many gigabytes of memory in practice. As a first step, try increasing the amount of memory provided to the JVM, e.g. by using `-Xmx12g`. This will get things working, allowing you to separately rewrite the grammar to reduce its memory requirements until they fit within your target usage goals.
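As a quick sanity check that the larger limit actually took effect, you can log the JVM's maximum heap at startup. A minimal sketch (the `HeapCheck` class name and the 12 GB figure are just illustrations):

```java
// Run with e.g.: java -Xmx12g HeapCheck
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reports the most heap the JVM will ever try to use,
        // which reflects the -Xmx setting.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap available to the JVM: " + maxMb + " MB");
    }
}
```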

sharwell commented 7 years ago

:memo: Regarding `parserATN.clearDFA();`, that call is roughly equivalent to `parserATN.pleaseRunVerySlow();`. Like calls to `System.gc()`, it should really be avoided.
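In other words, build the lexer and parser once and only swap the input for each new file, so the DFA cache warmed up on earlier files keeps paying off. A minimal sketch of that loop, assuming ANTLR 4.7+ for `CharStreams`, the generated `plsqlLexer`/`plsqlParser` from the question, a `sql_script` start rule, and a hypothetical `parseOneFile` helper:

```java
import java.io.IOException;
import java.nio.file.Path;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

public class BatchParser {
    // Created once; both keep their internal DFA cache across files.
    private final plsqlLexer lexer = new plsqlLexer(null);
    private final CommonTokenStream tokens = new CommonTokenStream(lexer);
    private final plsqlParser parser = new plsqlParser(tokens);

    // Hypothetical helper: parse one file with the shared lexer/parser.
    public ParseTree parseOneFile(Path path) throws IOException {
        CharStream input = CharStreams.fromPath(path);
        lexer.setInputStream(input);   // also resets the lexer state
        tokens.setTokenSource(lexer);  // drops tokens from the previous file
        parser.setTokenStream(tokens); // also resets the parser state
        return parser.sql_script();    // assumed start rule; use your grammar's
    }
}
```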

rollin-s commented 7 years ago

Oh, I see! I used `parserATN.clearDFA()` to try to optimise memory, but nothing changed! And my RAM was limited by my computer, not by the JVM, so no problem on that side!

The problem is that it will be difficult for me to optimise the grammar. I'm using a 2K-line grammar to parse PL/SQL and Transact-SQL correctly, so I guess I won't be able to optimise memory much further, will I?

So I guess my only choice is to work on a machine with a lot of RAM?

Thank you for your answer! :)

sharwell commented 7 years ago

@rollin-s One option is using my optimized release of ANTLR 4 instead of the standard release. It includes many optimizations to reduce memory usage, which are described here. Only a small handful of grammars need these optimizations, making it rather impractical to go through the substantial effort of migrating the features to each of the various target languages. The default configuration of my release works well for most cases; the option you might want to play with first is the one described under "Tail call elimination".

rollin-s commented 7 years ago

I'm gonna check that! Thanks!

One last question, and I'll leave you alone! I'm parsing a lot of files, but every file is independent, and every time I change files my memory usage still increases. I can't really see where this comes from. I switch files, close the old one, and once its tree has been sent to my DB I stop using it, so some memory should be released, but it's still increasing?

What didn't I get?

Thanks again, and sorry for my poor English! :)

sharwell commented 7 years ago

ANTLR 4 maintains a cache of all the kinds of decisions it has been forced to make over time. When you give it a new file, it may need to make thousands of decisions, nearly all of which are similar to something seen previously, but a small number of new scenarios are encountered. Each time this occurs, those new scenarios are added to the cache.
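If that cache grows without bound on your particular inputs, one possible trade-off (not something the runtime does for you) is to discard it every N files and accept the cost of re-warming it. A minimal sketch, reusing the shared `lexer`/`parser` pair and the hypothetical `parseOneFile` helper from the sketch above; the threshold of 500 and the `filesToParse` list are placeholders:

```java
// Periodically discard the DFA cache to cap memory growth.
// Clearing is expensive (the next files must re-warm the cache),
// so do it rarely, not per file as in the snippet at the top.
int filesSinceClear = 0;
for (Path path : filesToParse) { // filesToParse: your list of inputs
    parseOneFile(path);
    if (++filesSinceClear >= 500) { // arbitrary threshold; tune it
        lexer.getInterpreter().clearDFA();
        parser.getInterpreter().clearDFA();
        filesSinceClear = 0;
    }
}
```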

rollin-s commented 7 years ago

Oh, I get it now! Thanks again, that actually makes sense! You've really helped me! Have a nice day, thank you!