antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

[Cpp] Memory leaks in the C++ runtime #4309

Open ghost opened 1 year ago

ghost commented 1 year ago

I'm developing a compiler for my language in Visual Studio 2022 with ANTLR 4.13.0 (previously Flex & Bison), and the CRT reports memory leaks when the compiler exits, without any allocation source information.

Partial outputs:

Detected memory leaks!
Dumping objects ->
{11750} normal block at 0x0000000000436E90, 128 bytes long.
 Data: <  C       C     > 90 04 43 00 00 00 00 00 90 04 43 00 00 00 00 00 
{11749} normal block at 0x00000000004202B0, 16 bytes long.
 Data: <`iC             > 60 69 43 00 00 00 00 00 00 00 00 00 00 00 00 00 
{11748} normal block at 0x0000000000430490, 24 bytes long.
 Data: <  C       C     > 90 04 43 00 00 00 00 00 90 04 43 00 00 00 00 00 
{11747} normal block at 0x000000000041FDB0, 16 bytes long.
 Data: <HiC             > 48 69 43 00 00 00 00 00 00 00 00 00 00 00 00 00 
{11746} normal block at 0x0000000000436F50, 128 bytes long.
 Data: <  C       C     > D0 09 43 00 00 00 00 00 D0 09 43 00 00 00 00 00 
{11745} normal block at 0x000000000041FB30, 16 bytes long.
 Data: < hC             > E8 68 43 00 00 00 00 00 00 00 00 00 00 00 00 00 
{11744} normal block at 0x00000000004309D0, 24 bytes long.
 Data: <  C       C     > D0 09 43 00 00 00 00 00 D0 09 43 00 00 00 00 00 
{11743} normal block at 0x000000000041F310, 16 bytes long.
 Data: < hC             > D0 68 43 00 00 00 00 00 00 00 00 00 00 00 00 00 
{11742} normal block at 0x0000000000439B90, 128 bytes long.
 Data: <0 C     0 C     > 30 04 43 00 00 00 00 00 30 04 43 00 00 00 00 00 
{11741} normal block at 0x000000000041FBD0, 16 bytes long.
 Data: <phC             > 70 68 43 00 00 00 00 00 00 00 00 00 00 00 00 00 
(...)

I tested statement by statement to make sure my own code does not leak, and found that the leaks appear once the lexer is initialized:

std::string srcPath;
//
// ...
//
std::ifstream fs(srcPath, std::ios::binary);
antlr4::ANTLRInputStream is(fs);

// Memory leaks appear after this statement
SlakeLexer lexer(&is);

antlr4::CommonTokenStream tokens(&lexer);

SlakeParser parser(&tokens);

It seems the lexer does not release its resources properly on destruction. A similar problem is mentioned in another issue: #4099.
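The report above lacks allocation source information. A minimal sketch of how the MSVC debug heap is usually configured to record file and line follows; it assumes a debug build, and `DBG_NEW`, `kHasCrtDebugHeap`, `enableLeakReporting` and `leakOnPurpose` are names made up for this illustration. On other compilers the sketch compiles to a no-op.

```cpp
#include <cassert>

#if defined(_MSC_VER) && defined(_DEBUG)
#define _CRTDBG_MAP_ALLOC  // must be defined before the CRT headers are included
#include <stdlib.h>
#include <crtdbg.h>
// Route `new` through the debug overload that records file and line.
#define DBG_NEW new (_NORMAL_BLOCK, __FILE__, __LINE__)
constexpr bool kHasCrtDebugHeap = true;
#else
#define DBG_NEW new
constexpr bool kHasCrtDebugHeap = false;
#endif

// Hypothetical helper: returns true when the CRT debug heap was configured.
bool enableLeakReporting() {
#if defined(_MSC_VER) && defined(_DEBUG)
    // Track allocations and dump any unfreed blocks automatically at exit.
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);
    return true;
#else
    return false;
#endif
}

// Hypothetical allocation site: if the caller never deletes the result,
// the debug heap can attribute the leaked block to this file and line.
int* leakOnPurpose() {
    int* value = DBG_NEW int(42);
    assert(*value == 42);
    return value;
}
```

Only allocations made through `DBG_NEW` (or `malloc` with `_CRTDBG_MAP_ALLOC` defined) carry a location, so the runtime itself would have to be built this way for its blocks to be attributed.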

ghost commented 1 year ago

Source of the lexer:

lexer grammar SlakeLexer;

COMMA: ',';
QUESTION: '?';
COLON: ':';
SEMICOLON: ';';
LBRACKET: '[';
RBRACKET: ']';
LBRACE: '{';
RBRACE: '}';
LPARENTHESE: '(';
RPARENTHESE: ')';
AT: '@';
DOT: '.';
VARARG: '...';

OP_ADD: '+';
OP_SUB: '-';
OP_MUL: '*';
OP_DIV: '/';
OP_MOD: '%';
OP_AND: '&';
OP_OR: '|';
OP_XOR: '^';
OP_NOT: '!';
OP_REV: '~';
OP_ASSIGN: '=';
OP_ASSIGN_ADD: '+=';
OP_ASSIGN_SUB: '-=';
OP_ASSIGN_MUL: '*=';
OP_ASSIGN_DIV: '/=';
OP_ASSIGN_MOD: '%=';
OP_ASSIGN_AND: '&=';
OP_ASSIGN_OR: '|=';
OP_ASSIGN_XOR: '^=';
OP_ASSIGN_REV: '~=';
OP_ASSIGN_LSH: '<<=';
OP_ASSIGN_RSH: '>>=';
OP_SWAP: '<=>';

OP_EQ: '==';
OP_NEQ: '!=';
OP_STRICTEQ: '===';
OP_STRICTNEQ: '!==';
OP_LSH: '<<';
OP_RSH: '>>';
OP_LT: '<';
OP_GT: '>';
OP_LTEQ: '<=';
OP_GTEQ: '>=';
OP_LAND: '&&';
OP_LOR: '||';
OP_INC: '++';
OP_DEC: '--';
OP_MATCH: '=>';
OP_WRAP: '->';
OP_SCOPE: '::';
OP_DOLLAR: '$';

KW_ASYNC: 'async';
KW_AWAIT: 'await';
KW_BASE: 'base';
KW_BREAK: 'break';
KW_CASE: 'case';
KW_CATCH: 'catch';
KW_CLASS: 'class';
KW_CONST: 'const';
KW_CONTINUE: 'continue';
KW_DELETE: 'delete';
KW_DEFAULT: 'default';
KW_ELIF: 'elif';
KW_ELSE: 'else';
KW_ENUM: 'enum';
KW_FALSE: 'false';
KW_FN: 'fn';
KW_FOR: 'for';
KW_FINAL: 'final';
KW_FINALLY: 'finally';
KW_IF: 'if';
KW_MODULE: 'module';
KW_NATIVE: 'native';
KW_NEW: 'new';
KW_NULL: 'null';
KW_OVERRIDE: 'override';
KW_OPERATOR: 'operator';
KW_PUB: 'pub';
KW_RETURN: 'return';
KW_STATIC: 'static';
KW_STRUCT: 'struct';
KW_SWITCH: 'switch';
KW_THIS: 'this';
KW_THROW: 'throw';
KW_TIMES: 'times';
KW_TRAIT: 'trait';
KW_TYPEOF: 'typeof';
KW_INTERFACE: 'interface';
KW_TRUE: 'true';
KW_TRY: 'try';
KW_USING: 'using';
KW_VAR: 'var';
KW_WHILE: 'while';
KW_YIELD: 'yield';

TN_I8: 'i8';
TN_I16: 'i16';
TN_I32: 'i32';
TN_I64: 'i64';
TN_ISIZE: 'isize';
TN_U8: 'u8';
TN_U16: 'u16';
TN_U32: 'u32';
TN_U64: 'u64';
TN_USIZE: 'usize';
TN_F32: 'f32';
TN_F64: 'f64';
TN_STRING: 'string';
TN_BOOL: 'bool';
TN_AUTO: 'auto';
TN_VOID: 'void';
TN_ANY: 'any';

L_INT: '0b' [01]+ | '0' [0-9]* | '0x' [0-9]+ | [1-9] [0-9]*;
L_UINT: L_INT [uU];
L_LONG: L_INT [lL];
L_ULONG: L_INT ( [uU][lL] | [lL][uU]);
L_F32: L_F64 [fF];
L_F64: [0-9]+ '.' ([0-9]+)?;
L_STRING: '"' CharSequence? '"';
L_RAWSTRING: '"""' (.)*? '"""';

ID: [a-zA-Z_][a-zA-Z0-9_]*;

fragment CharSequence: Char+;
fragment Char: StringEscape | ~["\\\r\n];
fragment StringEscape: SimpleEscape | OctEscape | HexEscape;

fragment SimpleEscape: '\\' [\\"rnt0];
fragment OctEscape: '\\' OctDigit OctDigit OctDigit;
fragment HexEscape: '\\' HexDigit HexDigit;

fragment OctDigit: [0-7];
fragment HexDigit: [0-9a-fA-F];

WHITESPACE: [ \t\r\n]+ -> skip;
COMMENT_BLK: '/*' .*? '*/' -> skip;
COMMENT_LINE: '//' ~ [\r\n]* -> skip;

and content of the input file:

class Base {
    pub i32 data = 0;

    operator new(i32 a) {
        println("Base Constructed");
    }

    operator delete() {
        println("Base Destructed");
    }
}

class Derived(@Base) {
    pub i32 data = 0;

    operator new(i32 a) {
        base.new(a * 2);
        println("Derived Constructed");
    }

    operator delete() {
        println("Derived Destructed");
    }

    pub void printMembers() {
        println("Base data: ", base.data);
        println("Derived data: ", data);
    }
}

pub i32 main() {
    @Base a = new @Base(123);

    return ++a.data;
}

(The parser source is not provided because the parser does not affect the result.)

ghost commented 1 year ago

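I have located where the problem originates, using the demo in runtime/Cpp/demo (https://github.com/antlr/antlr4/tree/dev/runtime/Cpp/demo).

According to the log (the complete log file is at https://github.com/antlr/antlr4/files/11779486/dumped_leaks.log), blocks allocated at the following source locations were never released and are reported as leaks: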

runtime/Cpp/runtime/src/atn/LexerATNSimulator.cpp(192)
runtime/Cpp/runtime/src/atn/LexerATNSimulator.cpp(295)
runtime/Cpp/runtime/src/atn/LexerATNSimulator.cpp(536)
runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp(299)
runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp(465)
runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp(531)
runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp(618)
runtime/Cpp/runtime/src/atn/ParserATNSimulator.cpp(636)
runtime/Cpp/runtime/src/dfa/DFA.cpp(29)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(179)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(182)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(185)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(188)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(191)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(194)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(197)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(200)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(203)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(206)
runtime/Cpp/runtime/src/atn/ATNDeserializer.cpp(212)
runtime/Cpp/runtime/src/atn/ATNDeserializationOptions.cpp(17)
runtime/Cpp/runtime/src/atn/LexerMoreAction.cpp(16)
runtime/Cpp/runtime/src/atn/LexerSkipAction.cpp(16)
runtime/Cpp/runtime/src/atn/LexerPopModeAction.cpp(16)

I currently have no idea how to fix it.

jimidle commented 1 year ago

I think that you are just seeing the lexer deserialize the tables it uses internally for the DFA. This is allocated once for a lexer instance and I guess that it just isn't explicitly released.

It's not a leak in the sense that it will keep growing, though I suppose in the purest sense, it should be released explicitly. Maybe there is a call in there somewhere that will do that, but probably not.
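The distinction drawn here, a one-time allocation that is never freed versus a leak that grows, can be sketched as follows. This is an illustration under assumptions, not the runtime's code: `SharedTables`, `ToyLexer`, `sharedTables` and `g_tableAllocations` are hypothetical names, and the tables are assumed to be shared by all lexer instances.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the deserialized tables used by the lexer.
struct SharedTables {
    std::vector<int> transitions;
};

int g_tableAllocations = 0;  // how many times the tables were built

const SharedTables& sharedTables() {
    // Built on first use and intentionally never deleted. A leak checker
    // reports the block at exit, but its size does not grow over time.
    static const SharedTables* const tables = [] {
        ++g_tableAllocations;
        return new SharedTables{{0, 1, 2}};
    }();
    return *tables;
}

// Hypothetical lexer that only borrows the shared tables.
class ToyLexer {
public:
    ToyLexer() : tables_(sharedTables()) {
        assert(!tables_.transitions.empty());
    }
    std::size_t tableSize() const { return tables_.transitions.size(); }

private:
    const SharedTables& tables_;
};
```

However many `ToyLexer` objects are constructed, the tables are allocated once, so the amount reported at exit stays constant.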

ghost commented 1 year ago

I found that the static data of the lexer and parser (which also holds the DFA caches) is never released: it is allocated in the xxxInitialize functions in the generated lexer and parser source files and never freed.

So I tried using unique_ptr instead of a raw pointer for it (by modifying the codegen template), and most of the leak reports disappeared:

Detected memory leaks!
Dumping objects ->
C:\Users\Pyxherb\Desktop\antlr4\runtime\Cpp\runtime\src\atn\ATNDeserializationOptions.cpp(17) : {434} normal block at 0x000002891574C2C0, 3 bytes long.
 Data: <   > 00 01 00 
Object dump complete.

I now think most of the reports were caused by the unreleased static data.
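The unique_ptr change described above can be sketched roughly like this. It is not the actual codegen template: `LexerStaticData`, `lexerInitialize` and `lexerStaticData` are hypothetical stand-ins for the generated static data and its xxxInitialize function, and the one-time initialization is simplified to a plain check that is not thread-safe.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the generated static data (ATN, DFA caches, ...).
struct LexerStaticData {
    static int liveInstances;  // objects constructed but not yet destroyed
    LexerStaticData() { ++liveInstances; }
    ~LexerStaticData() { --liveInstances; }
};
int LexerStaticData::liveInstances = 0;

namespace {
// With a raw pointer here, the object allocated below would never be deleted.
// The owning pointer frees it during static destruction at program exit.
std::unique_ptr<LexerStaticData> staticData;
}  // namespace

// Hypothetical counterpart of a generated xxxInitialize function.
// Simplified: a plain null check, not thread-safe as written.
void lexerInitialize() {
    if (!staticData) {
        staticData.reset(new LexerStaticData());
    }
}

const LexerStaticData& lexerStaticData() {
    assert(staticData && "call lexerInitialize() first");
    return *staticData;
}
```

Because the pointer has static storage duration, its destructor runs at program exit, before the CRT leak dump, which is presumably why the reports disappear.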

liu876151990 commented 1 year ago

This needs to be modified! Could you revise it and submit the change? Thanks.

ATNDeserializationOptions.cpp

const ATNDeserializationOptions& ATNDeserializationOptions::getDefaultOptions() {
  static const ATNDeserializationOptions* const defaultOptions = new ATNDeserializationOptions();
  return *defaultOptions;
}
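One way to avoid the heap allocation in a function like this is a function-local static object. The sketch below uses a hypothetical, simplified `Options` class rather than the real ATNDeserializationOptions, and it is not necessarily the change that was eventually submitted.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the options class.
class Options {
public:
    Options() : verify_(true) {}
    bool verify() const { return verify_; }
    static const Options& getDefault();

private:
    bool verify_;
};

const Options& Options::getDefault() {
    // Function-local static: constructed on the first call (thread-safe
    // since C++11) and destroyed at program exit, so nothing stays on the heap.
    static const Options defaultOptions;
    return defaultOptions;
}

// Usage: every call returns a reference to the same object.
bool defaultIsShared() {
    const Options& first = Options::getDefault();
    const Options& second = Options::getDefault();
    assert(first.verify());
    return &first == &second;
}
```

The trade-off is that the object is destroyed during static destruction, so code running in other static destructors must not use it afterwards; the `new`-and-never-delete pattern avoids that ordering problem at the cost of a reported block.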

ghost commented 1 year ago

Fixed memory leaks in ATNDeserializationOptions.