earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License
741 stars 74 forks source link

Add missing TagData_dealloc #303

Closed declerambaul closed 1 year ago

declerambaul commented 1 year ago

Fix for the memory leak that causes an issue on certain wikitext (e.g. lots of nested tags, often vandalism), see https://github.com/earwig/mwparserfromhell/issues/286.

I didn't investigate the algorithm itself, and while the leak was due to a missing dealloc, it still takes a long time to parse these problematic wikitexts. I am attaching some details below, it might be helpful if somebody has a look at how to optimize the search tree algorithm itself.

Created a script that parses a single "bad" wikitext (the one from the linked issue).

wikitext = open('bad.txt').read()
from mwparserfromhell.parser._tokenizer import CTokenizer
tok = CTokenizer()
tok.tokenize(wikitext)

The valgrind logs point to the problematic method (PYTHONMALLOC=malloc valgrind --leak-check=yes --track-origins=yes --log-file=valgrind-log.txt python helling.py).

==17814== 24,718,216 (2,038,992 direct, 22,679,224 indirect) bytes in 42,479 blocks are definitely lost in loss record 8,248 of 8,248
==17814==    at 0x483877F: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==17814==    by 0x8DBF003: TagData_new (tag_data.c:39)
==17814==    by 0x8DC3BBC: Tokenizer_really_parse_tag (tok_parse.c:1765)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC2249: Tokenizer_parse_bold (tok_parse.c:2015)
==17814==    by 0x8DC2249: Tokenizer_parse_style (tok_parse.c:2176)
==17814==    by 0x8DC2249: Tokenizer_parse (tok_parse.c:3006)
==17814==    by 0x8DC2249: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
==17814==    by 0x8DC4198: Tokenizer_parse_tag (tok_parse.c:1910)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2997)
==17814==    by 0x8DC13F9: Tokenizer_parse (tok_parse.c:2858)
earwig commented 1 year ago

Thank you!

elukey commented 11 months ago

@declerambaul thanks a lof from the ML Team @ Wikimedia, we were investigating a memory leak in our model servers using mwparserfromhell and once we found the culprit it was really nice to see that a fix was already in place and ready to go. It would have taken ages to us to provide a fix like yours!

Thanks also to @earwig for the quick review/release cycle!