earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

Bug: massive memory usage for parsing a revision #286

Open geohci opened 2 years ago

geohci commented 2 years ago

We've been parsing a lot of historical Wikipedia revisions and are running into some weird edge cases. One of them is the revision below, which seems to somehow escape the recursion depth limit and consumes a massive amount of memory when parsed. Obviously a patch would be great, but I'm also happy to hear any ideas for easily killing the parse before its memory usage gets out of hand (without knowing the problematic revision IDs in advance).

Details:

The wikitext itself isn't particularly long but it's a bunch of HTML tags all nested within links etc.: https://simple.wikipedia.org/w/index.php?title=2000s_(decade)&action=edit&oldid=69784

Code to replicate -- it very quickly consumed memory and was killed on PAWS (3 GB RAM max), but I'm told it gets up to about 30 GB if that much is available:

import mwapi
import mwparserfromhell

session = mwapi.Session('https://simple.wikipedia.org', user_agent='<contact info>')

def get_wikitext(revision_id):
    """Fetch the raw wikitext of a single revision via the MediaWiki API."""
    return (session.get(
        action='query',
        prop='revisions',
        rvprop='content',
        rvslots='*',
        revids=revision_id,
        formatversion='2'
    ))['query']['pages'][0]['revisions'][0]['slots']['main']['content']

bad_wikitext = get_wikitext(69784)
print(len(bad_wikitext))  # 80360
wikicode = mwparserfromhell.parse(bad_wikitext)  # blows up
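
In case it helps anyone hitting the same thing: below is a rough sketch (not anything mwparserfromhell provides) of one way to bound the damage without knowing the bad revision IDs in advance -- attempt the parse in a child process with a hard memory cap and a time budget, and only parse for real if that attempt succeeds. The resource module is Unix-only, and the cap, the timeout, and the cost of parsing twice are all arbitrary trade-offs here.

import multiprocessing as mp
import resource  # Unix-only

import mwparserfromhell


def _try_parse(wikitext, ok, max_bytes):
    # Hard-cap the child's address space so a runaway parse fails with a
    # MemoryError instead of eating all of the machine's RAM.
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))
    try:
        mwparserfromhell.parse(wikitext)
        ok.value = 1
    except MemoryError:
        ok.value = 0


def parse_if_safe(wikitext, max_bytes=2 * 1024**3, timeout=120):
    """Parse only after a memory-capped child process has managed it first."""
    ok = mp.Value('b', 0)
    proc = mp.Process(target=_try_parse, args=(wikitext, ok, max_bytes))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():  # over the time budget: kill the child and give up
        proc.terminate()
        proc.join()
    if not ok.value:     # also stays 0 if the child died without reporting back
        return None      # skip revisions that blow up
    return mwparserfromhell.parse(wikitext)  # parse again in this process

Parsing everything twice is wasteful, but it might be an acceptable price for not having the whole batch job OOM-killed by one revision.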
lahwaacz commented 2 years ago

Related to #40

Is there some unclosed tag or wiki markup in that revision?

geohci commented 2 years ago

Is there some unclosed tag or wiki markup in that revision?

Thanks for chiming in. Yes, I do think that's what is going on, but my understanding is that the max recursion depth (code) should prevent these unclosed tags from consuming so much memory? Perhaps it's because the wikitext has so many of them in this example (>1000 <big> tags and only about 300 closing </big> tags)? So in addition to the max recursion depth (which I assume is per item), should there also be a global max?

I guess that hints at my cheap fix too (count the opening and closing tags before parsing and make sure they aren't too uneven), but that's not an easy check to add and it has all sorts of edge cases (e.g., some tags, like <br>, don't have equivalent closing tags).
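
For what it's worth, here's roughly what that cheap pre-check might look like -- the tag pattern, the list of void tags, and the imbalance threshold are all guesses, and it clearly won't catch everything:

import re

# Tags that never take a closing tag, so an "unmatched" opener is expected.
# Rough guess at the relevant ones, not an exhaustive list.
VOID_TAGS = {'br', 'hr', 'img', 'wbr'}

TAG_RE = re.compile(r'<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)')


def looks_parseable(wikitext, max_imbalance=200):
    """Crude pre-check: reject text with far more opening tags than closing ones."""
    opened = closed = 0
    for closing, name in TAG_RE.findall(wikitext):
        if name.lower() in VOID_TAGS:
            continue
        if closing:
            closed += 1
        else:
            opened += 1
    return abs(opened - closed) <= max_imbalance

On the revision above, the ~700 extra <big> tags would trip this check and we could just skip the parse, but I'd rather not have to guess at a threshold.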

earwig commented 2 years ago

I guess the runaway memory usage is from our cache of which combinations of tag locations are known to be invalid. Probably we're storing that in a very inefficient way and doing unnecessary backtracking.

Thanks for the bug report and sorry for being unresponsive lately. I'll think more about this.

geohci commented 2 years ago

That explanation makes sense. So decreasing the recursion depth would probably help, but it's a blunt solution for something that's really about how the parser manages the possible combinations in these complicated, unclosed-tag situations. Thanks in advance for taking a look!