earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License
741 stars 74 forks source link

Strange template parsing bug in deeply nested scenario #313

Closed davidebbo closed 8 months ago

davidebbo commented 8 months ago

Using version 0.6.5.

I'm running into a puzzling issue in a nested scenario. The code below is a complete repro case. For some reason, it thinks that p15 and p21 are properties on the same template, when p21 belongs to a child template.

Interesting observations:

I started from a far more complex repro case, and this is as much as I could simplify it.

import mwparserfromhell

input = """
{{tpl
  |p1={{tpl
    |p2={{tpl
      |p3={{tpl
        |p4={{tpl
          |p5={{tpl
            |p6={{tpl
              |p7={{tpl
                |p8={{tpl
                  |p9={{tpl
                    |p10={{tpl
                      |p11={{tpl
                        |p12={{tpl
                          |p13={{tpl
                            |p14={{tpl
                              |p15={{tpl
                                |p16={{tpl
                                  |p17={{tpl
                                    |p18={{tpl
                                      |p19={{tpl
                                        |p20={{tpl }}
                                      }}
                                    }}
                                  }}
                                }}
                                |p21=1
                                }}
                              |p22=1}} }} }} }} }} }} }} }} }} }} }} }} }} }} }}
"""

wikicode = mwparserfromhell.parse(input)
template = wikicode.filter_templates(matches=lambda t: t.has_param("p15"))[0]

# It ends up with ['p15', 'p21'], which is not correct, as p21 is more deeply nested than p15 (p21 is with p16)
print([p.name for p in template.params])
davidebbo commented 8 months ago

I debugged it and found the problem: it's exceeding the MAX_DEPTH, which is hard coded at 40, both in the C tokenizer and the Python tokenizer.

Any chance of bumping up the limit, or making it configurable? Note that this is not an artificial scenario, as it is occurring in https://en.wikipedia.org/w/index.php?title=Ankylosauria, in the second cladogram. This particular page would need a depth limit of 60. Thanks!

Side suggestion: it may be best to hard fail by default when the max depth is reached, to make it easier to understand what's going on. But for now I'll be perfectly happy with a limit bump :)

davidebbo commented 8 months ago

I sent a PR and set the limit to 100, which appears to be what mediawiki is now using.

earwig commented 8 months ago

Thanks for looking into this! For context, the choice of 40 came from https://m.mediawiki.org/wiki/Manual:$wgMaxTemplateDepth. As you say, this has been set to 100 since 2022, so we should update it. I’ll review your PR when I get home.On Jan 3, 2024, at 3:39 AM, David Ebbo @.***> wrote: I sent a PR and set the limit to 100, which appears to be what mediawiki is now using.

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>