earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

Parsing bug causing filter_templates() to ignore some templates #274

Closed SandKingg closed 3 years ago

SandKingg commented 3 years ago

There is a bug in the parser where '{{' is sometimes not tokenized as TemplateOpen(), so filter_templates() does not recognize the template.

To reproduce, parse the wikitext of https://en.wikipedia.org/wiki/Monitor_lizard and call filter_templates() on the result: the 'automatic taxobox' template is completely ignored.

JJMC89 commented 3 years ago

It is because of the unclosed italic markup inside the template: the ''[[Varanus_(Hapturosaurus)|Hapturosaurus]] list item never closes its ''.

You'll need to fix the markup or use skip_style_tags=True when parsing.

>>> import pywikibot
>>> import mwparserfromhell
>>> site = pywikibot.Site('en', 'wikipedia')
>>> page = pywikibot.Page(site, 'Monitor lizard')
>>> wc = mwparserfromhell.parse(page.text, skip_style_tags=True)
>>> wc.filter_templates()[1]
"{{Automatic taxobox\n | name = Monitor lizard\n | fossil_range = [[Miocene]] to present\n | image = Lace Monitor in Tamborine National Park, Cedar Creek Falls, Queensland, Australia.jpg\n | image_caption = [[Lace monitor]] (''Varanus varius'')\n | taxon = Varanus\n | authority = [[Blasius Merrem|Merrem]], 1820\n | type_species = ''[[Varanus varius]]''\n | type_species_authority = [[George Shaw|Shaw]], 1790\n | subdivision_ranks = Subgenera\n | subdivision = {{plainlist|\n* ''Empagusia''\n* ''Euprepiosaurus''\n* ''[[Odatria]]''\n* ''[[Varanus_(Hapturosaurus)|Hapturosaurus]]\n* ''[[Varanus salvadorii|Papusaurus]]''\n* ''Philippinosaurus''\n* ''Polydaedalus''\n* ''Psammosaurus''\n* ''[[Varanus spinulosus|Solomonsaurus]]''\n* ''Soterosaurus''\n* ''Varanus''\n}}\n | range_map = Worldwidevaranus.PNG\n | range_map_caption = Combined native range of all the monitor lizards\n}}"
SandKingg commented 3 years ago

I never noticed that, thanks for pointing it out. At least it's an easy fix!
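
For locating this kind of markup error in a long page, a rough sketch of a line-by-line check follows. The function name lines_with_unbalanced_quotes is hypothetical and not part of mwparserfromhell. The heuristic only counts runs of two, three, or five apostrophes and does not reproduce MediaWiki's full apostrophe handling, so treat its output as a list of lines to inspect by hand.

```python
import re


def lines_with_unbalanced_quotes(wikitext):
    """Return (line_number, line) pairs whose italic/bold markers look unpaired.

    Heuristic only (an assumption, not MediaWiki's exact rules): a run of
    2 apostrophes toggles italics, 3 toggles bold, 5 toggles both.
    Runs of any other length are ignored.
    """
    suspicious = []
    for number, line in enumerate(wikitext.splitlines(), start=1):
        italic = bold = 0
        for run in re.findall(r"'{2,}", line):
            if len(run) in (2, 5):
                italic += 1
            if len(run) in (3, 5):
                bold += 1
        # An odd count means a marker was opened but never closed on this line.
        if italic % 2 or bold % 2:
            suspicious.append((number, line))
    return suspicious


# Excerpt of the list items from the template output quoted in this thread.
sample = "\n".join([
    "* ''Empagusia''",
    "* ''[[Odatria]]''",
    "* ''[[Varanus_(Hapturosaurus)|Hapturosaurus]]",
    "* ''[[Varanus salvadorii|Papusaurus]]''",
])

for number, line in lines_with_unbalanced_quotes(sample):
    print(number, line)
# prints: 3 * ''[[Varanus_(Hapturosaurus)|Hapturosaurus]]
```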