earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

Parser misses the infobox on the current Lombardy page (possibly because of a comment in the name?) #267

Closed by ramayer 2 years ago

ramayer commented 3 years ago

The mwparserfromhell parser is missing some infoboxes, such as the one on the current Lombardy page (https://en.wikipedia.org/wiki/Lombardy).

I suspect it's because someone put a comment before the infobox's first field, like this:

{{Infobox settlement 
 <!-- See Template:Infobox settlement for additional fields and descriptions --> | name                            = Lombardy
 | official_name                   =  
 | native_name                     = {{native name|it|Lombardia}}<br/>{{lang|lmo|Lombardia}}
 | native_name_lang                =  
 | settlement_type                 = [[Region of Italy]] 
 ...
}}

This is the code I used. The table tmp_wikipedia contains just the original title and body from last week's Wikipedia dump.

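import mwparserfromhell
# `spark` is assumed to be an existing SparkSession (as in a PySpark shell or notebook).
lombardy = spark.sql('''select body from tmp_wikipedia where title = 'Lombardy' limit 1''').take(1)[0].asDict(True)
parsed = mwparserfromhell.parse(lombardy['body'])
parsed.filter_templates()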

The result is every template on the page except the infobox (which is arguably the most interesting template on the page).
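A smaller reproduction that does not need Spark or the dump might look like the sketch below. It is only an illustration: the SAMPLE text is a shortened stand-in for the real page, and normalized_name() and find_infoboxes() are hypothetical helper names, not part of mwparserfromhell. The helper strips HTML comments from the raw template name before comparing, so a comment placed after "Infobox settlement" should not hide the template from the check.

```python
import re

# Matches HTML comments such as <!-- ... -->, which may span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def normalized_name(raw_name: str) -> str:
    """Drop HTML comments and surrounding whitespace from a raw template name."""
    return COMMENT_RE.sub("", raw_name).strip()


def find_infoboxes(wikitext: str) -> list:
    """Return the parsed templates whose cleaned-up name starts with 'Infobox'."""
    # Imported here so normalized_name() stays usable without the parser installed.
    import mwparserfromhell

    parsed = mwparserfromhell.parse(wikitext)
    return [
        tpl
        for tpl in parsed.filter_templates()
        if normalized_name(str(tpl.name)).lower().startswith("infobox")
    ]


# Shortened, hand-written stand-in for the Lombardy wikitext (not the real page).
SAMPLE = """{{Infobox settlement
<!-- See Template:Infobox settlement for additional fields and descriptions -->
| name = Lombardy
| settlement_type = [[Region of Italy]]
}}
'''Lombardy''' is a region of Italy.{{citation needed}}"""
```

Calling find_infoboxes(SAMPLE) and printing each result's name is a quick way to see whether the installed parser version picks up the infobox at all.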

ramayer commented 3 years ago

I see other issues where skip_style_tags=True is a workaround, but it didn't help in this case.

I modified my code to try:

parsed = mwparserfromhell.parse(lombardy['body'], skip_style_tags=True)
parsed.filter_templates()

I still don't see the infobox from the Lombardy page.

earwig commented 3 years ago

What version of mwparserfromhell are you using, and what revision ID of Lombardy are you trying to load? I don't have any problem parsing the infobox with the latest parser version on the current revision of that page.
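Both pieces of information can be gathered with a few lines of Python. The following is a rough sketch, not an official recipe: the helper names and the User-Agent string are made up for illustration, and it assumes the standard MediaWiki Action API on en.wikipedia.org queried with formatversion=2. Fetching by revision ID pins the exact wikitext, so a report does not depend on whatever the page looks like today.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def revision_query_url(rev_id: int) -> str:
    """Build an Action API URL that requests the wikitext of one revision."""
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": str(rev_id),
        "rvprop": "ids|content",
        "rvslots": "main",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_wikitext(response: dict) -> str:
    """Pull the main-slot wikitext out of a formatversion=2 query response."""
    page = response["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


def fetch_revision(rev_id: int) -> str:
    """Download the wikitext for rev_id (performs a network request)."""
    request = urllib.request.Request(
        revision_query_url(rev_id),
        # Hypothetical User-Agent; replace with one that identifies your tool.
        headers={"User-Agent": "infobox-repro-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_wikitext(json.load(reply))


def parser_version() -> str:
    """Return the installed mwparserfromhell version string."""
    import mwparserfromhell

    return mwparserfromhell.__version__
```

Reporting parser_version() together with the revision ID passed to fetch_revision() gives enough detail for someone else to parse the same text with the same parser.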

ramayer commented 2 years ago

Thanks. It took a while for me to try again, but it's working for me now.