earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

`filter_templates` does not find matches when template has a linebreak #293

Open Gonzalo933 opened 1 year ago

Gonzalo933 commented 1 year ago

When filtering templates, if the template name contains a line break, there is no way to select it with the matches parameter.

For example, given this text from the source of https://en.wikipedia.org/wiki/Cholera:

text = """
{{short description|Bacterial infection of the small intestine}}
{{About|the bacterial disease|the dish|Cholera (food)}}
{{pp-vandalism|small=yes}}
{{Infobox medical condition (new)
| name            = Cholera
| synonyms        = Asiatic cholera, epidemic cholera<ref name=textbook/>
| image           = PHIL 1939 lores.jpg
| caption         = A person with severe [[dehydration]] due to cholera, causing sunken eyes and wrinkled hands and skin.
| field           = [[Infectious disease (medical specialty)|Infectious disease]]
| symptoms        = Large amounts of watery [[diarrhea]], [[vomiting]], [[muscle cramps]]<ref name=WHO2010 /><ref name=CDC2015Pro />
| complications   = [[Dehydration]], [[electrolyte imbalance]]<ref name=WHO2010 />
| onset           = 2 hours to 5 days after exposure<ref name=CDC2015Pro />
| duration        = A few days<ref name=WHO2010 />
| causes          = ''[[Vibrio cholerae]]'' spread by [[fecal-oral route]]<ref name=WHO2010 /><ref name=Fink2016 />
| risks           = Poor [[sanitation]], not enough clean [[drinking water]], [[poverty]]<ref name=WHO2010 />
| diagnosis       = [[Stool test]]<ref name=WHO2010 />
| differential    =
| prevention      = Improved sanitation, [[drinking water|clean water]], [[hand washing]], [[cholera vaccine]]s<ref name=WHO2010 /><ref name=Lancet2012 />
| treatment       = [[Oral rehydration therapy]], [[zinc supplementation]], [[intravenous fluids]], [[antibiotics]]<ref name=WHO2010 /><ref name=CDC2014Zinc />
| medication      =
| prognosis       = Less than 1% mortality rate with proper treatment, untreated mortality rate 50-60%
| frequency       = 3–5&nbsp;million people a year<ref name=WHO2010 />
| deaths          = 28,800 (2015)<ref name=GBD2015De/>
}}

'''Cholera''' is an [[infection]] of the [[small intestine]] by some [[strain (biology)|strains]] of the [[Bacteria|bacterium]] ''[[Vibrio cholerae]]''.<ref name=Fink2016>{{cite book |last1=Finkelstein |first1=Richard A. |chapter=Cholera, ''Vibrio cholerae'' O1 and O139, and Other Pathogenic Vibrios |pmid=21413330 |id={{NCBIBook2|NBK8407}} |editor1-last=Baron |editor1-first=Samuel |title=Medical Microbiology |date=1996 |publisher=University of Texas Medical Branch at Galveston |isbn=978-0-9631172-1-2 |edition=4th }}</ref><ref name=CDC2015Pro /> Symptoms may range from none, to mild, to severe.<ref name=CDC2015Pro>{{cite web|title=Cholera – Vibrio cholerae infection Information for Public Health & Medical Professionals|url=https://www.cdc.gov/cholera/healthprofessionals.html|publisher=[[Centers for Disease Control and Prevention]]|access-date=17 March 2015|date=January 6, 2015|url-status=live|archive-url=https://web.archive.org/web/20150320052724/http://www.cdc.gov/cholera/healthprofessionals.html|archive-date=20 March 2015}}</ref> The classic symptom is large amounts of watery [[diarrhea]] that lasts a few days.<ref name=WHO2010 /> [[Vomiting]] and [[muscle cramps]] may also occur.<ref name=CDC2015Pro /> Diarrhea can be so severe that it leads within hours to severe [[dehydration]] and [[electrolyte imbalance]].<ref name=WHO2010 /> This may result in [[Enophthalmia|sunken eyes]], cold skin, decreased skin elasticity, and wrinkling of the hands and feet.<ref name=Lancet2012>{{cite journal | vauthors = Harris JB, LaRocque RC, Qadri F, Ryan ET, Calderwood SB | title = Cholera | journal = Lancet | volume = 379 | issue = 9835 | pages = 2466–2476 | date = June 2012 | pmid = 22748592 | pmc = 3761070 | doi = 10.1016/s0140-6736(12)60436-x }}</ref> Dehydration can cause the skin to turn [[cyanosis|bluish]].<ref>{{cite book|last1=Bailey|first1=Diane  | name-list-style = vanc |title=Cholera|date=2011|publisher=Rosen Pub.|location=New 
York|isbn=978-1-4358-9437-2|page=7|edition=1st|url=https://books.google.com/books?id=7rvLPx33GPgC&pg=PA7|url-status=live|archive-url=https://web.archive.org/web/20161203190215/https://books.google.com/books?id=7rvLPx33GPgC&pg=PA7|archive-date=2016-12-03}}</ref> Symptoms start two hours to five days after exposure.<ref name=CDC2015Pro />
"""

Some templates are correctly found:

import mwparserfromhell

print([t.name for t in mwparserfromhell.parse(text).filter_templates()])

['short description', 'About', 'pp-vandalism', 'Infobox medical condition (new)\n', 'cite book ', 'NCBIBook2', 'cite web', 'cite journal ', 'cite book']

But when I try to match the "Infobox medical condition (new)" template, the filter returns nothing:

mwparserfromhell.parse(text).filter_templates(matches="Infobox medical condition (new)")

[]

lahwaacz commented 1 year ago

From the documentation:

matches can be used to further restrict the nodes, either as a function (taking a single Node and returning a boolean) or a regular expression (matched against the node’s string representation with re.search()). If matches is a regex, the flags passed to re.search() are re.IGNORECASE, re.DOTALL, and re.UNICODE, but custom flags can be specified by passing flags.
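
To make the quoted behavior concrete, here is a minimal sketch using only the standard re module. It assumes the filter runs re.search() on the template's full text with the flags listed above; node_text is a shortened, hypothetical stand-in for the infobox, not the library's internal representation.

```python
import re

# Hypothetical stand-in for str(template): the full wikitext of the node,
# including the line break that follows the template name.
node_text = "{{Infobox medical condition (new)\n| name = Cholera\n}}"

# Flags the documentation says are passed to re.search() by default.
flags = re.IGNORECASE | re.DOTALL | re.UNICODE

pattern = "Infobox medical condition (new)"

# Unescaped, "(new)" is a capturing group, so the regex looks for
# "Infobox medical condition new" and finds nothing.
unescaped = re.search(pattern, node_text, flags)

# re.escape() turns the parentheses into literals, so the search succeeds
# even though a "\n" follows the name.
escaped = re.search(re.escape(pattern), node_text, flags)

print(unescaped is None)
print(escaped is not None)
```

Because re.search() looks for the pattern anywhere in the text, characters after the name, including a line break, do not by themselves block a match; regex metacharacters in the pattern can.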

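So your matches="Infobox medical condition (new)" is taken as a regex and searched against the whole template text. The trailing \n in the template name does not stop re.search() from matching, but the unescaped parentheses do: (new) is parsed as a regex group, so the pattern looks for "condition new" rather than the literal "(new)". Escaping the string, for example with re.escape(), makes the regex match. Note that this is different from the Wikicode.matches method, which compares names after stripping surrounding whitespace. To filter with the latter, use: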

mwparserfromhell.parse(text).filter_templates(matches=lambda template: template.name.matches("Infobox medical condition (new)"))