earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

How to keep formulas when parsing #318

Closed maxjeblick closed 6 months ago

maxjeblick commented 6 months ago

I noticed that the Wikipedia parser at https://huggingface.co/datasets/wikipedia/blob/main/wikipedia.py deletes formulas such as
<math>a + bi</math>, for example in the article text "... every complex number can be expressed in the form <math>a + bi</math>, where ..."

I'd like to keep these formulas while otherwise cleaning the text the same way the original script does. I tried to modify the section.strip_code() part below but wasn't able to include the formulas correctly. Any help appreciated!

Minimal example:

import mwparserfromhell  # needed by the call at the bottom of this example
import pywikibot

site = pywikibot.Site('en', 'wikipedia')  # The site we want to run our bot on
page = pywikibot.Page(site, 'Complex_number')

# this is adapted from https://huggingface.co/datasets/wikipedia/blob/main/wikipedia.py
import bz2
import codecs
import json
import re
import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9
from urllib.parse import quote

MEDIA_ALIASES = dict()
CAT_ALIASES = dict()

def parse_and_clean_wikicode(raw_content, parser, language):
    """Strips formatting and unwanted sections from raw page content."""
    wikicode = parser.parse(raw_content)

    # Filters for magic words that are parser instructions -- e.g., __NOTOC__
    re_rm_magic = re.compile("__[A-Z]*__", flags=re.UNICODE)

    # Filters for file/image links.
    media_prefixes = "|".join(["File", "Image", "Media"] + MEDIA_ALIASES.get(language, []))
    re_rm_wikilink = re.compile(f"^(?:{media_prefixes}):", flags=re.IGNORECASE | re.UNICODE)

    def rm_wikilink(obj):
        return bool(re_rm_wikilink.match(str(obj.title)))

    # Filters for references and tables
    def rm_tag(obj):
        return str(obj.tag) in {"ref", "table"}

    # Leave category links in-place but remove the category prefixes
    cat_prefixes = "|".join(["Category"] + CAT_ALIASES.get(language, []))
    re_clean_wikilink = re.compile(f"^(?:{cat_prefixes}):", flags=re.IGNORECASE | re.UNICODE)

    def is_category(obj):
        return bool(re_clean_wikilink.match(str(obj.title)))

    def clean_wikilink(obj):
        text = obj.__strip__()
        text = re.sub(re_clean_wikilink, "", text)
        obj.text = text

    def try_replace_obj(obj):
        try:
            clean_wikilink(obj)
        except ValueError:
            # For unknown reasons, objects are sometimes not found.
            pass

    def try_remove_obj(obj, section):
        try:
            section.remove(obj)
        except ValueError:
            # For unknown reasons, objects are sometimes not found.
            pass

    section_text = []
    # Filter individual sections to clean.
    for section in wikicode.get_sections(flat=True, include_lead=True, include_headings=True):
        for obj in section.ifilter_wikilinks(recursive=True):
            if rm_wikilink(obj):
                try_remove_obj(obj, section)
            elif is_category(obj):
                try_replace_obj(obj)
        for obj in section.ifilter_tags(matches=rm_tag, recursive=True):
            try_remove_obj(obj, section)

        section_text.append(re.sub(re_rm_magic, "", section.strip_code().strip()))
    return "\n\n".join(section_text)

print(parse_and_clean_wikicode(page.text, mwparserfromhell, language="en"))

which gives

In mathematics, a complex number is an element of a number system that extends the real numbers with a specific element denoted , called the imaginary unit and satisfying the equation ; every complex number can be expressed in the form , where  and  are real numbers. Because no real number satisfies the above equation,  was called an imaginary number by René Descartes. For the complex number   is called the , and  is called the . The set of complex numbers is denoted by either of the symbols  or . Despite the historical nomenclature "imaginary", complex numbers are regarded in the mathematical sciences as just as "real" as the real numbers and are fundamental in many aspects of the scientific description of the natural world.
....
maxjeblick commented 6 months ago

I noticed that math formulas are excluded by the hardcoded PARSER_BLACKLIST. It would be nice if that list were configurable through some common settings that can be changed from the user side.

earwig commented 6 months ago

You don't actually want to change PARSER_BLACKLIST; being on that list means the tag's contents are not parsed as wikicode (which, for math, they aren't), so the parser doesn't get confused during the initial parse.

What you want to do is remove math from INVISIBLE_TAGS, just below where you linked to. That doesn't affect the initial parse, just the behavior of strip_code.

Adding more configuration to strip_code to allow customizing the visible tags (or generically whether a given node should be visible) is a good idea. In the meantime, calling mwparserfromhell.definitions.INVISIBLE_TAGS.remove("math") somewhere in your code—while inelegant—should do what you want:

>>> import mwparserfromhell
>>> mwparserfromhell.definitions.INVISIBLE_TAGS.remove("math")
>>> c = mwparserfromhell.parse("foo <math>a + b</math> bar")
>>> c.strip_code()
'foo a + b bar'
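
The call above mutates a module-level collection, so every later strip_code() call in the same process is affected. If that matters, one option is to scope the change. The sketch below is not part of mwparserfromhell: temporarily_without is a hypothetical helper that removes an entry from a mutable list or set and puts it back on exit. It assumes only that INVISIBLE_TAGS supports `in` and `.remove()`, as the snippet above implies.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_without(collection, item):
    """Remove `item` from a mutable list or set, restoring it on exit.

    Hypothetical helper, not a mwparserfromhell API.
    """
    present = item in collection
    if present:
        collection.remove(item)
    try:
        yield collection
    finally:
        # Restore only if we removed it and nothing re-added it meanwhile.
        if present and item not in collection:
            if hasattr(collection, "add"):   # set-like
                collection.add(item)
            else:                            # list-like; goes back at the end
                collection.append(item)


# Intended use, assuming mwparserfromhell is installed:
#
#   import mwparserfromhell
#   from mwparserfromhell.definitions import INVISIBLE_TAGS
#
#   with temporarily_without(INVISIBLE_TAGS, "math"):
#       text = mwparserfromhell.parse("foo <math>a + b</math> bar").strip_code()
```

For a list the entry is re-appended at the end rather than at its original position, which should not matter if the collection is only used for membership checks.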
maxjeblick commented 6 months ago

Thanks a lot for the quick answer! Will close, as the workaround is sufficient for my purposes.