earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

mwparserfromhell

.. image:: https://img.shields.io/coveralls/earwig/mwparserfromhell/main.svg
   :alt: Coverage Status
   :target: https://coveralls.io/r/earwig/mwparserfromhell

mwparserfromhell (the MediaWiki Parser from Hell) is a Python package that provides an easy-to-use and outrageously powerful parser for MediaWiki_ wikicode. It supports Python 3.8+.

Developed by Earwig_ with contributions from `Σ`_, Legoktm_, and others. Full documentation is available on ReadTheDocs_. Development occurs on GitHub_.

Installation

The easiest way to install the parser is through the `Python Package Index`_; you can install the latest release with ``pip install mwparserfromhell`` (`get pip`_). Make sure your pip is up-to-date first, especially on Windows.

Alternatively, get the latest development version:

.. code-block:: sh

    git clone https://github.com/earwig/mwparserfromhell.git
    cd mwparserfromhell
    python setup.py install

The comprehensive unit testing suite requires pytest_ (``pip install pytest``) and can be run with ``python -m pytest``.

Usage

Normal usage is rather straightforward (where ``text`` is page text):

.. code-block:: python

    >>> import mwparserfromhell
    >>> wikicode = mwparserfromhell.parse(text)

``wikicode`` is a ``mwparserfromhell.Wikicode`` object, which acts like an ordinary ``str`` object with some extra methods. For example:

.. code-block:: python

    >>> text = "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
    >>> wikicode = mwparserfromhell.parse(text)
    >>> print(wikicode)
    I has a template! {{foo|bar|baz|eggs=spam}} See it?
    >>> templates = wikicode.filter_templates()
    >>> print(templates)
    ['{{foo|bar|baz|eggs=spam}}']
    >>> template = templates[0]
    >>> print(template.name)
    foo
    >>> print(template.params)
    ['bar', 'baz', 'eggs=spam']
    >>> print(template.get(1).value)
    bar
    >>> print(template.get("eggs").value)
    spam
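
As a further illustration of the ``name`` and ``value`` attributes used above, the following minimal sketch collects a template's parameters into a plain dictionary. The helper name ``template_params`` is hypothetical and not part of the library; positional parameters are assumed to appear under their numeric names (``1``, ``2``, and so on).

.. code-block:: python

    import mwparserfromhell


    def template_params(template):
        """Map each parameter name to its value, both as stripped strings.

        Hypothetical helper for illustration; not part of mwparserfromhell.
        """
        return {
            str(param.name).strip(): str(param.value).strip()
            for param in template.params
        }


    wikicode = mwparserfromhell.parse(
        "I has a template! {{foo|bar|baz|eggs=spam}} See it?"
    )
    template = wikicode.filter_templates()[0]
    params = template_params(template)
    print(params)

Converting to ``str`` discards the ``Wikicode`` structure of each value, so this form suits read-only inspection rather than editing.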

Since nodes can contain other nodes, getting nested templates is trivial:

.. code-block:: python

    >>> text = "{{foo|{{bar}}={{baz|{{spam}}}}}}"
    >>> mwparserfromhell.parse(text).filter_templates()
    ['{{foo|{{bar}}={{baz|{{spam}}}}}}', '{{bar}}', '{{baz|{{spam}}}}', '{{spam}}']

You can also pass ``recursive=False`` to ``filter_templates()`` and explore templates manually. This is possible because nodes can contain additional ``Wikicode`` objects:

.. code-block:: python

    >>> code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
    >>> print(code.filter_templates(recursive=False))
    ['{{foo|this {{includes a|template}}}}']
    >>> foo = code.filter_templates(recursive=False)[0]
    >>> print(foo.get(1).value)
    this {{includes a|template}}
    >>> print(foo.get(1).value.filter_templates()[0])
    {{includes a|template}}
    >>> print(foo.get(1).value.filter_templates()[0].get(1).value)
    template
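
The manual exploration above can be generalized into a small recursive walk. This is only a sketch: ``walk_templates`` is a hypothetical helper, and it descends through parameter values only, so templates nested inside template names, parameter names, or other node types are deliberately not visited.

.. code-block:: python

    import mwparserfromhell


    def walk_templates(code, depth=0):
        """Yield (depth, name) for templates reachable via parameter values.

        Hypothetical helper for illustration; not part of mwparserfromhell.
        """
        for template in code.filter_templates(recursive=False):
            yield depth, str(template.name).strip()
            for param in template.params:
                # Each parameter value is itself a Wikicode object.
                yield from walk_templates(param.value, depth + 1)


    code = mwparserfromhell.parse("{{foo|this {{includes a|template}}}}")
    found = list(walk_templates(code))
    for depth, name in found:
        print("  " * depth + name)

Tracking the depth explicitly is the main thing the default recursive filtering does not give you; if depth is not needed, the plain ``filter_templates()`` call is simpler.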

Templates can be easily modified to add, remove, or alter params. ``Wikicode`` objects can be treated like lists, with ``append()``, ``insert()``, ``remove()``, ``replace()``, and more. They also have a ``matches()`` method for comparing page or template names, which takes care of capitalization and whitespace:

.. code-block:: python

    >>> text = "{{cleanup}} '''Foo''' is a [[bar]]. {{uncategorized}}"
    >>> code = mwparserfromhell.parse(text)
    >>> for template in code.filter_templates():
    ...     if template.name.matches("Cleanup") and not template.has("date"):
    ...         template.add("date", "July 2012")
    ...
    >>> print(code)
    {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{uncategorized}}
    >>> code.replace("{{uncategorized}}", "{{bar-stub}}")
    >>> print(code)
    {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
    >>> print(code.filter_templates())
    ['{{cleanup|date=July 2012}}', '{{bar-stub}}']
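
The example above only adds a parameter. Altering and removing parameters might look like the following sketch, which uses a made-up ``reason`` parameter and assumes that ``add()`` overwrites the value when the name already exists:

.. code-block:: python

    import mwparserfromhell

    # Illustrative input; the "reason" parameter is invented for this sketch.
    code = mwparserfromhell.parse("{{cleanup|date=July 2012|reason=unclear}}")
    template = code.filter_templates()[0]

    # Calling add() with an existing name replaces that parameter's value.
    template.add("date", "August 2012")

    # has() guards the removal in case the parameter is absent.
    if template.has("reason"):
        template.remove("reason")

    result = str(code)
    print(result)

Because ``template`` is a node inside ``code``, the changes show up when ``code`` is converted back to a string.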

You can then convert ``code`` back into a regular ``str`` object (for saving the page!) by calling ``str()`` on it:

.. code-block:: python

    >>> text = str(code)
    >>> print(text)
    {{cleanup|date=July 2012}} '''Foo''' is a [[bar]]. {{bar-stub}}
    >>> text == code
    True

Limitations

While the MediaWiki parser generates HTML and has access to the contents of templates, among other things, mwparserfromhell acts as a direct interface to the source code only. This has several implications:

Additionally, the parser lacks awareness of certain wiki-specific settings:

Integration

mwparserfromhell is used by and originally developed for EarwigBot_; ``Page`` objects have a ``parse`` method that essentially calls ``mwparserfromhell.parse()`` on ``page.get()``.

If you're using Pywikibot_, your code might look like this:

.. code-block:: python

    import mwparserfromhell
    import pywikibot

    def parse(title):
        site = pywikibot.Site()
        page = pywikibot.Page(site, title)
        text = page.get()
        return mwparserfromhell.parse(text)

If you're not using a library, you can parse any page with the following Python 3 code (using the API_ and the requests_ library):

.. code-block:: python

    import mwparserfromhell
    import requests

    API_URL = "https://en.wikipedia.org/w/api.php"

    def parse(title):
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "rvlimit": 1,
            "titles": title,
            "format": "json",
            "formatversion": "2",
        }
        headers = {"User-Agent": "My-Bot-Name/1.0"}
        req = requests.get(API_URL, headers=headers, params=params)
        res = req.json()
        revision = res["query"]["pages"][0]["revisions"][0]
        text = revision["slots"]["main"]["content"]
        return mwparserfromhell.parse(text)
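
The function above indexes straight into the response, so a title that does not exist would raise an exception. A more defensive variant of the extraction step could be sketched as follows. ``extract_wikitext`` is a hypothetical helper, the sample responses are hand-written rather than real API output, and the ``missing`` flag is assumed to be how nonexistent pages are reported with ``formatversion=2``.

.. code-block:: python

    def extract_wikitext(res):
        """Return the main-slot wikitext from a decoded response, or None.

        Hypothetical helper for illustration. Assumes the response shape
        produced by the query parameters shown above.
        """
        pages = res.get("query", {}).get("pages", [])
        if not pages or pages[0].get("missing"):
            return None
        revisions = pages[0].get("revisions", [])
        if not revisions:
            return None
        return revisions[0]["slots"]["main"]["content"]


    # Hand-written sample responses, not real API output.
    found = {
        "query": {
            "pages": [
                {
                    "title": "Foo",
                    "revisions": [
                        {"slots": {"main": {"content": "{{cleanup}} '''Foo'''"}}}
                    ],
                }
            ]
        }
    }
    missing = {"query": {"pages": [{"title": "Nope", "missing": True}]}}

    print(extract_wikitext(found))
    print(extract_wikitext(missing))

The returned string can then be passed to ``mwparserfromhell.parse()`` as in the original function, after checking for ``None``.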

.. _MediaWiki: https://www.mediawiki.org
.. _ReadTheDocs: https://mwparserfromhell.readthedocs.io
.. _Earwig: https://en.wikipedia.org/wiki/User:The_Earwig
.. _Σ: https://en.wikipedia.org/wiki/User:%CE%A3
.. _Legoktm: https://en.wikipedia.org/wiki/User:Legoktm
.. _GitHub: https://github.com/earwig/mwparserfromhell
.. _Python Package Index: https://pypi.org/
.. _get pip: https://pypi.org/project/pip/
.. _pytest: https://docs.pytest.org/
.. _Word-ending links: https://www.mediawiki.org/wiki/Help:Links#linktrail
.. _EarwigBot: https://github.com/earwig/earwigbot
.. _Pywikibot: https://www.mediawiki.org/wiki/Manual:Pywikibot
.. _API: https://www.mediawiki.org/wiki/API:Main_page
.. _requests: https://2.python-requests.org