geohci / edit-types

Edit diffs and type detection for Wikipedia
MIT License

mwedittypes

Edit diffs and type detection for Wikipedia. The goal is to transform unstructured edits to Wikipedia articles into a structured summary of what actions were taken in the edit. The library has two major formats (and associated algorithms):

Installation

You can install mwedittypes with pip:

$ pip install mwedittypes

Example

If one revision of wikitext is as follows:

{{Short description|Austrian painter}}
'''Karl Josef Aigen''' (8 October 1684 – 22 October 1762) was a landscape painter, born at Olomouc.

and a second revision of wikitext is as follows:

{{Short description|Austrian landscape painter}}
'''Karl Josef Aigen''' (8 October 1684 – 22 October 1762) was a landscape painter, born at [[Olomouc]].

The changes that happened would be:

This repository would return this in the following structure:

Basic Usage

Simple:

>>> from mwedittypes import SimpleEditTypes
>>> prev_wikitext = '{{Short description|Austrian painter}}'
>>> curr_wikitext = '{{Short description|Austrian [[landscape painter]]}}'
>>> et = SimpleEditTypes(prev_wikitext, curr_wikitext, lang='en')
>>> et.get_diff()
{'Wikilink': {'insert': 1}, 'Template': {'change': 1}, 'Section': {'change': 1}}

Structured:

>>> from mwedittypes import StructuredEditTypes
>>> prev_wikitext = '{{Short description|Austrian painter}}'
>>> curr_wikitext = '{{Short description|Austrian [[landscape painter]]}}'
>>> et = StructuredEditTypes(prev_wikitext, curr_wikitext, lang='en')
>>> et.get_diff()
{'context': [Context(type='Section', edittype='change', count=1)],
 'node-edits': [NodeEdit(type='Wikilink', edittype='insert', section='0: Lede', name='landscape painter',
                         changes=[('title', None, 'landscape painter')]),
                NodeEdit(type='Template', edittype='change', section='0: Lede', name='Short description',
                         changes=[('parameter', ('1', 'Austrian painter'), ('1', 'Austrian [[landscape painter]]'))])],
 'text-edits': []}

In most cases (~90%), the two approaches agree in their overall results. They differ in the following situations:

A good example of a diff where they vary in outputs is revision 1107840666 on English Wikipedia (diff; model output).

Language Coverage

Almost everything in this library is language-agnostic and so works consistently for any language edition of Wikipedia.

For links, namespace identification varies by language, but we use a list of prefixes that covered all languages at the time of generation.

Sentences are somewhat challenging because we must build a list of sentence-ending punctuation that covers all languages. We believe we have done a good job of this but have not explicitly tested it. The list can be found in mwedittypes/constants.py under SENTENCE_BREAKS_REGEX.

Words are the most challenging aspect and the one place where you will see varying behavior. For them we take two strategies:

Known Issues

Wikitext/language is verrrrrrry complicated and so there are certain things we can't feasibly extract consistently. The ones we know about:

For links, we assume that if the prefix is not for media or a category, the link is a wikilink to namespace 0. This is generally reasonable for current versions of Wikipedia articles but would overload the Wikilink class with e.g., user page links on talk pages or interwiki links for older versions of articles.
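The prefix-based assumption described above can be sketched roughly as follows. This is an illustration only, not the library's code: `MEDIA_PREFIXES`, `CATEGORY_PREFIXES`, and `classify_link` are hypothetical names, and the prefix sets here are English-only stand-ins for the multilingual lists the library uses.

```python
# Hypothetical, English-only prefix sets (assumption); the real lists are multilingual.
MEDIA_PREFIXES = {"file", "image", "media"}
CATEGORY_PREFIXES = {"category"}


def classify_link(target: str) -> str:
    """Classify a link target by its namespace prefix (illustrative sketch)."""
    prefix, sep, _ = target.strip().partition(":")
    prefix = prefix.strip().lower()
    if sep and prefix in MEDIA_PREFIXES:
        return "Media"
    if sep and prefix in CATEGORY_PREFIXES:
        return "Category"
    # Everything else is assumed to be a namespace-0 wikilink, including
    # prefixes such as "User:" that are really other namespaces.
    return "Wikilink"


labels = {
    target: classify_link(target)
    for target in ["Olomouc", "File:Aigen.jpg", "Category:Austrian painters", "User:Example"]
}
```

In this sketch `User:Example` ends up labeled as a Wikilink, which is the kind of overloading the known issue describes.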

Development

We are happy to receive contributions, though we will default to keeping the code here relatively general (not overly customized to individual use cases). Please reach out or open an issue describing the changes you would like to merge so that we can discuss them beforehand.

Code Summary -- StructuredEditTypes

The code for computing diffs and running edit-type detection can be found in two files:

While the diffing/counting is not trivial, the trickiest part of the process is correctly parsing the wikitext into nodes (Templates, Wikilinks, etc.). This is almost all done via the amazing mwparserfromhell library with a few tweaks in the tree differ:

To describe the scale of textual changes in an edit accurately but efficiently, the node differ also uses some regexes and heuristics to estimate how much text changed. Text is generally the toughest part of diffing, but because we do not need to describe the diff visually, only estimate how much changed, we can use relatively simple methods. We break text changes down into five categories and identify how much of each changed: paragraphs, sentences, words, punctuation, and whitespace.
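A minimal sketch of this idea, assuming much simpler patterns than the library actually uses: the regexes below only handle whitespace-delimited text with `.`, `!`, and `?` as sentence endings, and `tokenize` and `text_change_scale` are hypothetical names introduced for illustration. Each category is treated as a bag of tokens, and the differences between the bags give the scale of the change.

```python
import re
from collections import Counter


def tokenize(text: str) -> dict:
    """Split text into bags of tokens for the five categories (simplified)."""
    paragraphs = [p for p in re.split(r"\n{2,}", text) if p.strip()]
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)  # assumption: Latin-script sentence endings only
        if s.strip()
    ]
    return {
        "Paragraph": Counter(paragraphs),
        "Sentence": Counter(sentences),
        "Word": Counter(re.findall(r"\w+", text)),
        "Punctuation": Counter(re.findall(r"[^\w\s]", text)),
        "Whitespace": Counter(re.findall(r"\s+", text)),
    }


def text_change_scale(prev_text: str, curr_text: str) -> dict:
    """Count inserted and removed tokens per category between two texts."""
    prev_bags, curr_bags = tokenize(prev_text), tokenize(curr_text)
    summary = {}
    for category, prev_bag in prev_bags.items():
        removed = sum((prev_bag - curr_bags[category]).values())
        inserted = sum((curr_bags[category] - prev_bag).values())
        if removed or inserted:
            summary[category] = {"insert": inserted, "remove": removed}
    return summary


scale = text_change_scale(
    "He was a painter, born at Olomouc.",
    "He was a landscape painter, born at Olomouc.",
)
```

Because only counts are compared, this kind of approach never has to align the two texts, which is what keeps it cheap. Note that the sketch reports a modified sentence as one removal plus one insertion rather than as a single change.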

Code Summary -- SimpleEditTypes

The code for computing diffs and running edit-type detection can be found in one file: mwedittypes/simple_differ.py.

The bulk of the library parses a wikitext document into a bag of nodes (Templates, Wikilinks, etc.). This uses largely the same parsing approach as StructuredEditTypes.

The diffing component simply takes the symmetric difference of the nodes associated with each wikitext document to identify what has changed and then summarizes the counts.
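The symmetric-difference step can be sketched with `collections.Counter`, assuming the nodes have already been parsed into `(node_type, wikitext)` pairs. The function name `simple_diff` is hypothetical, and pairing a removal with an insertion of the same node type into a single "change" is an assumption made here to mirror the 'change' counts shown in Basic Usage, not a confirmed description of the library's logic.

```python
from collections import Counter


def simple_diff(prev_nodes, curr_nodes):
    """Summarize edit types from two bags of (node_type, wikitext) pairs."""
    prev_bag, curr_bag = Counter(prev_nodes), Counter(curr_nodes)

    # The two halves of the symmetric difference, tallied per node type.
    removed, inserted = Counter(), Counter()
    for (node_type, _), count in (prev_bag - curr_bag).items():
        removed[node_type] += count
    for (node_type, _), count in (curr_bag - prev_bag).items():
        inserted[node_type] += count

    summary = {}
    for node_type in set(removed) | set(inserted):
        # Assumption: one removal plus one insertion of the same type is one change.
        changed = min(removed[node_type], inserted[node_type])
        counts = {
            "change": changed,
            "insert": inserted[node_type] - changed,
            "remove": removed[node_type] - changed,
        }
        summary[node_type] = {k: v for k, v in counts.items() if v}
    return summary


prev_nodes = [("Template", "{{Short description|Austrian painter}}")]
curr_nodes = [
    ("Template", "{{Short description|Austrian [[landscape painter]]}}"),
    ("Wikilink", "[[landscape painter]]"),
]
result = simple_diff(prev_nodes, curr_nodes)
```

Since the bags carry no position information, a sketch like this cannot tell a moved node from an untouched one; that is part of what the structured approach adds.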

Testing

The tests for components are contained within the tests directory. They can be run via pytest. We are not even close to full coverage yet, given the numerous node types (template, text, etc.), the four actions (insert/remove/change/move), and varying languages (e.g., for Text or Category/Media nodes), but we are working on expanding coverage.

Releases

When a release is ready, there are a few simple steps to take:

Troubleshooting:

Documentation