5j9 / wikitextparser

A Python library to parse MediaWiki WikiText
GNU General Public License v3.0

plain_text() error on some pages #90

Closed oktie closed 3 years ago

oktie commented 3 years ago

I tried wikitextparser on a recent Wikinews dump. plain_text() fails on a few pages, whereas mwparserfromhell's strip_code() works on all of them.

Here is an example to reproduce the error:

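import pywikibot
import wikitextparser

# Fetch the raw wikitext of one of the failing pages from English Wikinews.
en_wikinews = pywikibot.Site('en', 'wikinews')
title = "NASCAR's Earnhardt Jr Signs 5-year Contract with Hendrick Motorsports"
text = pywikibot.Page(en_wikinews, title).get()

# This call raises the AttributeError shown in the traceback.
print(wikitextparser.parse(text).plain_text())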

With pywikibot 5.6.0 and wikitextparser 0.47.0, I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/venv/lib/python3.7/site-packages/wikitextparser/_wikitext.py", line 607, in plain_text
    for i in parsed.get_bolds_and_italics():
  File "~/venv/lib/python3.7/site-packages/wikitextparser/_wikitext.py", line 987, in get_bolds_and_italics
    self._bolds_italics_recurse(result, filter_cls)
  File "~/venv/lib/python3.7/site-packages/wikitextparser/_wikitext.py", line 945, in _bolds_italics_recurse
    filter_cls=filter_cls, recursive=False):
  File "~/venv/lib/python3.7/site-packages/wikitextparser/_wikitext.py", line 970, in get_bolds_and_italics
    rs, re = self._relative_contents_end
  File "~/venv/lib/python3.7/site-packages/wikitextparser/_tag.py", line 208, in _relative_contents_end
    return self._match.span('contents')
AttributeError: 'NoneType' object has no attribute 'span'
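
As a stopgap when processing a whole dump, a wrapper along these lines might keep the run going by catching the AttributeError from the traceback and falling back to mwparserfromhell. This is only a sketch: the helper names (plain_text_with_fallback, wtp_plain_text, mwpfh_strip_code) are hypothetical and not part of either library, and it assumes the failure always surfaces as an AttributeError, which I have only seen on the pages mentioned here.

```python
from typing import Callable, Optional


def plain_text_with_fallback(
    wikitext: str,
    primary: Callable[[str], str],
    fallback: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return primary(wikitext); on AttributeError use fallback, or None."""
    try:
        return primary(wikitext)
    except AttributeError:
        # Assumption: the parser failure shows up as AttributeError,
        # as in the traceback above. Other exceptions still propagate.
        if fallback is None:
            return None
        return fallback(wikitext)


def wtp_plain_text(wikitext: str) -> str:
    # Imported lazily so the wrapper itself has no hard dependency.
    import wikitextparser
    return wikitextparser.parse(wikitext).plain_text()


def mwpfh_strip_code(wikitext: str) -> str:
    import mwparserfromhell
    return mwparserfromhell.parse(wikitext).strip_code()
```

With both libraries installed, plain_text_with_fallback(text, wtp_plain_text, mwpfh_strip_code) would use wikitextparser where it works and mwparserfromhell on the pages that fail. Passing no fallback returns None instead, which makes it easy to collect the titles of failing pages.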