earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

KeyError on strip_code() #288

Closed: Mmdixon closed this issue 2 years ago

Mmdixon commented 2 years ago

The code:

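import mwparserfromhell
# Minimal input: the entity-like sequence from the sentence quoted below
text = "4&000nbsp;[[msnm]]"
wikicode = mwparserfromhell.parse(text)
wikicode.strip_code()  # raises KeyError: '000nbsp' on 0.6.4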

The problematic text I ran into is on this page: https://es.wikipedia.org/?curid=5006152. Note the sequence &000nbsp; near the end of this sentence:

Se destaca la formación de bosque de ''[[Polylepis]]'', ''qiwuña'', "quinoa" o "árbol de papel", el cual tiene entre 8 y 10 [[metro|m]] de altura, y crece a la orilla de las lagunas o quebradas y en lugares rocosos y es la única especie de árbol por encima de 4&000nbsp;[[msnm]].

The error:

File ".../extract.py", line 637, in main
    wikicode.strip_code()
  File ".../lib/python3.8/site-packages/mwparserfromhell/wikicode.py", line 666, in strip_code
    stripped = node.__strip__(**kwargs)
  File ".../lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 61, in __strip__
    return self.normalize()
  File ".../lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 157, in normalize
    return chr(htmlentities.name2codepoint[self.value])
KeyError: '000nbsp'

The environment:

Python 3.8.10
mwparserfromhell 0.6.4

I expected no exception to be raised. If the text is problematic, it should be ignored or parsed differently.
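
For anyone who cannot upgrade yet, one possible stopgap is to catch the exception around the call. This is only a sketch: `strip_code_or_none` is a hypothetical helper name, not part of mwparserfromhell, and it assumes that skipping a page which fails to strip is acceptable for your pipeline.

```python
def strip_code_or_none(wikicode, **kwargs):
    """Call wikicode.strip_code() and return None if it raises KeyError.

    Hypothetical helper, not a library function. `wikicode` is expected to
    be the object returned by mwparserfromhell.parse(text).
    """
    try:
        return wikicode.strip_code(**kwargs)
    except KeyError:
        # An entity name such as '000nbsp' was not found in the entity table;
        # the caller decides whether to skip or log this page.
        return None
```

A caller would then treat a `None` result as "could not strip this page" and move on, rather than letting a single malformed entity stop the whole run.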

earwig commented 2 years ago

Thank you for the bug report! The problem was specific to extra leading zeros at the start of an HTML entity name. Now fixed in f963df7f824c2dbcc6bc159e1dba57b082b062fe.
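
To illustrate the failure mode in the traceback above, here is a minimal standard-library sketch. It is not the change made in the commit; `normalize_named_entity` is a hypothetical function that only shows why a name such as `000nbsp` is missing from the entity table, and one defensive way a lookup could fall back to the literal text.

```python
from html.entities import name2codepoint


def normalize_named_entity(name: str) -> str:
    """Return the character for a named HTML entity.

    Hypothetical illustration only: unknown names (for example '000nbsp')
    are returned as literal text instead of raising KeyError.
    """
    codepoint = name2codepoint.get(name)
    if codepoint is None:
        # Not a known entity name, so keep the original source text.
        return f"&{name};"
    return chr(codepoint)
```

With this sketch, `nbsp` maps to the non-breaking space character, while `000nbsp` comes back unchanged as `&000nbsp;`.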

appledora commented 1 year ago

@earwig is the fix available in the pip release too? I also ran into this issue recently. Thanks to @Mmdixon for the timely report!

NikitaKononov commented 1 year ago

Hello, the problem still exists in the pip release mwparserfromhell==0.6.4. I ran into it just now.

earwig commented 1 year ago

Thanks so much for your patience. 0.6.5 is now released on PyPI with this fix.
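
To check whether an installed copy already includes the fix, something like the following sketch could be used. It assumes a simple dotted version string; `release_tuple` and `has_entity_fix` are hypothetical helper names, and pre-release suffixes are ignored rather than ordered properly.

```python
import re
from importlib import metadata

# First PyPI release that contains the fix, per the comment above.
FIXED_IN = (0, 6, 5)


def release_tuple(version: str) -> tuple:
    """Collect the leading numeric components of a dotted version string."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def has_entity_fix(version: str) -> bool:
    """Return True if the version is 0.6.5 or later (suffixes ignored)."""
    return release_tuple(version) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("mwparserfromhell")
    except metadata.PackageNotFoundError:
        print("mwparserfromhell is not installed")
    else:
        print(installed, "has fix:", has_entity_fix(installed))
```

If the check reports that the fix is missing, upgrading with `pip install --upgrade mwparserfromhell` should bring in 0.6.5 or newer.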