earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

C tokenizer exited with BAD_ROUTE while parsing a page with a lot of tables. #248

Closed. halfak closed this issue 4 years ago

halfak commented 4 years ago

I get a ParserError (C tokenizer exited with BAD_ROUTE) while parsing https://pt.wikipedia.org/wiki/Lista_de_finais_para_cadeirantes_do_US_Open. The page contains a lot of tables, which might be related.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mwapi
>>> doc = mwapi.Session("https://pt.wikipedia.org").get(action='query', prop='revisions', revids=[57906793], rvslots=["main"], rvprop=["content"], formatversion=2)
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> text = doc['query']['pages'][0]['revisions'][0]['slots']['main']['content']
>>> import mwparserfromhell
>>> mwparserfromhell.__version__
'0.5.4'
>>> mwparserfromhell.parse(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/halfak/venv/3.5/lib/python3.5/site-packages/mwparserfromhell/utils.py", line 58, in parse_anything
    return Parser().parse(value, context, skip_style_tags)
  File "/home/halfak/venv/3.5/lib/python3.5/site-packages/mwparserfromhell/parser/__init__.py", line 93, in parse
    tokens = self._tokenizer.tokenize(text, context, skip_style_tags)
mwparserfromhell.parser.ParserError: This is a bug and should be reported. Info: C tokenizer exited with BAD_ROUTE.

earwig commented 4 years ago

This bug was already fixed in the development branch. I'll publish a release (soon).

halfak commented 4 years ago

Thank you!

tbayer commented 4 years ago

I just encountered the same bug for the same ptwiki page ;) Looking forward to the new release.
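
Until the fixed release is available, one possible workaround is to catch the ParserError for each page so that a single bad page does not stop a larger run. The following is a minimal sketch, not something taken from the library: the helper name parse_or_none is hypothetical, and it takes the parse function and the exception type as arguments so it can be tried without mwparserfromhell installed. The commented usage assumes the exception is importable as mwparserfromhell.parser.ParserError, which is the name shown in the traceback above.

```python
import logging

logger = logging.getLogger(__name__)


def parse_or_none(text, parse, errors):
    """Return parse(text), or None if parsing raises one of `errors`.

    `parse_or_none` is a hypothetical helper, not part of mwparserfromhell.
    `errors` may be a single exception class or a tuple of classes.
    Any exception not listed in `errors` is left to propagate.
    """
    try:
        return parse(text)
    except errors as exc:
        # Log and skip this page instead of aborting the whole run.
        logger.warning("Skipping unparseable text (%d chars): %s", len(text), exc)
        return None


# Assumed usage with mwparserfromhell (requires the library to be installed):
#
#   import mwparserfromhell
#   from mwparserfromhell.parser import ParserError
#
#   wikicode = parse_or_none(text, mwparserfromhell.parse, ParserError)
#   if wikicode is None:
#       ...  # fall back to the raw text or skip the page
```

This only hides the failure for the affected page; it does not fix the tokenizer bug, so upgrading once the new release is published is still the real solution.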