5j9 / wikitextparser

A Python library to parse MediaWiki WikiText
GNU General Public License v3.0
289 stars, 22 forks

Infinite loop with CR (\x0d) in table parsing #107

Closed (TrueBrain closed this issue 2 years ago)

TrueBrain commented 2 years ago

In one of my test sets I forgot to sanitize the input, and as a result I had a lone \r (without any \n). This caused a funny effect I thought you might want to know about.

import wikitextparser
wikitextparser.parse("{|\n}\n\r").get_tables()[0].data()

This causes an infinite loop. The same happens if you replace \r with \x0b or \x0c, but that is even more nonsensical, of course.

https://github.com/5j9/wikitextparser/blob/f64e098b0ba040595f6fc427edf6409308761bd0/wikitextparser/_table.py#L94 returns -1, after which _lstrip_increase increases that back to 0, and it repeats.
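To make that mechanism concrete, here is a toy sketch of the same pattern. It is not the code from _table.py; find_line_starts, its guard flag, and the max_steps cap are all hypothetical names made up for illustration. It only assumes that a "find next newline" call returns -1 when nothing is left and that the result is then incremented by one.

```python
from typing import List


def find_line_starts(data: bytes, guard: bool, max_steps: int = 8) -> List[int]:
    """Toy line scanner (hypothetical, not wikitextparser's real code).

    Records every position the scanner visits. Without the guard, a failed
    search (-1) plus one wraps the position back to 0 and the scan restarts.
    max_steps caps the loop so the demonstration always terminates.
    """
    positions = []
    pos = 0
    for _ in range(max_steps):
        positions.append(pos)
        newline_at = data.find(b"\n", pos)  # -1 when no newline remains
        if guard and newline_at == -1:
            break  # the fix: stop instead of wrapping around
        pos = newline_at + 1  # unguarded: -1 + 1 == 0, back to the start
        if pos >= len(data):
            break
    return positions


# Trailing "\r" leaves one byte after the last "\n", so the search fails there.
print(find_line_starts(b"{|\n}\n\r", guard=False))  # cycles until the cap
print(find_line_starts(b"{|\n}\n\r", guard=True))   # stops at the failed search
```

Without the cap, the unguarded variant would never leave the loop, which matches the behaviour described above.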

Personally, I think this is not a bug in your library, as a string ending in a lone \r is just weird. But I didn't want to keep this finding from you either, just in case I am missing something else here :)

I also found a possibly related issue. For example:

import wikitextparser
wikitextparser.parse("{|\n}\n").get_tables()[0].data()

triggers:

IndexError: bytearray index out of range on line 93 of _table.py. Sadly, this is text that users of TrueWiki have been entering, but I can catch that error on my side. Just mentioning it, as there might be something else going on here :)

As always, tnx for the awesome library! :D

5j9 commented 2 years ago

I fixed the case where \r caused an infinite loop in tables. However, \r is not supported in general, and I'm sure it will cause parsing issues in other areas. The IndexError issue was already fixed in the previous version.
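Since \r is not supported in general, one option on the caller's side is to normalize line breaks before handing text to the parser. The sketch below is an assumption about how that could look, not something the library provides: normalize_line_breaks is a hypothetical helper, and mapping \x0b and \x0c to \n is a choice made here (Python's str.splitlines also treats them as line boundaries).

```python
import re

# "\r\n" must come first so a Windows line ending becomes one "\n", not two.
_UNSUPPORTED_BREAKS = re.compile(r"\r\n|[\r\x0b\x0c]")


def normalize_line_breaks(text: str) -> str:
    """Replace CRLF, lone CR, VT (\\x0b) and FF (\\x0c) with a plain "\\n".

    Hypothetical helper; whether VT/FF should become newlines or be dropped
    depends on the input source.
    """
    return _UNSUPPORTED_BREAKS.sub("\n", text)


# The input from this report, with its trailing lone "\r" normalized.
cleaned = normalize_line_breaks("{|\n}\n\r")
print(repr(cleaned))
```

The cleaned string can then be passed to wikitextparser.parse as usual.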

TrueBrain commented 2 years ago

You are the best, thank you for these fixes :D