5j9 / wikitextparser

A Python library to parse MediaWiki WikiText
GNU General Public License v3.0

Tables parsing hangs forever #22

Closed: Paikan closed this issue 4 years ago

Paikan commented 4 years ago

Hi @5j9, and first of all thank you for this great project.

I have an issue with table parsing: it hangs forever. The issue is easy to reproduce with the following gist: https://gist.github.com/Paikan/5b8e3e7939553d7848f9c6dc34daebb2

Thanks

5j9 commented 4 years ago

Hi. Honestly, I could not figure out what was going on in the underlying regex library. I implemented a quick fix, hoping it won't break anything else. (I think it would be better to rewrite that part of the code using some other method, but I don't have enough free time to do it right now.)
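For readers wondering how a regex call can appear to hang at all: the thread does not say what the actual cause was, but one common culprit in backtracking engines is catastrophic backtracking. The sketch below uses the standard library `re` module and a made-up pattern (not taken from wikitextparser) to show how the time for a failed match can grow very quickly with input length.

```python
import re
import time

# Illustrative pattern only, not from wikitextparser.
# Nested quantifiers let the engine split a run of "a" in many different ways.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n 'a' characters plus a 'b'."""
    text = "a" * n + "b"  # the trailing "b" makes the overall match impossible
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # "$" can never be reached past the "b"
    return elapsed


if __name__ == "__main__":
    # Keep n small: each extra character can roughly double the work.
    for n in range(14, 21, 2):
        print(n, f"{time_failed_match(n):.4f}s")
```

If the timings climb steeply as `n` grows, a long enough input would look like a hang even though the engine is still working.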

A new version with the fix is now available on PyPI. Thanks for the bug report. Let me know if you find any others.

Paikan commented 4 years ago

Thanks a lot for your responsiveness and hard work.

I can confirm that the problem is fixed in the new release, and I hope the fix won't break anything else.

I am about to test this version against the full Wikipedia dumps for English and French. I will let you know if anything goes wrong.
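For a run over full dumps, one way to keep a single hanging page from stalling everything is to parse each page in a child interpreter with a time limit. This is only a sketch under assumptions: `run_with_timeout` and `CHILD_CODE` are hypothetical names, the 10-second limit is arbitrary, and the child snippet assumes wikitextparser is installed and that calling `data()` on each table is what triggers the slow path.

```python
import subprocess
import sys
from typing import Optional

# Hypothetical child snippet: reads wikitext from stdin and extracts table data.
CHILD_CODE = (
    "import sys\n"
    "import wikitextparser as wtp\n"
    "tables = wtp.parse(sys.stdin.read()).tables\n"
    "print(sum(len(t.data()) for t in tables))\n"
)


def run_with_timeout(child_code: str, text: str, seconds: float) -> Optional[str]:
    """Run child_code in a fresh interpreter with text on stdin.

    Returns the child's stdout, or None if the time limit was exceeded.
    A child that exits with an error raises CalledProcessError.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", child_code],
            input=text,
            capture_output=True,
            text=True,
            timeout=seconds,  # the child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return done.stdout
```

A page for which `run_with_timeout(CHILD_CODE, page_text, 10)` returns `None` can then be logged and reported instead of blocking the rest of the dump. A separate process is used because a timer inside the same process may not be able to interrupt a call that is stuck inside a C extension.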

suhassumukh commented 4 years ago

Honestly, I could not figure out what was going on in the underlying regex library. I implemented a quick fix, hoping it won't break anything else. (I think it would be better to rewrite that part of the code using some other method, but I don't have enough free time to do it right now.)

I'm no linguistics expert, but doesn't regex break down at some point when parsing any markup language? Don't you need to use a context-free grammar (CFG)?

suhassumukh commented 4 years ago

@5j9 A quick search suggests that Wikitext is neither regular nor context-free. The Wikitext spec says it requires a context-sensitive grammar (CSG). I'd say (though don't bank on my opinion) that using regex has its share of trouble, and that it is impossible to fully describe the language with it.
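To make the regex limitation concrete, here is a small sketch of the nesting problem using `{{...}}` templates. It is not how wikitextparser works internally, and it says nothing about whether Wikitext as a whole is context-sensitive; `INNERMOST` and `top_level_templates` are names made up for this example. A plain regular expression can find the innermost templates, but pairing up arbitrarily nested braces needs some form of counting.

```python
import re
from typing import List

# A plain regex can only match templates that contain no nested braces.
INNERMOST = re.compile(r"\{\{[^{}]*\}\}")


def top_level_templates(text: str) -> List[str]:
    """Return each outermost {{...}} span, tracking nesting with a depth counter."""
    spans = []
    depth = 0
    start = 0
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i  # remember where the outermost template begins
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return spans


sample = "{{a|{{b}}}} x {{c}}"
print(INNERMOST.findall(sample))    # ['{{b}}', '{{c}}']
print(top_level_templates(sample))  # ['{{a|{{b}}}}', '{{c}}']
```

The regex misses the outer `{{a|...}}` template entirely, while the counter recovers it. Real Wikitext adds more complications (for example, triple braces and tables inside templates), so this only illustrates the general point.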