earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License
744 stars 75 forks source link

pywikibot pure python template param parser differences #96

Open jayvdb opened 9 years ago

jayvdb commented 9 years ago

pywikibot has a pure python template param extractor using regex, which behaves slightly different in corner cases to mwpfh.

pywikibot's mwpfh code is simply using "parsed_wikitext_obj.filter_templates(recursive=True)": https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/textlib.py#L1013

and the lovely regex solution is directly below that in the same file - lines 1021-1183.

Tests have been written to identify where they behave the same and where they behave differently. The relevant section starts here:

https://github.com/wikimedia/pywikibot-core/blob/master/tests/textlib_tests.py#L199

_extract_templates_params() is all the cases which textlib pure python and mwpfh are identical.

test_extract_templates_params_mwpfh and test_extract_templates_params_regex each include 10 assertions where they differ - the inputs to each are identical; the output is different.

The main difference is that mwpfh doesnt trim whitespace around param names and values. I'd appreciate knowing whether mwpfh might be interested in adding a mode/flag/whatever which trim's whitespace.

If our two projects can agree on expected behaviour, pywikibot can delete its pure python template parser, related tests can be put into mwpfh, and pywikibot adds mwpfh as a mandatory dependency.

ricordisamoa commented 9 years ago

mwpfh causes almost no loss of data, while pywikibot's regex-based solution does:

The expected behaviour is mwpfh's, pywikibot must drop its own parser ASAP.

earwig commented 9 years ago

I don't think the additional complexity of a strip_whitespace flag or whatever is worth it... this would just make things more confusing. In general, you should be fine using .strip(). We already have a .matches() method on Wikicode objects in the case of titles, where whitespace would matter.

jayvdb commented 9 years ago

The other major difference is that mwpfh orders inner template invocations after the template they were used in. The mwpfh approach complicates processing templates in order, as all dependencies have not been 'seen' before each template is processed.

i.e.

self.assertEqual(func('{{a|b={{c}}}}'), [('a', OrderedDict((('b', '{{c}}'), ))), ('c', OrderedDict())]) vs self.assertEqual(func('{{a|b={{c}}}}'), [('c', OrderedDict()), ('a', OrderedDict((('b', '{{c}}'), )))])

as seen at

https://github.com/wikimedia/pywikibot-core/blob/master/tests/textlib_tests.py#L246 vs https://github.com/wikimedia/pywikibot-core/blob/master/tests/textlib_tests.py#L264

earwig commented 9 years ago

TBH, I don't see the advantage of one order over the other. Can you give an example?

xZise commented 9 years ago

Regarding stripping it would be nice if it would follow the MediaWiki implementation. So that when you get {{tmpl| hello |2= world }} that it would only strip whitespaces around world and not hello. But this should be optional so you're still able to concatenate the elements back into the original string.

By the way prefer that mwpfh is doing them in order. You can get the templates non-recursively and filter templates from the parameters. That way you can also differ between {{a|{{c}}}} and {{a}}{{c}} (or {{c}}{{a}}) as all three return both templates separately and you are not able to tell if the c-template was inside the a-template.

jayvdb commented 9 years ago

I agree. It would be good to have a mode that performs stripping per the MW engine. This mode could also strip comments, noinclude, etc so the caller 'sees' roughly what the MW template processor sees.

On Tue, 28 Jul 2015 20:46 Fabian Neundorf notifications@github.com wrote:

Regarding stripping it would be nice if it would follow the MediaWiki implementation. So that when you get {{tmpl| hello |2= world }} that it would only strip whitespaces around world and not hello.

— Reply to this email directly or view it on GitHub https://github.com/earwig/mwparserfromhell/issues/96#issuecomment-125553197 .

earwig commented 9 years ago

The idea of parsing modes is something I am considering for v0.5/1.0. It seems like the best solution to this issue and related ones like #100.