Open jayvdb opened 9 years ago
mwpfh causes almost no loss of data, while pywikibot's regex-based solution does:
The expected behaviour is mwpfh's, pywikibot must drop its own parser ASAP.
I don't think the additional complexity of a strip_whitespace
flag or whatever is worth it... this would just make things more confusing. In general, you should be fine using .strip()
. We already have a .matches()
method on Wikicode
objects in the case of titles, where whitespace would matter.
The other major difference is that mwpfh orders inner template invocations after the template they were used in. The mwpfh approach complicates processing templates in order, as all dependencies have not been 'seen' before each template is processed.
i.e.
self.assertEqual(func('{{a|b={{c}}}}'), [('a', OrderedDict((('b', '{{c}}'), ))), ('c', OrderedDict())]) vs self.assertEqual(func('{{a|b={{c}}}}'), [('c', OrderedDict()), ('a', OrderedDict((('b', '{{c}}'), )))])
as seen at
https://github.com/wikimedia/pywikibot-core/blob/master/tests/textlib_tests.py#L246 vs https://github.com/wikimedia/pywikibot-core/blob/master/tests/textlib_tests.py#L264
TBH, I don't see the advantage of one order over the other. Can you give an example?
Regarding stripping it would be nice if it would follow the MediaWiki implementation. So that when you get {{tmpl| hello |2= world }}
that it would only strip whitespaces around world
and not hello
. But this should be optional so you're still able to concatenate the elements back into the original string.
By the way prefer that mwpfh is doing them in order. You can get the templates non-recursively and filter templates from the parameters. That way you can also differ between {{a|{{c}}}}
and {{a}}{{c}}
(or {{c}}{{a}}
) as all three return both templates separately and you are not able to tell if the c-template was inside the a-template.
I agree. It would be good to have a mode that performs stripping per the MW engine. This mode could also strip comments, noinclude, etc so the caller 'sees' roughly what the MW template processor sees.
On Tue, 28 Jul 2015 20:46 Fabian Neundorf notifications@github.com wrote:
Regarding stripping it would be nice if it would follow the MediaWiki implementation. So that when you get {{tmpl| hello |2= world }} that it would only strip whitespaces around world and not hello.
— Reply to this email directly or view it on GitHub https://github.com/earwig/mwparserfromhell/issues/96#issuecomment-125553197 .
The idea of parsing modes is something I am considering for v0.5/1.0. It seems like the best solution to this issue and related ones like #100.
pywikibot has a pure python template param extractor using regex, which behaves slightly different in corner cases to mwpfh.
pywikibot's mwpfh code is simply using "parsed_wikitext_obj.filter_templates(recursive=True)": https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/textlib.py#L1013
and the lovely regex solution is directly below that in the same file - lines 1021-1183.
Tests have been written to identify where they behave the same and where they behave differently. The relevant section starts here:
https://github.com/wikimedia/pywikibot-core/blob/master/tests/textlib_tests.py#L199
_extract_templates_params() is all the cases which textlib pure python and mwpfh are identical.
test_extract_templates_params_mwpfh and test_extract_templates_params_regex each include 10 assertions where they differ - the inputs to each are identical; the output is different.
The main difference is that mwpfh doesnt trim whitespace around param names and values. I'd appreciate knowing whether mwpfh might be interested in adding a mode/flag/whatever which trim's whitespace.
If our two projects can agree on expected behaviour, pywikibot can delete its pure python template parser, related tests can be put into mwpfh, and pywikibot adds mwpfh as a mandatory dependency.