Closed maxjeblick closed 6 months ago
I noticed that math formulas are excluded by the hardcoded PARSER_BLACKLIST. It would be nice if this were configurable via some common settings that can be changed from the user side.
You don't actually want to change PARSER_BLACKLIST; that list marks tags whose contents are not wikicode (which is true of math), so that they don't confuse the parser during the initial parse. What you want to do is remove "math" from INVISIBLE_TAGS, just below where you linked to. That doesn't affect the initial parse, just the behavior of strip_code.
Adding more configuration to strip_code to allow customizing the visible tags (or generically whether a given node should be visible) is a good idea. In the meantime, calling mwparserfromhell.definitions.INVISIBLE_TAGS.remove("math") somewhere in your code, while inelegant, should do what you want:
>>> import mwparserfromhell
>>> mwparserfromhell.definitions.INVISIBLE_TAGS.remove("math")
>>> c = mwparserfromhell.parse("foo <math>a + b</math> bar")
>>> c.strip_code()
'foo a + b bar'
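If permanently mutating the module-level list is a concern, one possible refinement is to remove the tag only for the duration of a block and put it back afterwards. The sketch below is an illustration, not part of mwparserfromhell: temporarily_removed is a hypothetical helper name, and it assumes the tag collection is a plain mutable list (which the .remove("math") call above suggests). A stand-in list is used here so the snippet runs without the library installed.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_removed(tags, name):
    """Remove `name` from the mutable list `tags` inside the with-block,
    then restore it at its original position.

    Hypothetical helper, not an mwparserfromhell API.
    """
    index = tags.index(name) if name in tags else None
    if index is not None:
        del tags[index]
    try:
        yield tags
    finally:
        # Restore only if we actually removed something.
        if index is not None:
            tags.insert(index, name)


# Stand-in for mwparserfromhell.definitions.INVISIBLE_TAGS (assumed to be a list).
invisible_tags = ["gallery", "math", "timeline"]

with temporarily_removed(invisible_tags, "math"):
    inside = list(invisible_tags)   # "math" is absent here
after = list(invisible_tags)        # "math" is back in place
```

In real use, the stand-in list would be replaced by mwparserfromhell.definitions.INVISIBLE_TAGS and the strip_code() call would go inside the with-block. Note that this still changes global state while the block runs, so it is not safe if other threads call strip_code() at the same time.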
Thanks a lot for the quick answer! Will close, as the workaround is sufficient for my purposes.
I noticed the Wikipedia parser https://huggingface.co/datasets/wikipedia/blob/main/wikipedia.py deletes formulas such as <math>a + bi</math> in the article text "... every complex number can be expressed in the form <math>a + bi</math>, where ..".
I wonder how to keep these while otherwise cleaning the text as the original script does. I tried to modify the section.strip_code() part below but wasn't able to include the formulas correctly. Any help appreciated!
Minimal example:
which gives