appledora / mwparserfromhtml

An unofficial mirror of our repo for the `mwparserfromhtml` package, a Python library for working with the Wikimedia HTML dumps. Since this is only a mirror, DO NOT open PRs here.
https://pypi.org/project/mwparserfromhtml/
MIT License

Handle inline transclusion differently in plaintext extraction #41

Open appledora opened 2 years ago

appledora commented 2 years ago

In GitLab by @geohci on Aug 30, 2022, 24:21

Example: for the en:Cabbage article, the second paragraph of the plaintext, with transclusions skipped, is `A cabbage generally weighs between .` because the HTML is actually:

`<p id="mwHg">A cabbage generally weighs between <span about="#mwt15" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"convert","href":"./Template:Convert"},"params":{"1":{"wt":"500"},"2":{"wt":"to"},"3":{"wt":"1000"},"4":{"wt":"g"},"5":{"wt":"lbs"},"sigfig":{"wt":"1"}},"i":0}}]}' id="mwHw">500 to 1,000 grams (1 to 2</span><span typeof="mw:Entity" about="#mwt15"> </span><span about="#mwt15">lb)</span>.`

and the wikitext is:

`A cabbage generally weighs between {{convert|500|to|1000|g|lbs|sigfig=1}}.`

Maybe we can have an option that only excludes transclusion when it happens inside certain types of elements instead of being the parent element?
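
To make the proposal concrete, here is a minimal sketch of one possible reading of it. It does not use the `mwparserfromhtml` API; it relies only on Python's standard `html.parser`. The names `PlaintextExtractor`, `extract_plaintext`, and the `keep_inline_in` option are hypothetical, and the sample HTML is the Cabbage snippet above with `data-mw` and `id` attributes dropped for brevity and a closing `</p>` added. The assumption is that a transclusion nested inside one of the listed elements (here `<p>`) is kept, while a transclusion that is itself the top-level element is still skipped.

```python
from html.parser import HTMLParser


class PlaintextExtractor(HTMLParser):
    """Hypothetical helper: collects text, optionally skipping transclusions."""

    VOID_TAGS = {"br", "hr", "img", "link", "meta", "wbr"}

    def __init__(self, skip_transclusions=True, keep_inline_in=()):
        super().__init__(convert_charrefs=True)
        self.skip_transclusions = skip_transclusions
        # Assumption: transclusions nested in these tags count as "inline".
        self.keep_inline_in = set(keep_inline_in)
        self.transclusion_ids = set()  # "about" ids seen on mw:Transclusion
        self.stack = []                # open elements as (tag, skipped)
        self.parts = []                # collected text fragments

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        about = attrs.get("about")
        starts = "mw:Transclusion" in (attrs.get("typeof") or "").split()
        if starts and about:
            self.transclusion_ids.add(about)
        # Siblings that share the "about" id belong to the same transclusion.
        transcluded = starts or (about is not None and about in self.transclusion_ids)
        parent_skipped = bool(self.stack) and self.stack[-1][1]
        inside_kept = any(t in self.keep_inline_in for t, _ in self.stack)
        skipped = parent_skipped or (
            self.skip_transclusions and transcluded and not inside_kept
        )
        if tag not in self.VOID_TAGS:
            self.stack.append((tag, skipped))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not (self.stack and self.stack[-1][1]):
            self.parts.append(data)


def extract_plaintext(html, skip_transclusions=True, keep_inline_in=()):
    parser = PlaintextExtractor(skip_transclusions, keep_inline_in)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Adapted from the issue; data-mw and id attributes omitted for brevity.
CABBAGE_HTML = (
    '<p>A cabbage generally weighs between '
    '<span about="#mwt15" typeof="mw:Transclusion">500 to 1,000 grams (1 to 2</span>'
    '<span typeof="mw:Entity" about="#mwt15"> </span>'
    '<span about="#mwt15">lb)</span>.</p>'
)

# Skipping every transclusion reproduces the truncated sentence.
print(extract_plaintext(CABBAGE_HTML))
# Keeping transclusions that sit inside a paragraph restores the sentence.
print(extract_plaintext(CABBAGE_HTML, keep_inline_in=("p",)))
```

With the default arguments this sketch reproduces the truncated sentence from the report. Passing `keep_inline_in=("p",)` keeps the converted measurement while still dropping a transclusion that is itself a top-level block, such as a templated table.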