earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

Images and interwiki #222

Open bt2901 opened 4 years ago

bt2901 commented 4 years ago

I'm aware of #194 and your reservations about it.

I needed this code for my project anyway, so I figured it would do no harm to open a PR for it. Can we salvage something? I think at least the tag.py change makes sense to merge upstream.

What, in your opinion, is the best way to do it? Some sort of context object or config?

earwig commented 4 years ago

The tag.py change is quite a hack and makes me uncomfortable. What we might need here is some awareness that padding is required when stripping tags in certain cases. (As an example, you have the same problem if you try to strip "foo<br>bar", where the two words should stay separated, but we don't want to break "foo<i>bar</i>", where they should stay joined.) There is definitely a bug to fix here, but I need to think more carefully about how to solve it generally.

Does this just require a way to distinguish block tags from inline tags? Easier said than done: consider "foo<span>bar</span>" vs. "foo<span style="display: block;">bar</span>". Okay, so that's definitely an unlikely edge case. Still, I need to think about it. Something is arguably better than nothing; we can't be perfect unless we actually render the page.
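
To make the padding idea concrete, here is a rough sketch. It is not the parser's actual code: it uses a plain regex rather than mwparserfromhell nodes, and `BLOCK_TAGS`, `strip_tags`, and the `pad` parameter are hypothetical names chosen for illustration. The tag set is an assumption and deliberately incomplete.

```python
import re

# Hypothetical, incomplete set of tags assumed to separate the text around them.
BLOCK_TAGS = {"br", "p", "div", "li", "hr", "tr", "table", "blockquote"}

# Matches an opening, closing, or self-closing tag and captures its name.
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_tags(text, pad=" "):
    """Remove tags, replacing block-like ones with `pad` and inline ones with nothing."""

    def replace(match):
        name = match.group(1).lower()
        return pad if name in BLOCK_TAGS else ""

    return TAG_RE.sub(replace, text)


print(strip_tags("foo<br>bar"))        # block-like tag: words stay separated
print(strip_tags("foo<i>bar</i>"))     # inline tag: words stay joined
# The edge case from above: the style attribute is ignored, so this is treated as inline.
print(strip_tags('foo<span style="display: block;">bar</span>'))
```

A lookup by tag name handles the common cases, but as the last call shows, it cannot see CSS, so a styled `<span>` is still misclassified.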

> Some sort of context object or config?

Yes, I think we'll want to build something like this. It can be set up automatically by the interfaces provided by pywikibot etc., so the user only needs to manage it themselves if they are using the parser directly. My main hesitation is that once we build it, we need to commit to supporting it, and it will introduce a new maintenance burden that could be frustrating if we aren't careful. I would love it if this could be pulled dynamically from the live wiki configuration somehow, so we don't need to hardcode anything in the parser, and then cached locally as appropriate. Either way, this feature would be entirely optional and would need a way to plug into interfaces from other libraries for querying the API.
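
As a starting point for discussion, one possible shape for this is sketched below. Nothing here exists in mwparserfromhell: `WikiConfig`, `ConfigProvider`, and the `fetch` callable are all hypothetical names. The assumption is that the caller (pywikibot or similar) supplies `fetch`, which could for instance wrap a MediaWiki `meta=siteinfo` API query, so the parser itself never talks to the network or hardcodes per-wiki data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class WikiConfig:
    """Hypothetical per-wiki settings the parser could consult."""

    file_namespaces: FrozenSet[str] = frozenset({"file", "image"})
    interwiki_prefixes: FrozenSet[str] = frozenset()

    def _prefix(self, title: str) -> str:
        # Return the lowercased text before the first colon, or "" if there is none.
        prefix, sep, _ = title.partition(":")
        return prefix.strip().lower() if sep else ""

    def is_file_link(self, title: str) -> bool:
        return self._prefix(title) in self.file_namespaces

    def is_interwiki_link(self, title: str) -> bool:
        return self._prefix(title) in self.interwiki_prefixes


class ConfigProvider:
    """Caches configs per site; the fetch callable is supplied by the caller."""

    def __init__(self, fetch: Callable[[str], WikiConfig]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, WikiConfig] = {}

    def get(self, site: str) -> WikiConfig:
        # Only call the (possibly slow) fetcher the first time a site is requested.
        if site not in self._cache:
            self._cache[site] = self._fetch(site)
        return self._cache[site]


def example_fetch(site: str) -> WikiConfig:
    # Stand-in for a real API call; the returned values are made up for the demo.
    return WikiConfig(interwiki_prefixes=frozenset({"fr", "de"}))


provider = ConfigProvider(example_fetch)
config = provider.get("en.wikipedia.org")
print(config.is_file_link("File:Example.png"))
print(config.is_interwiki_link("fr:Paris"))
```

Keeping the fetcher as a plain callable means the feature stays optional: users of the parser alone can pass a function returning a default `WikiConfig`, while other libraries can plug in their own API access.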