earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

Add method to get URL from WikiLink #149

Open ExplodingCabbage opened 8 years ago

ExplodingCabbage commented 8 years ago

Figuring out either the absolute or relative URL of a WikiLink is non-trivial. The code for it in MediaWiki (starting at, say, https://github.com/wikimedia/mediawiki/blob/wmf/1.26wmf24/includes/Title.php#L1679) is hideous and deeply nested and requires many parameters (the link may even depend on whether the server is running IIS 7, bizarrely), which makes the logic hard to follow. It would be nice to port this to mwparserfromhell, taking as parameters all the necessary bits of information that can't be inferred from the markup.
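Roughly, I'm imagining something along these lines (the `wikilink_url` helper and its parameters are purely hypothetical, nothing like it exists in mwparserfromhell today, and this sketch ignores all the encoding subtleties discussed further down the thread):

```python
# Hypothetical sketch only: build a URL from a WikiLink's title plus site
# details passed in as parameters; `wikilink_url` is not part of mwparserfromhell.
from urllib.parse import quote

import mwparserfromhell

def wikilink_url(link, server, articlepath):
    # Everything the markup can't tell us (server, article path, ...) comes in
    # as a parameter.
    title = str(link.title).strip().replace(" ", "_")
    return server + articlepath.replace("$1", quote(title, safe=":/"))

code = mwparserfromhell.parse("[[Möbius strip|a one-sided surface]]")
for link in code.filter_wikilinks():
    print(wikilink_url(link, server="https://en.wikipedia.org", articlepath="/wiki/$1"))
    # -> https://en.wikipedia.org/wiki/M%C3%B6bius_strip
```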

lahwaacz commented 8 years ago

Using MediaWiki's API, this is trivial enough: in this query, take server and articlepath and replace the $1 placeholder with the page title.
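For example (assuming the `requests` package; the endpoint and page title below are only illustrative):

```python
# Fetch `server` and `articlepath` from the siteinfo query, then substitute
# the page title for the $1 placeholder.
import requests

def page_url(api_endpoint, title):
    general = requests.get(api_endpoint, params={
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    }).json()["query"]["general"]
    # e.g. server == "https://en.wikipedia.org", articlepath == "/wiki/$1"
    return general["server"] + general["articlepath"].replace("$1", title)

print(page_url("https://en.wikipedia.org/w/api.php", "Mars"))
```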

ExplodingCabbage commented 8 years ago

@lahwaacz no - that doesn't handle escaping (at the very least we need to look at wfUrlencode to know which characters to percent-encode), and I suspect there are other complicated edge cases buried in the code that I don't know about.

lahwaacz commented 8 years ago

The standard tells you which characters in the URL need to be escaped. If you encode more, the URL might not be pretty, but it will work (assuming that it works in MediaWiki). I don't really see how this is relevant for wikitext parsing.

ExplodingCabbage commented 8 years ago

> The standard tells you which characters in the URL need to be escaped.

This isn't true for several reasons.

Firstly, it isn't true even in general that the URL spec can tell you what characters you need to encode. RFC 3986 defines a set of reserved characters it calls sub-delims which don't innately have any syntactical meaning in URLs but which may be assigned syntactical meaning by particular URL schemes (including the idiosyncratic URL schemes of particular web applications), and only requires that those characters be encoded if not doing so would cause ambiguity. Thus which characters need to be percent-encoded when escaping data for use in a URL varies from website to website.

Secondly, RFC 3986 requires that all non-ASCII characters in URLs be encoded, always (see the ABNF). MediaWiki (like many websites) doesn't conform to this, since https://en.wikipedia.org/wiki/Möbius_strip is prettier than https://en.wikipedia.org/wiki/M%C3%B6bius_strip.

Thirdly, MediaWiki encodes spaces as underscores.

Fourthly, MediaWiki's choice of which ASCII characters to encode, given the constraints of the spec, is somewhat arbitrary and basically impossible to predict without looking at the code or experimenting on a MediaWiki site. The IETF-defined sub-delims are `!$&'()*+,;=`; MediaWiki never encodes `!$()*,;` (and it also leaves `@` unescaped), always encodes `+='&`, and sometimes encodes `:` depending upon the web server software running MediaWiki (a rough approximation of these rules is sketched after the fifth point below).

Fifthly (I think; the spec's not the most readable thing in the world), the spec requires that slashes in the path section of a URL be encoded unless they are delimiting path segments, and MediaWiki deliberately doesn't conform to this either (e.g. https://en.wikipedia.org/wiki// is Wikipedia's page about the forward slash).
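To make the fourth point concrete, here is a rough Python approximation of the encoding rules described in points three to five (an illustration only, not a port of MediaWiki's actual code):

```python
# Rough approximation of the title-encoding rules summarised above;
# illustration only.
from urllib.parse import quote

def mediawiki_encode_title(title, colon_is_safe=True):
    # MediaWiki stores and links titles with underscores in place of spaces.
    title = title.replace(" ", "_")
    # Characters MediaWiki leaves unescaped; ':' gets encoded on some web
    # servers (the IIS 7 quirk mentioned at the top of this issue).
    safe = ";@$!*(),/~" + (":" if colon_is_safe else "")
    return quote(title, safe=safe)  # everything else, including + = ' &, is encoded

print(mediawiki_encode_title("Möbius strip"))  # M%C3%B6bius_strip
print(mediawiki_encode_title("C++"))           # C%2B%2B
```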

> If you encode more, the URL might not be pretty, but it will work

True, but:

> I don't really see how this is relevant for wikitext parsing.

While I can see how this could be argued either way, knowing where a WikiLink points seems to me like part of the process of extracting meaning from the wikitext, and therefore reasonably within the scope of a parser. Also, whether or not it's proper to see it as a parsing step, it's just plain useful, it's non-trivial to implement, and there's not really a better library for such a function to live in than this one.

earwig commented 8 years ago

This falls under the class of issues that require MWPFH to have some knowledge of the parsing environment that it can't gather from the source alone.

Those are tricky under the current system; we don't have a way to specify site details when parsing, and adding one will make MWPFH's "plug-and-play" nature a lot more frustrating.

What a mess!

lahwaacz commented 8 years ago

If you need something canonical to identify the target, you can easily compare the escaped URLs instead of something (half-)human-readable. But identifying by pageid (or something like siteid:pageid if you work with multiple sites) is even better since pages can be renamed. For resolving MediaWiki's redirects you'll need to talk to the API anyway, and MWPFH will most likely never talk to the API or database directly, otherwise we'd have the entire MediaWiki core effectively ported to Python.
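For example, resolving a title to its pageid is a single API request (assuming the `requests` package; the endpoint and title are illustrative):

```python
# Ask the API for the pageid of a title; redirects=1 follows any MediaWiki
# redirect before resolving.
import requests

def resolve_pageid(api_endpoint, title):
    data = requests.get(api_endpoint, params={
        "action": "query",
        "titles": title,
        "redirects": 1,
        "format": "json",
    }).json()
    page = next(iter(data["query"]["pages"].values()))
    return page.get("pageid")  # None if the page doesn't exist

print(resolve_pageid("https://en.wikipedia.org/w/api.php", "Möbius strip"))
```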

ghost commented 8 years ago

I agree with Earwig. Wiki-markup exists independent of any website; it is context-free by the nature of its existence as a language.

legoktm commented 8 years ago

> Wiki-markup exists independent of any website; it is context-free by the nature of its existence as a language.

Errr...it's context-sensitive: https://www.mediawiki.org/wiki/Markup_spec#Feasibility_study. And things like which parser tags and functions work are all dependent upon what is installed, which is specific to individual websites.

ghost commented 8 years ago

I did not mean "context-free" in the formal language-theory sense. In this context, it is clear that the context in question is the site the markup lives on, since this entire discussion is about deriving a website's URLs from its wikilinks.