5j9 / wikitextparser

A Python library to parse MediaWiki WikiText
GNU General Public License v3.0
289 stars 22 forks source link

fix(wikilink): empty wikilinks are internally not replaced with _ #78

Closed TrueBrain closed 1 year ago

TrueBrain commented 4 years ago

"[[|alt]]" is a valid wikilink, but WIKILINK_PARAMFINDITER did not consider it valid. In result, it was seens as a wikilink, but not marked as for, as example, a parser function to handle it. This meant a parser function saw "alt ]]" as another argument, where it really was not.

This was a very interesting one to track down. VALID_TITLE_CHARS is also used by PF_TL_FINDITER, which now also has its behaviour changed. As I am not fluent in reading regex (is anyone really?), I am not sure what the side-effect of that is.

Alternative, the ++ and *+ marker could be moved outside of VALID_TITLE_CHARS, keeping PF_TL_FINDITER the same in behaviour. Let me know if you prefer this.

Ironically, mediawiki has a bug here, where the given test-case works fine, but if the wikilink is [[{{{empty}}}|Alt]] it does fail. wikitextparser, with this PR, does the right thing here too now.

codecov[bot] commented 4 years ago

Codecov Report

Merging #78 into master will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #78   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           31        31           
  Lines         4260      4263    +3     
=========================================
+ Hits          4260      4263    +3     
Impacted Files Coverage Δ
tests/test_parser_function.py 100.00% <100.00%> (ø)
wikitextparser/_spans.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a595b29...f2ccb33. Read the comment docs.

5j9 commented 4 years ago

Well done tracking this one down.

This is what they call "Reverse pipe trick"1 and it is only valid in input wikitext. How Mediawiki will expands the [[|alt]] syntax depends on the title of the page. For example if the title of the page is alt (something), then [[|alt]] will be replaced with [[alt (something)|alt]]. Since wikitextparser currently does not know about the context, it cannot tell how exactly that syntax should be treated. However it still might be a good idea to detect it as a wikilink and let the user decide what to do with it.

A minor issue: now [[ |Alt]] (with is a space between [[ and |) will be detected as a wikilink, but mediawiki does not do that. Actually the [[{{{empty}}}|Alt]] example that you gave above is of the same kind. Apparently mediawiki does the reverse pipe trick as a preprocessing step, before expanding other objects like templates or even comments. Therefore even [[<!-- -->|alt]] is not detected as a wikilink. Maybe this issue is not that important... I'll need to think more about it and its use cases.

TrueBrain commented 4 years ago

One thing I learnt in the last week or so, that I wish I never took the time to look into how meediawiki parses wikitext, and how much the specs are more "what-ever our code does" is :D

I found this page which contains a rant that says exactly how I feel: https://docs.rs/parse_wiki_text/0.1.5/parse_wiki_text/

And this information you just give, adds perfectly in that picture :) I am not even surprised :P


Honestly, the main problem for me is not so much that it is not a valid wikilink, but the fact the | is seen as part of the parent it is in. In our wiki this is part of nearly every page: {{#if:{{{en|}}}|<div (..) > (..) <br/>[[{{{en}}}|EN]]</div>|}} (the (..) are by me; it is a lengthy line). If I split it out a bit more for readability:

{{
#if:
{{{en|}}}
|
<div (..) > (..) <br/>[[{{{en}}}|EN]]</div>
|
}}

The problem is, that without acknowledging [[ ]] as a wikilink, when en is empty, |EN]]</div> is seen as the else part of the if:

{{
#if:
|
<div (..) > (..) <br/>[[
|
EN]]</div>
|
}}

Which is something mediawiki does not do. It does seem to skip the pipe in the [[ ]] even so it is not a valid wikilink (if that makes any sense?). If you know of another way to solve this problem, I would also be very happy. This was an means to an end, basically :) On the other hand, I am also perfectly fine if you close this PR unmerged, as it really only happens if you process wikitext, and feed that back to the parser again.

TrueBrain commented 4 years ago

After some more consideration, and with the info given with you, I think we are better off closing this Pull Request. I now have added a simple piece of code after I process parameters:

    if "[[|" in wikitext.string:
        wikitext.string = re.sub(r'\[\[\|([^\]]*)\]\]', r'%5B%5B%7C\1%5D%5D', wikitext.string)

Which resolves my problem. It is a bit dirty, but it works out just fine :)

I leave it up to you if you want this Pull Request or not :)

5j9 commented 1 year ago

I ended up merging this pull request in a02b9e96e . I decided that false positives are tolerable and can easily be ignored, but the given #if example is much worse and can cause the parser function to be interpreted in a wrong manner.

Thank you so much for creating this pull request!