Text extraction from File: wikilinks has an issue

earwig / mwparserfromhell

A Python parser for MediaWiki wikicode

https://mwparserfromhell.readthedocs.io/

MIT License


Text extraction from File: wikilinks has an issue #87

Open ikuyamada opened 9 years ago

ikuyamada commented 9 years ago

mwparserfromhell seems to have trouble extracting text from "File:" wikilinks that carry additional attributes.

In [1]: import mwparserfromhell
In [2]: w = "[[File:test.jpg|thumb|Label text]]"
In [3]: mwparserfromhell.parse(w).nodes[0].text
Out[3]: u'thumb|Label text'

I think the desired output is not "thumb|Label text" but "Label text".

Technical-13 commented 9 years ago

@ikuyamada I would actually expect it to spit out an array containing ("thumb", "Label text"). I'm guessing that it just hasn't evolved to that yet, and lacking that kind of support, "thumb|Label text" seems correct to me.
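
Until the parser offers something like this, the array-style output can be approximated outside the library. The following is a minimal sketch, not part of mwparserfromhell: `split_link_text` and `caption_of` are hypothetical helper names, and the input is assumed to be the string form of a wikilink's text (for example `str(node.text)`). It splits on pipes that are not nested inside `[[...]]` or `{{...}}`, and treats the last segment as the caption, which is only a heuristic.

```python
def split_link_text(text):
    """Split wikilink text on top-level pipes.

    Pipes nested inside [[...]] or {{...}} are left alone, so a caption
    that itself contains a piped link stays in one piece.
    (Hypothetical helper; not part of mwparserfromhell.)
    """
    parts, current, depth, i = [], [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}") and depth > 0:
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == "|" and depth == 0:
            # A top-level pipe ends the current segment.
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def caption_of(text):
    """Return the last top-level segment, assumed to be the caption."""
    return split_link_text(text)[-1]


segments = split_link_text("thumb|Label text")
caption = caption_of("thumb|Label text")
```

Note that the "last segment is the caption" assumption breaks for links without a caption, such as `[[File:test.jpg|thumb]]`, where it would return `thumb`.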

earwig commented 9 years ago

"thumb|Label text" is correct, since the parser treats all wikilink-like things the same way. Ideally, we would understand what a file is and treat its caption specially (so you could do node.caption instead of node.text, which would give the entire chunk), but this is problematic since we don't have a reliable way to determine what is a file link and what isn't, due to site- and language-specific namespace aliases. I suppose we could just have .caption exist for all links, but this would entail new parsing rules. I'm willing to add this since it's been requested before.

Technical-13 commented 9 years ago

Feel free to :fish: me if it is already in there, but does this mean that you are going to have it parse the whole string and output node.height, node.width, node.align, node.valign, node.mode (thumb, frameless, etc.), and node.link? If you are going to parse out each chunk, then you might as well put them in their own places.

earwig commented 9 years ago

Hm... that's a bit clunky, but I suppose it's better than having a dictionary or some other alternative I can't think of right now.

ricordisamoa commented 9 years ago

Many arguments for file links can also have localized forms...
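
For illustration, the per-attribute idea discussed above could be sketched as follows. This is only a rough sketch under strong assumptions: `FileOptions` and `parse_file_options` are hypothetical names, not mwparserfromhell API, and the keyword sets cover only the common English forms, so localized keywords would fall through and be treated as the caption.

```python
import re
from dataclasses import dataclass
from typing import Optional

# English-only keywords (an assumption for this sketch; wikis in other
# languages accept localized forms that are not listed here).
MODES = {"thumb", "thumbnail", "frame", "framed", "frameless"}
ALIGNS = {"left", "right", "center", "none"}
VALIGNS = {"baseline", "middle", "sub", "super",
           "text-top", "text-bottom", "top", "bottom"}
# Matches "200px", "x100px", and "200x100px".
SIZE_RE = re.compile(r"^(\d+)?(?:x(\d+))?px$")


@dataclass
class FileOptions:
    mode: Optional[str] = None
    align: Optional[str] = None
    valign: Optional[str] = None
    width: Optional[int] = None
    height: Optional[int] = None
    link: Optional[str] = None
    caption: Optional[str] = None


def parse_file_options(parts):
    """Classify already-split file link segments into named fields.

    `parts` is the list of pipe-separated segments after the file name.
    Anything unrecognized is treated as the caption.
    """
    opts = FileOptions()
    for part in parts:
        token = part.strip()
        size = SIZE_RE.match(token)
        if token in MODES:
            opts.mode = token
        elif token in ALIGNS:
            opts.align = token
        elif token in VALIGNS:
            opts.valign = token
        elif size and (size.group(1) or size.group(2)):
            opts.width = int(size.group(1)) if size.group(1) else None
            opts.height = int(size.group(2)) if size.group(2) else None
        elif token.startswith("link="):
            opts.link = token[len("link="):]
        else:
            opts.caption = part
    return opts


options = parse_file_options(
    ["thumb", "right", "200x100px", "link=Main Page", "Label text"]
)
```

A dataclass keeps each attribute in its own place, as suggested earlier in the thread, while the fall-through to `caption` shows why localized keywords are a problem: an unrecognized option is indistinguishable from caption text.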