earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License

The tokenizer incorrectly handles some difficult tag-related markup #40

Open earwig opened 11 years ago

earwig commented 11 years ago
  1. Bold and italics that cross contexts are handled incorrectly, because the tree structure does not support overlapping nodes (for example, ''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
  2. Open tags that do not have a close tag before the parser reaches EOF are ignored, whereas some of them should be parsed (like bold and italics) and have some kind of "hidden close" flag set.
  3. MediaWiki counts the occurrences of ; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
  4. MediaWiki prevents some tags from crossing certain contexts (italics and bold can't cross headings, for example) but this implementation has no such restriction.
  5. The parser only recognizes a space as the separator character between the URL and its link title in [ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).

1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.
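Point 5 can be illustrated with a small standard-library sketch. This is neither MediaWiki's actual algorithm nor part of mwparserfromhell. The function name `split_external_link` is hypothetical, and the rule that a URL ends at whitespace, a bracket, an angle bracket, a double quote, or a run of two apostrophes is an assumption made for illustration.

```python
import re

# Assumption: the URL stops at whitespace, [ ] < > ", or at two consecutive
# apostrophes (the start of italic/bold markup). A single apostrophe is kept.
_EXT_LINK = re.compile(
    r"\[(?P<url>[a-z][a-z0-9+.-]*://(?:(?!'')[^\s\[\]<>\"])+)"
    r"[ \t]*(?P<title>[^\]\n]*)\]",
    re.IGNORECASE,
)


def split_external_link(markup):
    """Split a bracketed external link into (url, title), or return None."""
    match = _EXT_LINK.fullmatch(markup)
    if match is None:
        return None
    return match.group("url"), match.group("title")


# The space-separated form and the form from point 5 both split cleanly.
print(split_external_link("[http://example.com Example]"))
print(split_external_link("[http://example.com/''Example'']"))
```

A tokenizer that only looks for a space would treat the whole second string as a URL with no title, which is the behaviour point 5 describes.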

earwig commented 11 years ago

Regarding (1), a line from MediaWiki's source:

            # ''Something [http://www.cool.com cool''] -->
            # <i>Something</i><a href="http://www.cool.com"..><i>cool></i></a>
ghost commented 11 years ago

Also, this.

== Something ==
'' Hello, world!

== Something else ==
Lorem ipsum dolor sit amet.''
earwig commented 11 years ago

So it seems italics/bold can't cross links but can cross templates. I need to figure out exactly which nodes are restrictive.

earwig commented 11 years ago

1946cf6

Prillan commented 10 years ago

Hi! There seems to be a case you've missed.

Bold (and italics I guess) are implicitly closed when wikitable cells end. E.g. http://wiki.teamliquid.net/starcraft2/index.php?title=2014_WCS_Season_1_Europe/Premier&oldid=687367

{| class="wikitable"
|width=190px bgcolor="{{RaceColor|p}}" align="center" | '''{{p}} Protoss ''(13)''
|width=190px bgcolor="{{RaceColor|t}}" align="center" | '''{{t}} Terran ''(8)''
|width=190px bgcolor="{{RaceColor|z}}" align="center" | '''{{z}} Zerg ''(11)''

gives

<table class="wikitable">
<tr>
<td width="190px" bgcolor="#B8F2B8" align="center"> <b><a href="/starcraft2/File:Picon_small.png" class="image" title="Protoss"><img alt="Protoss" src="/starcraft/images2/a/ab/Picon_small.png" width="17" height="15" /></a> Protoss <i>(13)</i></b>
</td>
<td width="190px" bgcolor="#B8B8F2" align="center"> <b><a href="/starcraft2/File:Ticon_small.png" class="image" title="Terran"><img alt="Terran" src="/starcraft/images2/9/9d/Ticon_small.png" width="17" height="15" /></a> Terran <i>(8)</i></b>
</td>
<td width="190px" bgcolor="#F2B8B8" align="center"> <b><a href="/starcraft2/File:Zicon_small.png" class="image" title="Zerg"><img alt="Zerg" src="/starcraft/images2/c/c9/Zicon_small.png" width="17" height="15" /></a> Zerg <i>(11)</i></b>
</td>
earwig commented 10 years ago

Hmm... yeah, that's tough because the parser doesn't understand tables yet. I'll need to add that before this is fixable.

danvk commented 10 years ago

Pulling in a workaround from #80: @earwig suggested passing skip_style_tags=True to mwparserfromhell.parse to work around @Prillan's issue. This worked perfectly.

To get this feature, I had to track the development version on GitHub rather than the released version on PyPI. Here's the line from my requirements.txt:

-e git+https://github.com/earwig/mwparserfromhell.git#egg=mwparserfromhell
earwig commented 9 years ago

Most of this is going to require an overhaul of how parsing is done (I finally have an idea how I'm going to do it, but it'll be a lot of work)... so pushing this back as the main task for v1.0.

lahwaacz commented 8 years ago

Consider this wikitext:

''foo
bar''

MediaWiki 1.26 parses this as

<i>foo</i>
bar

which suggests that style markup cannot span multiple lines. mwparserfromhell does this the hard/old? way:

\n
<
      i
>
      foo\nbar
</
      i
>
\n
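MediaWiki's result above is consistent with quote markup being resolved one line at a time. The sketch below is a deliberately simplified, standard-library illustration of that idea, not MediaWiki's algorithm. The function name `render_italics_per_line` is hypothetical, it handles only `''` (no `'''` bold), and dropping empty `<i></i>` pairs is an assumption made so that the output matches the report.

```python
def render_italics_per_line(wikitext):
    """Convert '' pairs to <i>...</i>, closing any open italic at end of line."""
    rendered = []
    for line in wikitext.split("\n"):
        pieces = line.split("''")
        html = [pieces[0]]
        is_open = False
        for piece in pieces[1:]:
            # Each '' toggles the italic state.
            html.append("</i>" if is_open else "<i>")
            is_open = not is_open
            html.append(piece)
        if is_open:
            # Italics never survive a newline: close them implicitly.
            html.append("</i>")
        # Assumption: empty elements are cleaned up afterwards.
        rendered.append("".join(html).replace("<i></i>", ""))
    return "\n".join(rendered)


print(render_italics_per_line("''foo\nbar''"))
```

For the two-line input from this comment, the sketch closes the italic after `foo` and leaves `bar` as plain text, matching the MediaWiki 1.26 output quoted above.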
earwig commented 8 years ago

Oh joy.

mhsmith commented 8 years ago

almond.txt

The attached file is a reduced version of https://en.wikipedia.org/w/index.php?title=Almond&oldid=706024513. I'd like to reduce it more, but any structural change anywhere in the text makes the problem disappear, so I don't know if this is actually an instance of this bug.

The initial table is parsed correctly, subject to point 2 above, i.e. the unclosed <small> and <center> tags are returned as plain text. But everything after the table is returned as plain text too, with the exception of headings and lists. For example:

===
       Almond flour and skins
===
\n[[Almond flour]] is often used as a [[gluten-free]] alternative to wheat flour

Replicating the initial line, like this:

{|
|-
| Production<small>(million tonnes)
|-
| Production<small>(million tonnes)
|-
| {{flag|USA}} || style="text-align:center;"|<center> 1.8
|-

Results in the rest of the table not being parsed either:

<
      table
>
      <
            tr
      >
            <
                  td
            >
                   Production<small>(million tonnes)\n
            </
                  td
            >
      </
            tr
      >
      |-\n| Production<small>(million tonnes)\n|-\n| {{flag|USA}} || style="text-align:center;"|<center> 1.8\n|-\n| {{flag|Australia}} || style="text-align:center;"|<center> 0.16\n|-\n| {{flag|Spain}} || style="text-align:center;" |<center> 0.15\n|-\n| {{flag|Morocco}} || style="text-align:center;"|<center> 0.1\n|-\n| {{flag|Iran}} || style="text-align:center;"|<center> 0.09\n|-\n!'''World''' !! style="text-align:center;"|<center> '''2.92'''\n
</
      table
>
mhsmith commented 8 years ago

Here's a really weird example from https://fr.wikipedia.org/w/index.php?title=Opposition_p%C3%A9rih%C3%A9lique&oldid=112493222 :

[[Image:Opposition périhélique.PNG|thumb|250px|Schéma présentant les oppositions périhélique et aphélique de la {{quoi|[[Terre]] et de [[Mars (planète)|Mars]]]]
On dit que deux corps célestes sont en '''opposition périhélique''' lorsque tous deux sont simultanément au [[périhélie]] de leur orbite en alignement parfait avec le [[Soleil]]. Il en résulte que la distance entre ces deux corps célestes est alors minimale.}}

With the template interrupted by the end of the image context, MediaWiki appears to actually invoke the template twice in order to achieve the author's (presumed) intention.

vladiscripts commented 8 years ago

In answer to #148: perhaps... AWB marks many of the pages with this issue as "have unclosed tags", but not all of them; for example, it reports no tag errors in https://ru.wikipedia.org/w/index.php?title=%D0%9B%D0%B8%D0%BC%D0%BE%D0%BD&oldid=76351442. This page has no errors either.

The tables are in one section of a page, but the parser does not see templates in other sections. Could the parser recognize "== ==" headings as a secondary end-of-table marker?
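The heading idea can be tried outside the parser as a pre-check. The following is a hypothetical standard-library sketch, not an mwparserfromhell feature: the function name `find_unclosed_tables` and the line-based matching of `{|`, `|}` and `== ==` headings are assumptions, and real wikitext has more edge cases (templates that emit table markup, for instance).

```python
import re

# A heading line: the same run of 1-6 "=" on both sides of the title.
HEADING = re.compile(r"^(={1,6})[^\n]*?\1[ \t]*$")


def find_unclosed_tables(wikitext):
    """Return 1-based line numbers of `{|` openers not closed by `|}`
    before the next heading (or before the end of the text)."""
    open_tables = []
    unclosed = []
    for number, line in enumerate(wikitext.split("\n"), start=1):
        stripped = line.strip()
        if stripped.startswith("{|"):
            open_tables.append(number)
        elif stripped.startswith("|}"):
            if open_tables:
                open_tables.pop()
        elif open_tables and HEADING.match(stripped):
            # Treat the heading as a secondary end-of-table marker.
            unclosed.extend(open_tables)
            open_tables.clear()
    unclosed.extend(open_tables)
    return unclosed


sample = "{|\n| a\n== Heading ==\ntext\n{|\n| b\n|}"
print(find_unclosed_tables(sample))
```

A caller could use the reported line numbers to insert a closing `|}` before the heading, or to parse each section separately, before handing the text to the parser.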

bfontaine commented 3 years ago

Other weird ones with malformed italics in templates:

mwparserfromhell.parse("{{foo|''bar}} {{foo|bar''}}").filter_templates()
# => ["{{foo|''bar}}", "{{foo|bar''}}"]

mwparserfromhell.parse("{{foo|''bar}} ''...'' {{foo|bar''}}").filter_templates()
# => ["{{foo|bar''}}"]

mwparserfromhell.parse("{{foo|''bar}} ''").filter_templates()
# => []

mwparserfromhell.parse("{{foo|''bar}} ''bar''").filter_templates()
# => []

mwparserfromhell.parse("{{foo|''bar}}").filter_templates()
# => ["{{foo|''bar}}"]