Open earwig opened 11 years ago
Regarding (1), a line from MediaWiki's source:
# ''Something [http://www.cool.com cool''] -->
# <i>Something</i><a href="http://www.cool.com"..><i>cool></i></a>
Also, this.
== Something ==
'' Hello, world!
== Something else ==
Lorem ipsum dolor sit amet.''
So it seems italics/bold can't cross links but can cross templates. I need to figure out exactly which nodes are restrictive.
1946cf6
Hi! There seems to be a case you've missed.
Bold (and italics I guess) are implicitly closed when wikitable cells end. E.g. http://wiki.teamliquid.net/starcraft2/index.php?title=2014_WCS_Season_1_Europe/Premier&oldid=687367
{| class="wikitable"
|width=190px bgcolor="{{RaceColor|p}}" align="center" | '''{{p}} Protoss ''(13)''
|width=190px bgcolor="{{RaceColor|t}}" align="center" | '''{{t}} Terran ''(8)''
|width=190px bgcolor="{{RaceColor|z}}" align="center" | '''{{z}} Zerg ''(11)''
gives
<table class="wikitable">
<tr>
<td width="190px" bgcolor="#B8F2B8" align="center"> <b><a href="/starcraft2/File:Picon_small.png" class="image" title="Protoss"><img alt="Protoss" src="/starcraft/images2/a/ab/Picon_small.png" width="17" height="15" /></a> Protoss <i>(13)</i></b>
</td>
<td width="190px" bgcolor="#B8B8F2" align="center"> <b><a href="/starcraft2/File:Ticon_small.png" class="image" title="Terran"><img alt="Terran" src="/starcraft/images2/9/9d/Ticon_small.png" width="17" height="15" /></a> Terran <i>(8)</i></b>
</td>
<td width="190px" bgcolor="#F2B8B8" align="center"> <b><a href="/starcraft2/File:Zicon_small.png" class="image" title="Zerg"><img alt="Zerg" src="/starcraft/images2/c/c9/Zicon_small.png" width="17" height="15" /></a> Zerg <i>(11)</i></b>
</td>
Hmm... yeah, that's tough because the parser doesn't understand tables yet. I'll need to add that before this is fixable.
Pulling in a workaround from #80: @earwig suggested passing skip_style_tags=True to mwparserfromhell.parse to work around @Prillan's issue. This worked perfectly.
To get this feature, I had to track the development version on GitHub rather than the released version on PyPI. Here's the line from my requirements.txt:
-e git+https://github.com/earwig/mwparserfromhell.git#egg=mwparserfromhell
Most of this is going to require an overhaul of how parsing is done (I finally have an idea how I'm going to do it, but it'll be a lot of work)... so pushing this back as the main task for v1.0.
Consider this wikitext:
''foo
bar''
MediaWiki 1.26 parses this as
<i>foo</i>
bar
which suggests that style markup cannot span across multiple lines. mwparserfromhell does this the hard/old? way:
\n<i>foo\nbar</i>\n
Oh joy.
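To make the line-by-line behaviour concrete, here is a rough sketch of what the MediaWiki output above implies. It is not MediaWiki's algorithm and not how mwparserfromhell works: render_italics_per_line is a hypothetical helper that only understands two-apostrophe italics, closes them implicitly at the end of each line, and drops empty <i></i> pairs (an assumption made so the result matches the output quoted above).

```python
def render_italics_per_line(wikitext: str) -> str:
    """Render '' italics one line at a time (illustrative sketch only)."""
    out_lines = []
    for line in wikitext.split("\n"):
        pieces = []
        # Splitting on '' alternates between plain text and italic text.
        for index, part in enumerate(line.split("''")):
            if index % 2 == 1:
                # Odd segments are inside an italic run. An unclosed run is
                # implicitly closed at the end of the line.
                if part:  # assumption: empty <i></i> pairs are dropped
                    pieces.append("<i>" + part + "</i>")
            else:
                pieces.append(part)
        out_lines.append("".join(pieces))
    return "\n".join(out_lines)


print(render_italics_per_line("''foo\nbar''"))
```

Bold and mixed ''/''' runs are deliberately left out; they are exactly the hard part discussed in this issue.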
The attached file is a reduced version of https://en.wikipedia.org/w/index.php?title=Almond&oldid=706024513. I'd like to reduce it more, but any structural change anywhere in the text makes the problem disappear, so I don't know if this is actually an instance of this bug.
The initial table is parsed correctly, subject to point 2 above, i.e. the unclosed <small> and <center> tags are returned as plain text. But everything after the table is returned as plain text too, with the exception of headings and lists. For example:
=== Almond flour and skins === \n[[Almond flour]] is often used as a [[gluten-free]] alternative to wheat flour
Replicating the initial line, like this:
{|
|-
| Production<small>(million tonnes)
|-
| Production<small>(million tonnes)
|-
| {{flag|USA}} || style="text-align:center;"|<center> 1.8
|-
Results in the rest of the table not being parsed either:
< table > < tr > < td > Production<small>(million tonnes)\n </ td > </ tr > |-\n| Production<small>(million tonnes)\n|-\n| {{flag|USA}} || style="text-align:center;"|<center> 1.8\n|-\n| {{flag|Australia}} || style="text-align:center;"|<center> 0.16\n|-\n| {{flag|Spain}} || style="text-align:center;" |<center> 0.15\n|-\n| {{flag|Morocco}} || style="text-align:center;"|<center> 0.1\n|-\n| {{flag|Iran}} || style="text-align:center;"|<center> 0.09\n|-\n!'''World''' !! style="text-align:center;"|<center> '''2.92'''\n </ table >
Here's a really weird example from https://fr.wikipedia.org/w/index.php?title=Opposition_p%C3%A9rih%C3%A9lique&oldid=112493222 :
[[Image:Opposition périhélique.PNG|thumb|250px|Schéma présentant les oppositions périhélique et aphélique de la {{quoi|[[Terre]] et de [[Mars (planète)|Mars]]]]
On dit que deux corps célestes sont en '''opposition périhélique''' lorsque tous deux sont simultanément au [[périhélie]] de leur orbite en alignement parfait avec le [[Soleil]]. Il en résulte que la distance entre ces deux corps célestes est alors minimale.}}
With the template interrupted by the end of the image context, MediaWiki appears to actually invoke the template twice in order to achieve the author's (presumed) intention.
Answering #148: perhaps... AWB marks many of the pages with this issue as having unclosed tags, but not all of them; e.g. it reports no tag errors for https://ru.wikipedia.org/w/index.php?title=%D0%9B%D0%B8%D0%BC%D0%BE%D0%BD&oldid=76351442. This page has no errors either.
The tables are all in one section of the page, but the parser doesn't see templates in the other sections. Could the parser recognize a "== ==" heading as a secondary end-of-table marker?
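A rough sketch of that heading heuristic, to show what is being proposed. This is hypothetical: find_table_end and the HEADING pattern are made up for illustration, nested tables are ignored, and neither MediaWiki nor mwparserfromhell is confirmed to behave this way.

```python
import re

# A line such as "== Something ==" (assumed heading shape for this sketch).
HEADING = re.compile(r"^=+.*=+\s*$")


def find_table_end(lines, start):
    """Return the index of the last line of the table opened at lines[start]."""
    for index in range(start + 1, len(lines)):
        stripped = lines[index].strip()
        if stripped.startswith("|}"):
            return index  # explicit close
        if HEADING.match(stripped):
            return index - 1  # implicit close just before the heading
    return len(lines) - 1  # table runs to the end of the text


page = ["{|", "| Production<small>(million tonnes)", "== Next section ==", "{{flag|USA}}"]
print(find_table_end(page, 0))
```

With a cut-off like this, a table that is never closed would at least stop affecting the templates in later sections.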
Other weird ones with malformed italics in templates:
mwparserfromhell.parse("{{foo|''bar}} {{foo|bar''}}").filter_templates()
# => ["{{foo|''bar}}", "{{foo|bar''}}"]
mwparserfromhell.parse("{{foo|''bar}} ''...'' {{foo|bar''}}").filter_templates()
# => ["{{foo|bar''}}"]
mwparserfromhell.parse("{{foo|''bar}} ''").filter_templates()
# => []
mwparserfromhell.parse("{{foo|''bar}} ''bar''").filter_templates()
# => []
mwparserfromhell.parse("{{foo|''bar}}").filter_templates()
# => ["{{foo|''bar}}"]
''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
[ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).
1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.
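The description-list rule mentioned above (the number of leading ; caps how many : are parsed) can be sketched as follows. This only illustrates the rule as stated here; split_description_line is a hypothetical helper, not part of mwparserfromhell or MediaWiki.

```python
def split_description_line(line: str) -> list:
    """Split a ';term:definition' line, parsing at most as many ':' as there are leading ';'."""
    depth = len(line) - len(line.lstrip(";"))  # number of leading ';'
    body = line[depth:]
    # str.split with maxsplit=depth treats any further ':' as plain text.
    return body.split(":", depth)


print(split_description_line(";term:definition:more"))
print(split_description_line(";;a:b:c:d"))
```

Under this reading, the current one-colon behaviour corresponds to always using a depth of 1.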