CXuesong / MwParserFromScratch

A basic .NET Library for parsing wikitext into AST.
Apache License 2.0
18 stars 5 forks source link

Parser seems to hang, but just takes a LONG time #14

Closed RobSchoenaker closed 4 years ago

RobSchoenaker commented 4 years ago

<ol start="18"> <li> Martin Pfyffer von Altishofen - LU (1835-1847) <li> Franz Xaver Leopold Meyer von Schauensee - LU (1848-1860) <li> Alfred von Sonnenberg - LU (1860-1878) <li> Louis-Martin de Courten - [[Wallis (kanton)|VS]] (1878-1901) <li> Leopold Meyer von Schauensee - LU (1901-1910) <li> Jules Repond - [[Fribourg (kanton)|FR]] (1910-1921) <li> Alois Hirschbühl - [[Graubünden|GR]] (1921-1935) <li> Georg von Sury d’Aspremont - [[Solothurn (kanton)|SO]] (1935-1942) <li> Heinrich Pfyffer von Altishofen - LU (1942-1957) <li> Robert Nünlist - LU (1957-1972) <li> Franz Pfyffer von Altishofen - LU (1972-1982) <li> [[Alois Estermann]] - LU (1998-1998) <li> Pius Segmüller - [[Sankt Gallen (kanton)|SG]] (1998-2002) <li> Elmar Theodor Mäder - SG (2002-2008) <li> [[Daniel Anrig|Daniel Rolf Anrig]] - SG (2008-2015) <li> [[Christoph Graf]] - LU (vanaf 2015) </li></ol>

This is an excerpt from a dutch Wikipedia entry. On my machine, this takes about 10 seconds to parse. It seems to be related to the number of <li> items. When I remove one, it's fast(er), but when I add one, it's time for coffee...

CXuesong commented 4 years ago

I've fed the text into ParserTracer project in the repo. It seems that parser attempt to match (?i)</(li)(\s*)> for 2245601 times.

And there are 196608 backtraces at the end of the following line. I think the parser is looking for the closing <li> tag.

<li> [[Daniel Anrig|Daniel Rolf Anrig]] - SG (2008-2015) <-- here

I think this is related to missing </li> tags in your given wikitext.

This is a pathological case. MW seems to complete the closing </li> tag for you when render wikitext, and perhaps we need to take a similar behavior during parsing.

Parsed 892 characters in 8.316203 sec.
Regex Matching
    Count  Elpsd ms   Average ms Expression
         1     0.2494     0.2494 /?>|[\s=]
         1     0.0018     0.0018 [>"]|/>
    296303   297.6392      0.001 (?i)</(ol)(\s*)>
    2245601  1798.4757     0.0008 (?i)</(li)(\s*)>
     78288    42.6093     0.0005 \||\n|\[\[|\]\]
     41424    23.3913     0.0005 \n|\[\[|\]\]
    2838211  1127.1949     0.0003 \n

Fallbacks: 1499693
    Count  Position Text
    196608    818  - SG (2008-2015)\r\n<li> [[Chri
    196608    841  [[Christoph Graf]] - LU (vana
    196608    860  - LU (vanaf 2015)\r\n</li></ol>
    114688    779 <li> [[Daniel Anrig|Daniel Rol
     98304    783  [[Daniel Anrig|Daniel Rolf An
     98304    837 <li> [[Christoph Graf]] - LU (
     65536    784 [[Daniel Anrig|Daniel Rolf Anr
     65536    842 [[Christoph Graf]] - LU (vanaf
     57344    736 <li> Elmar Theodor M?der - SG
     49152    722  (1998-2002)\r\n<li> Elmar Theod
     49152    740  Elmar Theodor M?der - SG (200
     32768    786 Daniel Anrig|Daniel Rolf Anrig
     32768    799 Daniel Rolf Anrig]] - SG (2008
     32768    844 Christoph Graf]] - LU (vanaf 2
     32767    885 </ol>\r\n
     28672    672 <li> Pius Segmüller - [[Sankt
     24576    653  - LU (1998-1998)\r\n<li> Pius S
     24576    676  Pius Segmüller - [[Sankt Gall
     16384    694 [[Sankt Gallen (kanton)|SG]] (
     14336    629 <li> [[Alois Estermann]] - LU
     12288    633  [[Alois Estermann]] - LU (199
      8192    634 [[Alois Estermann]] - LU (1998
      8192    696 Sankt Gallen (kanton)|SG]] (19
      8192    718 SG]] (1998-2002)\r\n<li> Elmar T
      7168    577 <li> Franz Pfyffer von Altisho
      6144    581  Franz Pfyffer von Altishofen
      4096    636 Alois Estermann]] - LU (1998-1
      3584    539 <li> Robert Nünlist - LU (1957
      3072    543  Robert Nünlist - LU (1957-197
      1792    484 <li> Heinrich Pfyffer von Alti
      1536    470  (1935-1942)\r\n<li> Heinrich Pf
      1536    488  Heinrich Pfyffer von Altishof
       896    411 <li> Georg von Sury d’Aspremon
       768    397  (1921-1935)\r\n<li> Georg von S
       768    415  Georg von Sury d’Aspremont -
       512    445 [[Solothurn (kanton)|SO]] (193
       448    356 <li> Alois Hirschbühl - [[Grau
       384    342  (1910-1921)\r\n<li> Alois Hirsc
       384    360  Alois Hirschbühl - [[Graubünd
       256    380 [[Graubünden|GR]] (1921-1935)\r
       256    447 Solothurn (kanton)|SO]] (1935-
       256    466 SO]] (1935-1942)\r\n<li> Heinric
       224    298 <li> Jules Repond - [[Fribourg
       192    302  Jules Repond - [[Fribourg (ka
       128    318 [[Fribourg (kanton)|FR]] (1910
       128    382 Graubünden|GR]] (1921-1935)\r\n<
       128    393 GR]] (1921-1935)\r\n<li> Georg v
       112    246 <li> Leopold Meyer von Schauen
        96    232  (1878-1901)\r\n<li> Leopold Mey
        96    250  Leopold Meyer von Schauensee
Parsed tree
.Wikitext            [<ol start="18">\r\n<li]
..Paragraph          [<ol start="18">\r\n<li]
...HtmlTag           [<ol start="18">\r\n<li]
....TagAttribute     [ start="18"]
.....Run             [start]
......PlainText      [start]
.....Wikitext        [18]
......Paragraph      [18]
.......PlainText     [18]
....Wikitext         [\r\n<li> Martin Pfyffe]
.....Paragraph       [\r\n<li> Martin Pfyffe]
......PlainText      [\r\n<li> Martin Pfyffe]
......WikiLink       [[[Wallis (kanton)|VS]
.......Run           [Wallis (kanton)]
........PlainText    [Wallis (kanton)]
.......Run           [VS]
........PlainText    [VS]
......PlainText      [ (1878-1901)\r\n<li> L]
......WikiLink       [[[Fribourg (kanton)|]
.......Run           [Fribourg (kanton)]
........PlainText    [Fribourg (kanton)]
.......Run           [FR]
........PlainText    [FR]
......PlainText      [ (1910-1921)\r\n<li> A]
......WikiLink       [[[Graubünden|GR]]]
.......Run           [Graubünden]
........PlainText    [Graubünden]
.......Run           [GR]
........PlainText    [GR]
......PlainText      [ (1921-1935)\r\n<li> G]
......WikiLink       [[[Solothurn (kanton)]
.......Run           [Solothurn (kanton)]
........PlainText    [Solothurn (kanton)]
.......Run           [SO]
........PlainText    [SO]
......PlainText      [ (1935-1942)\r\n<li> H]
......WikiLink       [[[Alois Estermann]]]
.......Run           [Alois Estermann]
........PlainText    [Alois Estermann]
......PlainText      [ - LU (1998-1998)\r\n<]
......WikiLink       [[[Sankt Gallen (kant]
.......Run           [Sankt Gallen (kanton]
........PlainText    [Sankt Gallen (kanton]
.......Run           [SG]
........PlainText    [SG]
......PlainText      [ (1998-2002)\r\n<li> E]
......WikiLink       [[[Daniel Anrig|Danie]
.......Run           [Daniel Anrig]
........PlainText    [Daniel Anrig]
.......Run           [Daniel Rolf Anrig]
........PlainText    [Daniel Rolf Anrig]
......PlainText      [ - SG (2008-2015)\r\n]
......HtmlTag        [<li> [[Christoph Gra]
.......Wikitext      [ [[Christoph Graf]] ]
........Paragraph    [ [[Christoph Graf]] ]
.........PlainText   [ ]
.........WikiLink    [[[Christoph Graf]]]
..........Run        [Christoph Graf]
...........PlainText [Christoph Graf]
.........PlainText   [ - LU (vanaf 2015)\r\n]
...PlainText         [\r\n]
RobSchoenaker commented 4 years ago

That's exactly what I thought. This is from an existing Wiki article. I understand it's not a normal case, but preventing CPU hogging would be OK her I suppose. Can be fixed?

CXuesong commented 4 years ago

There are actually two issues here

CXuesong commented 4 years ago

So, what I can do here are

CXuesong commented 4 years ago

Btw, I did a check on mwparserfromhell 0.5.4, it treats unclosed HTML tags as plaintext. So we probably can beat them by resolving this bug 😂

image

RobSchoenaker commented 4 years ago

That looks like a very legitimate fix to me 👍

CXuesong commented 4 years ago

Fixed and published in v0.3.0-int.1.

CXuesong commented 4 years ago

ParserTracer output

Parsed 887 characters in 0.0337111 sec.
Regex Matching
    Count  Elpsd ms   Average ms Expression
         1     0.2493     0.2493 /?>|[\s=]
        39     0.1768     0.0045 (?i)</(ol)(\s*)>
         6     0.0111     0.0018 \n|\[\[|\]\]
        38     0.0491     0.0012 </li\s*>|<li(\s*>|\s+)
         8     0.0101     0.0012 \||\n|\[\[|\]\]
         1     0.0011     0.0011 [>"]|/>
       120     0.0912     0.0007 \n

Fallbacks: 219
    Count  Position Text
         6      4 start="18">\r\n<li> Martin Pfyff
         6     11 18">\r\n<li> Martin Pfyffer von
         6     15 \r\n<li> Martin Pfyffer von Alti
         6     21  Martin Pfyffer von Altishofen
         6     74  Franz Xaver Leopold Meyer von
         6    138  Alfred von Sonnenberg - LU (1
         6    183  Louis-Martin de Courten - [[W
         6    232  (1878-1901)\r\n<li> Leopold Mey
         6    250  Leopold Meyer von Schauensee
         6    302  Jules Repond - [[Fribourg (ka
         6    342  (1910-1921)\r\n<li> Alois Hirsc
         6    360  Alois Hirschbühl - [[Graubünd
         6    397  (1921-1935)\r\n<li> Georg von S
         6    415  Georg von Sury d’Aspremont -
         6    470  (1935-1942)\r\n<li> Heinrich Pf
         6    488  Heinrich Pfyffer von Altishof
         6    543  Robert Nünlist - LU (1957-197
         6    581  Franz Pfyffer von Altishofen
         6    633  [[Alois Estermann]] - LU (199
         6    653  - LU (1998-1998)\r\n<li> Pius S
         6    676  Pius Segmüller - [[Sankt Gall
         6    722  (1998-2002)\r\n<li> Elmar Theod
         6    740  Elmar Theodor M?der - SG (200
         6    783  [[Daniel Anrig|Daniel Rolf An
         6    818  - SG (2008-2015)\r\n<li> [[Chri
         6    841  [[Christoph Graf]] - LU (vana
         6    860  - LU (vanaf 2015)\r\n</ol>\r\n
         6    885 \r\n
         3     17 <li> Martin Pfyffer von Altish
         2      0 <ol start="18">\r\n<li> Martin P
         2    210 [[Wallis (kanton)|VS]] (1878-1
         2    318 [[Fribourg (kanton)|FR]] (1910
         2    380 [[Graubünden|GR]] (1921-1935)\r
         2    445 [[Solothurn (kanton)|SO]] (193
         2    634 [[Alois Estermann]] - LU (1998
         2    694 [[Sankt Gallen (kanton)|SG]] (
         2    784 [[Daniel Anrig|Daniel Rolf Anr
         2    842 [[Christoph Graf]] - LU (vanaf
         1     10 "18">\r\n<li> Martin Pfyffer von
         1     70 <li> Franz Xaver Leopold Meyer
         1    134 <li> Alfred von Sonnenberg - L
         1    179 <li> Louis-Martin de Courten -
         1    212 Wallis (kanton)|VS]] (1878-190
         1    228 VS]] (1878-1901)\r\n<li> Leopold
         1    246 <li> Leopold Meyer von Schauen
         1    298 <li> Jules Repond - [[Fribourg
         1    320 Fribourg (kanton)|FR]] (1910-1
         1    338 FR]] (1910-1921)\r\n<li> Alois H
         1    356 <li> Alois Hirschbühl - [[Grau
         1    382 Graubünden|GR]] (1921-1935)\r\n<
Parsed tree
.Wikitext            [<ol start="18">\r\n<li]
..Paragraph          [<ol start="18">\r\n<li]
...HtmlTag           [<ol start="18">\r\n<li]
....TagAttribute     [ start="18"]
.....Run             [start]
......PlainText      [start]
.....Wikitext        [18]
......Paragraph      [18]
.......PlainText     [18]
....Wikitext         [\r\n<li> Martin Pfyffe]
.....Paragraph       [\r\n<li> Martin Pfyffe]
......PlainText      [\r\n]
......HtmlTag        [<li> Martin Pfyffer ]
.......Wikitext      [ Martin Pfyffer von ]
........Paragraph    [ Martin Pfyffer von ]
.........PlainText   [ Martin Pfyffer von ]
......HtmlTag        [<li> Franz Xaver Leo]
.......Wikitext      [ Franz Xaver Leopold]
........Paragraph    [ Franz Xaver Leopold]
.........PlainText   [ Franz Xaver Leopold]
......HtmlTag        [<li> Alfred von Sonn]
.......Wikitext      [ Alfred von Sonnenbe]
........Paragraph    [ Alfred von Sonnenbe]
.........PlainText   [ Alfred von Sonnenbe]
......HtmlTag        [<li> Louis-Martin de]
.......Wikitext      [ Louis-Martin de Cou]
........Paragraph    [ Louis-Martin de Cou]
.........PlainText   [ Louis-Martin de Cou]
.........WikiLink    [[[Wallis (kanton)|VS]
..........Run        [Wallis (kanton)]
...........PlainText [Wallis (kanton)]
..........Run        [VS]
...........PlainText [VS]
.........PlainText   [ (1878-1901)\r\n]
......HtmlTag        [<li> Leopold Meyer v]
.......Wikitext      [ Leopold Meyer von S]
........Paragraph    [ Leopold Meyer von S]
.........PlainText   [ Leopold Meyer von S]
......HtmlTag        [<li> Jules Repond - ]
.......Wikitext      [ Jules Repond - [[Fr]
........Paragraph    [ Jules Repond - [[Fr]
.........PlainText   [ Jules Repond - ]
.........WikiLink    [[[Fribourg (kanton)|]
..........Run        [Fribourg (kanton)]
...........PlainText [Fribourg (kanton)]
..........Run        [FR]
...........PlainText [FR]
.........PlainText   [ (1910-1921)\r\n]
......HtmlTag        [<li> Alois Hirschbüh]
.......Wikitext      [ Alois Hirschbühl - ]
........Paragraph    [ Alois Hirschbühl - ]
.........PlainText   [ Alois Hirschbühl - ]
.........WikiLink    [[[Graubünden|GR]]]
..........Run        [Graubünden]
...........PlainText [Graubünden]
..........Run        [GR]
...........PlainText [GR]
.........PlainText   [ (1921-1935)\r\n]
......HtmlTag        [<li> Georg von Sury ]
.......Wikitext      [ Georg von Sury d’As]
........Paragraph    [ Georg von Sury d’As]
.........PlainText   [ Georg von Sury d’As]
.........WikiLink    [[[Solothurn (kanton)]
..........Run        [Solothurn (kanton)]
...........PlainText [Solothurn (kanton)]
..........Run        [SO]
...........PlainText [SO]
.........PlainText   [ (1935-1942)\r\n]
......HtmlTag        [<li> Heinrich Pfyffe]
.......Wikitext      [ Heinrich Pfyffer vo]
........Paragraph    [ Heinrich Pfyffer vo]
.........PlainText   [ Heinrich Pfyffer vo]
......HtmlTag        [<li> Robert Nünlist ]
.......Wikitext      [ Robert Nünlist - LU]
........Paragraph    [ Robert Nünlist - LU]
.........PlainText   [ Robert Nünlist - LU]
......HtmlTag        [<li> Franz Pfyffer v]
.......Wikitext      [ Franz Pfyffer von A]
........Paragraph    [ Franz Pfyffer von A]
.........PlainText   [ Franz Pfyffer von A]
......HtmlTag        [<li> [[Alois Esterma]
.......Wikitext      [ [[Alois Estermann]]]
........Paragraph    [ [[Alois Estermann]]]
.........PlainText   [ ]
.........WikiLink    [[[Alois Estermann]]]
..........Run        [Alois Estermann]
...........PlainText [Alois Estermann]
.........PlainText   [ - LU (1998-1998)\r\n]
......HtmlTag        [<li> Pius Segmüller ]
.......Wikitext      [ Pius Segmüller - [[]
........Paragraph    [ Pius Segmüller - [[]
.........PlainText   [ Pius Segmüller - ]
.........WikiLink    [[[Sankt Gallen (kant]
..........Run        [Sankt Gallen (kanton]
...........PlainText [Sankt Gallen (kanton]
..........Run        [SG]
...........PlainText [SG]
.........PlainText   [ (1998-2002)\r\n]
......HtmlTag        [<li> Elmar Theodor M]
.......Wikitext      [ Elmar Theodor M?der]
........Paragraph    [ Elmar Theodor M?der]
.........PlainText   [ Elmar Theodor M?der]
......HtmlTag        [<li> [[Daniel Anrig|]
.......Wikitext      [ [[Daniel Anrig|Dani]
........Paragraph    [ [[Daniel Anrig|Dani]
.........PlainText   [ ]
.........WikiLink    [[[Daniel Anrig|Danie]
..........Run        [Daniel Anrig]
...........PlainText [Daniel Anrig]
..........Run        [Daniel Rolf Anrig]
...........PlainText [Daniel Rolf Anrig]
.........PlainText   [ - SG (2008-2015)\r\n]
......HtmlTag        [<li> [[Christoph Gra]
.......Wikitext      [ [[Christoph Graf]] ]
........Paragraph    [ [[Christoph Graf]] ]
.........PlainText   [ ]
.........WikiLink    [[[Christoph Graf]]]
..........Run        [Christoph Graf]
...........PlainText [Christoph Graf]
.........PlainText   [ - LU (vanaf 2015)\r\n]
...PlainText         [\r\n]
RobSchoenaker commented 4 years ago

WOW! That's quite an improvement. Will test tonight.