CXuesong / MwParserFromScratch

A basic .NET Library for parsing wikitext into AST.
Apache License 2.0
18 stars 5 forks source link

Infinite loop when parsing `Wikipedia:Paramount Pictures` #3

Closed CXuesong closed 7 years ago

CXuesong commented 7 years ago

Paramount Pictures Corporation.txt

CXuesong commented 7 years ago

Maybe the loop is not infinite. Rather, there might be too many fallbacks.

The leading part of the content is

{{pp-move-vandalism|small=yes}}
{{pp-semi-sock|small=yes|expiry=January 8, 2017}}
{{Use mdy dates|date=December 2014}}
{{Infobox company
|name = Paramount Pictures Corporation
|logo = [[File:Paramount Pictures logo (2013).jpg|250px]]
|type = [[Subsidiary]]
|area_served = Worldwide
|key_people =[[Brad Grey]]<br /><small>(Chairman and CEO)
|industry = Film
|products = Motion pictures
|services =
|revenue = {{decrease}} $2.885 billion (FY 2015)<ref name="earnings sheet">http://files.shareholder.com/downloads/VIA-B/1490803206x0x861120/397CD610-0A7A-4E1C-BCB0-E1840B6AE9AB/Viacom_Q4_15_Earnings_Release.pdf VIACOM REPORTS FOURTH QUARTER AND FULL YEAR FINANCIAL RESULTS, p.3.</ref>
|operating_income = {{decrease}} $111 million (FY 2015)<ref name="earnings sheet"/>
|net_income =
|aum =
|assets =
|equity =
|owner = [[National Amusements]] 
|num_employees =
|parent = [[Viacom]] 
|divisions = Current:'''<br>[[Paramount Home Media Distribution]]<br>[[Insurge Pictures]]<br>[[Paramount Famous Productions]]<br>[[Paramount Vantage]]<br>[[Paramount Animation]]<br>[[Paramount Television]]<br>[[MTV Films]]<br>[[Nickelodeon Movies]]<br>[[Comedy Central Films]]<br>[[United International Pictures]] (50%)<br>'''Former:'''<br>[[Paramount Parks]]<br>[[

]]
|subsid =[[Republic Pictures|Melange Pictures, LLC]]
|homepage = {{URL|www.paramount.com}}
|footnotes =
|intl = 
|caption = Logo used as of 2012
|foundation = {{start date and age|1912|5|8}} (as [[Famous Players Film Company]])
|founders = [[W. W. Hodkinson]]<br />[[Adolph Zukor]]<br />[[Jesse L. Lasky]]
|location_city = [[Hollywood, California]]
|location_country = <br />United States
|locations =
}}

'''Paramount Pictures Corporation''' (commonly known as '''Paramount Studios''' or simply '''Paramount''', and formerly known as '''Famous Players-Lasky Corporation''') is an American [[film studio]], [[Production company|television production company]] and [[Film distribution|motion picture distributor]], consistently ranked as one of the [[Major film studio|"Big Six" film studios]] of [[Hollywood]]. It is a subsidiary of U.S. [[media conglomerate]] [[Viacom]]. Paramount is a member of the [[Motion Picture Association of America]] (MPAA).<ref>{{cite web |title=Motion Picture Association of America – About Us |url=http://www.mpaa.org/about |publisher=MPAA |accessdate=May 27, 2012}}</ref>

In 2014, Paramount Pictures became the first major Hollywood studio to distribute all of its films in digital-form only.<ref>{{cite web |url=http://www.engadget.com/2014/01/19/paramount-all-digital-movie-distribution/ |title=Paramount now releases movies only in digital form |first=Jon|last=Fingas |date=January 19, 2014}}</ref>

Paramount is the fifth oldest surviving film studio in the world,<ref name="Abel10">{{Cite book|author = Richard Abel|title= The Ciné Goes to Town: French Cinema, 1896–1914|publisher= University of California Press|date= 1994|page= 10 |ISBN= 0-520-07936-1}}</ref> the second oldest surviving film studio in the United States, and the sole member of the "Big Six" major film studios which is still located in the [[Los Angeles]] neighborhood of Hollywood.

which used 3 sec to parse on my PC.

CXuesong commented 7 years ago

Okay now I've shrinked the suspectable region to

{{Infobox company
|divisions = Current:'''<br>[[Paramount Home Media Distribution]]<br>[[Insurge Pictures]]<br>[[Paramount Famous Productions]]<br>[[Paramount Vantage]]<br>[[Paramount Animation]]<br>[[Paramount Television]]<br>[[MTV Films]]<br>[[Nickelodeon Movies]]<br>[[Comedy Central Films]]<br>[[United International Pictures]] (50%)<br>'''Former:'''<br>[[Paramount Parks]]<br>[[

]]
|subsid =[[Republic Pictures|Melange Pictures, LLC]]
}}

With the not-long-ago-made tracer, I have

Parsed 450 characters in 1.1381805 sec.
Regex Matching
    Count  Elpsd ms   Average ms Expression
         1     0.0003     0.0003 \n|\[\[|\]\]
         2     0.0004     0.0002 =
     50182     9.5917     0.0001 \}\}|\|
     14335     2.2704     0.0001 \||\n|\[\[|\]\]
    408589    36.4185          0 \n
    308226    25.8915          0 </(br)(\s*)>

Fallbacks: 295974
    Count  Position Text
     65536    383 [\r\n\r\n]]\r\n|subsid =[[Republic
     49153    388 ]]\r\n|subsid =[[Republic Pictur
     32768    384 \r\n\r\n]]\r\n|subsid =[[Republic P
     28672    382 [[\r\n\r\n]]\r\n|subsid =[[Republic
     16384    378 <br>[[\r\n\r\n]]\r\n|subsid =[[Repu
     14336    359 [[Paramount Parks]]<br>[[\r\n\r\n]
     11264    342 '''Former:'''<br>[[Paramount P
     10240    345 Former:'''<br>[[Paramount Park
     10240    352 '''<br>[[Paramount Parks]]<br>
      8192    355 <br>[[Paramount Parks]]<br>[[\r
      8192    361 Paramount Parks]]<br>[[\r\n\r\n]]\r
      8192    385 \n\r\n]]\r\n|subsid =[[Republic Pic
      8191    392 |subsid =[[Republic Pictures|M
      5120    332  (50%)<br>'''Former:'''<br>[[P
      4096    338 <br>'''Former:'''<br>[[Paramou
      3584    299 [[United International Picture
      2048    295 <br>[[United International Pic
      2048    301 United International Pictures]
      1792    271 [[Comedy Central Films]]<br>[[
      1024    267 <br>[[Comedy Central Films]]<b
      1024    273 Comedy Central Films]]<br>[[Un
       896    245 [[Nickelodeon Movies]]<br>[[Co
       512    241 <br>[[Nickelodeon Movies]]<br>
       512    247 Nickelodeon Movies]]<br>[[Come
       448    228 [[MTV Films]]<br>[[Nickelodeon
       256    224 <br>[[MTV Films]]<br>[[Nickelo
       256    230 MTV Films]]<br>[[Nickelodeon M
       224    200 [[Paramount Television]]<br>[[
       128    196 <br>[[Paramount Television]]<b
       128    202 Paramount Television]]<br>[[MT
       112    173 [[Paramount Animation]]<br>[[P
        64    169 <br>[[Paramount Animation]]<br
        64    175 Paramount Animation]]<br>[[Par
        56    148 [[Paramount Vantage]]<br>[[Par
        32    144 <br>[[Paramount Vantage]]<br>[
        32    150 Paramount Vantage]]<br>[[Param
        28    112 [[Paramount Famous Productions
        16    108 <br>[[Paramount Famous Product
        16    114 Paramount Famous Productions]]
        14     88 [[Insurge Pictures]]<br>[[Para
         8     84 <br>[[Insurge Pictures]]<br>[[
         8     90 Insurge Pictures]]<br>[[Paramo
         7     47 [[Paramount Home Media Distrib
         6     20 divisions = Current:'''<br>[[P
         6    386 \r\n]]\r\n|subsid =[[Republic Pict
         6    393 subsid =[[Republic Pictures|Me
         5     32 Current:'''<br>[[Paramount Hom
         5     40 '''<br>[[Paramount Home Media
         5    444 \r\n}}\r\n
         5    448 \r\n

Obviously the parser got stuck at offset 383 . There're too many fallbacks.

CXuesong commented 7 years ago

The fallbacks are mostly caused by <br>. The parser recognizes it as an opening tag, and parses ahead. But it fails to find a closing tag </br>, and falls back, leaving <br> as plain text. Consider the following case:

a<br>b<br>c<br>d<br>e<br>

The last <br> will be parsed over and over again, because the first <br> will be tried as an opening tag first, but failed, so is the next several <br>s. Note that <br /> will not have this problem, because it's self-closing.

CXuesong commented 7 years ago

So now the parser knows that <br>, <hr>, and <wbr> should be treated with special care. They're specified in WikitextParserOptions.SelfClosingOnlyTags. Parser will not look for the closing tag for these tags.

Plus, the parser will do a simple look-ahead before actually start parsing from an HTML tag. This will reduce the chances of falling back.

Now with the same test case, we have

Parsed 450 characters in 0.0621298 sec.
Regex Matching
    Count  Elpsd ms   Average ms Expression
        26     0.1791     0.0068 \}\}|\|
         1     0.0008     0.0008 \n|\[\[|\]\]
        51     0.0353     0.0006 \n
        13     0.0083     0.0006 \||\n|\[\[|\]\]
         2     0.0008     0.0004 =

Fallbacks: 177
    Count  Position Text
         8    383 [\r\n\r\n]]\r\n|subsid =[[Republic P
         7    388 ]]\r\n|subsid =[[Republic Pictur
         6     20 divisions = Current:'''<br>[[P
         6    386 \r\n]]\r\n|subsid =[[Republic Pict
         6    393 subsid =[[Republic Pictures|Me
         5     32 Current:'''<br>[[Paramount Hom
         5     40 '''<br>[[Paramount Home Media
         5    332  (50%)<br>'''Former:'''<br>[[P
         5    342 '''Former:'''<br>[[Paramount P
         5    345 Former:'''<br>[[Paramount Park
         5    352 '''<br>[[Paramount Parks]]<br>
         5    444 \r\n}}\r\n
         5    448 \r\n
         4    384 \r\n\r\n]]\r\n|subsid =[[Republic Pi
         4    401 [[Republic Pictures|Melange Pi
         3     47 [[Paramount Home Media Distrib
         3     88 [[Insurge Pictures]]<br>[[Para
         3    112 [[Paramount Famous Productions
         3    148 [[Paramount Vantage]]<br>[[Par
         3    173 [[Paramount Animation]]<br>[[P
         3    200 [[Paramount Television]]<br>[[
         3    228 [[MTV Films]]<br>[[Nickelodeon
         3    245 [[Nickelodeon Movies]]<br>[[Co
         3    271 [[Comedy Central Films]]<br>[[
         3    299 [[United International Picture
         3    359 [[Paramount Parks]]<br>[[\r\n\r\n]
         3    382 [[\r\n\r\n]]\r\n|subsid =[[Republic
         2      0 {{Infobox company\r\n|divisions
         2      2 Infobox company\r\n|divisions =
         2     18 \n|divisions = Current:'''<br>[
         2     43 <br>[[Paramount Home Media Dis
         2     49 Paramount Home Media Distribut
         2     84 <br>[[Insurge Pictures]]<br>[[
         2     90 Insurge Pictures]]<br>[[Paramo
         2    108 <br>[[Paramount Famous Product
         2    114 Paramount Famous Productions]]
         2    144 <br>[[Paramount Vantage]]<br>[
         2    150 Paramount Vantage]]<br>[[Param
         2    169 <br>[[Paramount Animation]]<br
         2    175 Paramount Animation]]<br>[[Par
         2    196 <br>[[Paramount Television]]<b
         2    202 Paramount Television]]<br>[[MT
         2    224 <br>[[MTV Films]]<br>[[Nickelo
         2    230 MTV Films]]<br>[[Nickelodeon M
         2    241 <br>[[Nickelodeon Movies]]<br>
         2    247 Nickelodeon Movies]]<br>[[Come
         2    267 <br>[[Comedy Central Films]]<b
         2    273 Comedy Central Films]]<br>[[Un
         2    295 <br>[[United International Pic
         2    301 United International Pictures]
Parsed tree
.Wikitext            [{{Infobox company\r\n|]
..Paragraph          [{{Infobox company\r\n|]
...Template          [{{Infobox company\r\n|]
....Run              [Infobox company\r\n]
.....PlainText       [Infobox company\r\n]
....TemplateArgument [divisions = Current:]
.....Wikitext        [divisions ]
......Paragraph      [divisions ]
.......PlainText     [divisions ]
.....Wikitext        [ Current:'''<br>[[Pa]
......ListItem       [ Current:'''<br>[[Pa]
.......PlainText     [Current:]
.......FormatSwitch  [''']
.......HtmlTag       [<br>]
.......WikiLink      [[[Paramount Home Med]
........Run          [Paramount Home Media]
.........PlainText   [Paramount Home Media]
.......HtmlTag       [<br>]
.......WikiLink      [[[Insurge Pictures]]]
........Run          [Insurge Pictures]
.........PlainText   [Insurge Pictures]
.......HtmlTag       [<br>]
.......WikiLink      [[[Paramount Famous P]
........Run          [Paramount Famous Pro]
.........PlainText   [Paramount Famous Pro]
.......HtmlTag       [<br>]
.......WikiLink      [[[Paramount Vantage]]
........Run          [Paramount Vantage]
.........PlainText   [Paramount Vantage]
.......HtmlTag       [<br>]
.......WikiLink      [[[Paramount Animatio]
........Run          [Paramount Animation]
.........PlainText   [Paramount Animation]
.......HtmlTag       [<br>]
.......WikiLink      [[[Paramount Televisi]
........Run          [Paramount Television]
.........PlainText   [Paramount Television]
.......HtmlTag       [<br>]
.......WikiLink      [[[MTV Films]]]
........Run          [MTV Films]
.........PlainText   [MTV Films]
.......HtmlTag       [<br>]
.......WikiLink      [[[Nickelodeon Movies]
........Run          [Nickelodeon Movies]
.........PlainText   [Nickelodeon Movies]
.......HtmlTag       [<br>]
.......WikiLink      [[[Comedy Central Fil]
........Run          [Comedy Central Films]
.........PlainText   [Comedy Central Films]
.......HtmlTag       [<br>]
.......WikiLink      [[[United Internation]
........Run          [United International]
.........PlainText   [United International]
.......PlainText     [ (50%)]
.......HtmlTag       [<br>]
.......FormatSwitch  [''']
.......PlainText     [Former:]
.......FormatSwitch  [''']
.......HtmlTag       [<br>]
.......WikiLink      [[[Paramount Parks]]]
........Run          [Paramount Parks]
.........PlainText   [Paramount Parks]
.......HtmlTag       [<br>]
.......PlainText     [[[\r]
......Paragraph      [\r\n]]\r\n]
.......PlainText     [\r\n]]\r\n]
....TemplateArgument [subsid =[[Republic P]
.....Wikitext        [subsid ]
......Paragraph      [subsid ]
.......PlainText     [subsid ]
.....Wikitext        [[[Republic Pictures|]
......Paragraph      [[[Republic Pictures|]
.......WikiLink      [[[Republic Pictures|]
........Run          [Republic Pictures]
.........PlainText   [Republic Pictures]
........Run          [Melange Pictures, LL]
.........PlainText   [Melange Pictures, LL]
.......PlainText     [\r\n]
...PlainText         [\r\n]