5j9 / wikitextparser

A Python library to parse MediaWiki WikiText
GNU General Public License v3.0

Error when calling plain_text() missing "end" tag #137

Closed MariusArhaug closed 4 months ago

MariusArhaug commented 4 months ago

When I take the string inside a <text> element of a Wikimedia dump, parse it, and call plain_text() on the result, the following error is displayed:

Error parsing text: 'NoneType' object has no attribute 'end'

Example

import wikitextparser as wtp

text = """Text: ''[https://<!---->{{#switch:{{{3|{{{type|movie}}}}}}<!-- the parameter"type"is"movie"by default -->|movie=movie.douban.com/subject/{{{1|{{{id|{{#if:{{#property:P4529}}|{{#property:P4529}}|}}}}}}}}|book=book.douban.com/subject/{{{1|{{{id|}}}}}}|music=music.douban.com/subject/{{{1|{{{id|}}}}}}|www.douban.com/{{{3|{{{type|}}}}}}/{{{1|{{{id|}}}}}}<!-- default -->}}/<!---->{{#if:{{{2|{{{title|}}}}}}|{{{2|{{{title|}}}}}}|{{PAGENAMEBASE}}}}]''<!-- the parameter"title"is the current Wikipedia page's title by default-->at [[Douban]] {{in lang|zh}}<includeonly>{{#switch:{{{3|{{{type|movie}}}}}}|movie={{EditAtWikidata|pid=P4529|{{{1|{{{id|}}}}}}}}{{#if:{{{1|{{{id|}}}}}}{{#property:P4529}}||{{main other|[[Category:Douban template with no id set]]}}}}|}}</includeonly><noinclude>{{Documentation}}</noinclude>"""

parsed = wtp.parse(text)
plain_text = parsed.plain_text()

Error parsing text: 'NoneType' object has no attribute 'end'

Is this particular dump entry simply malformed, or is this an edge case in the parser?
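For anyone hitting the same error while processing a whole dump, here is a minimal sketch of a wrapper that catches the failure and hands back the exception, so the offending page can be logged instead of stopping the run. The names `safe_plain_text` and `parse` are hypothetical and not part of wikitextparser. The sketch only assumes that `parse` behaves like `wtp.parse`, i.e. returns an object with a `plain_text()` method.

```python
from typing import Callable, Optional, Tuple


def safe_plain_text(
    text: str, parse: Callable
) -> Tuple[Optional[str], Optional[Exception]]:
    """Return (plain_text, None) on success, or (None, error) on failure.

    Hypothetical helper, not part of wikitextparser. `parse` is assumed to
    behave like wikitextparser.parse.
    """
    try:
        return parse(text).plain_text(), None
    except Exception as exc:
        # Covers errors such as:
        # AttributeError: 'NoneType' object has no attribute 'end'
        return None, exc
```

It would be called as `plain, err = safe_plain_text(text, wtp.parse)`. When `err` is not None, the original `text` can be saved for a bug report.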

Wikimedia dump

<text bytes="821" xml:space="preserve">
        ''[https://&lt;!--
          --&gt;{{#switch:{{{3|{{{type|movie}}}}}}&lt;!-- the parameter &quot;type&quot; is &quot;movie&quot; by default --&gt;
          |movie=movie.douban.com/subject/{{{1|{{{id|{{#if:{{#property:P4529}}|{{#property:P4529}}|}}}}}}}}
          |book=book.douban.com/subject/{{{1|{{{id|}}}}}}
          |music=music.douban.com/subject/{{{1|{{{id|}}}}}}
          |www.douban.com/{{{3|{{{type|}}}}}}/{{{1|{{{id|}}}}}}&lt;!-- default --&gt;
          }}/&lt;!--
          --&gt; {{#if:{{{2|{{{title|}}}}}}|{{{2|{{{title|}}}}}}|{{PAGENAMEBASE}}}}]''&lt;!-- the parameter &quot;title&quot; is the current Wikipedia page's title by default
          --&gt; at [[Douban]] {{in lang|zh}}&lt;includeonly&gt;{{#switch:{{{3|{{{type|movie}}}}}}|movie={{EditAtWikidata|pid=P4529|{{{1|{{{id|}}}}}}}}{{#if:{{{1|{{{id|}}}}}}{{#property:P4529}}||{{main other|[[Category:Douban template with no id set]]}}}}|}}&lt;/includeonly&gt;&lt;noinclude&gt;{{Documentation}}&lt;/noinclude&gt;
</text>
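The dump stores the wikitext XML-escaped (`&lt;`, `&gt;`, `&quot;`), so it has to be unescaped before being handed to the parser. Below is a rough sketch of one way to do that with the standard library. The function name `wikitext_from_text_element` is made up for illustration, and it assumes the fragment is a single well-formed <text> element like the one above.

```python
import xml.etree.ElementTree as ET


def wikitext_from_text_element(xml_fragment: str) -> str:
    """Return the unescaped content of a single <text> element.

    Hypothetical helper: ElementTree decodes entities such as &lt; and
    &quot; while parsing, so the returned string is raw wikitext.
    """
    element = ET.fromstring(xml_fragment)
    return element.text or ""


# Shortened stand-in for a dump entry, not the full entry shown above.
sample = '<text bytes="20" xml:space="preserve">[https://&lt;!-- c --&gt;x] &quot;q&quot;</text>'
print(wikitext_from_text_element(sample))  # [https://<!-- c -->x] "q"
```

Whitespace and newlines inside the element are returned unchanged, so the result may differ from a hand-copied one-line version of the same text.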