earwig / mwparserfromhell

A Python parser for MediaWiki wikicode
https://mwparserfromhell.readthedocs.io/
MIT License
741 stars 74 forks source link

Bug: Table attributes are parsed as Text #285

Open Eccenux opened 2 years ago

Eccenux commented 2 years ago

There seem to be something weird going on when parsing Tables. I get attributes as Text nodes.

Let's consider this example:

import mwparserfromhell
text = """
{| class="wikitable"
|-
! style="width:101px" | heading1
! style="width:102px" | heading2
|- data-test="whatever"
| style="width:201px" | testing
|}
""".strip()
code = mwparserfromhell.parse(text)
# code.filter_tags(matches=lambda node: node.tag == "table")
# print (code.get_tree())

for node in code.filter():
    if isinstance(node, mwparserfromhell.nodes.tag.Tag):
        print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
    else:
        print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')

This how the text is rendered by MW:

<table class="wikitable">
    <tr>
        <th style="width:101px">heading1</th>
        <th style="width:102px">heading2</th>
    </tr>
    <tr data-test="whatever">
        <td style="width:201px">testing</td>
    </tr>
</table>

So there are 3 cells here. In HTML textContent would be heading1, heading2, testing.

But in MWPFH I get Text nodes for attributes. For example:

[Tag](td)
"""
 | style="width:201px" | testing

"""

[Text]
"""
 style
"""

[Text]
"""
 width:201px
"""

[Text]
"""
  testing

"""

I would expect those attributes wouldn't appear as Text nodes at all. They are already available node.attributes so that should be enough.

So when traversing above cell, it should only show:

[Tag](td)
"""
 | style="width:201px" | testing

"""

[Text]
"""
  testing

"""
lahwaacz commented 2 years ago

You are doing a recursive filter on the wikicode. Those text nodes are nested inside the tag nodes. The tree of your snippet looks correct:

>>> print(code.get_tree())
<
      table
      class
    = wikitable
>
      <
            tr
      >
            <
                  th
                  style
                = width:101px
            >
                   heading1\n
            </
                  th
            >
            <
                  th
                  style
                = width:102px
            >
                   heading2\n
            </
                  th
            >
      </
            tr
      >
      <
            tr
            data-test
          = whatever
      >
            <
                  td
                  style
                = width:201px
            >
                   testing\n
            </
                  td
            >
      </
            tr
      >
</
      table
>

The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.

Eccenux commented 2 years ago

The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.

I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.

lahwaacz commented 2 years ago

I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.

Then you need to adjust your recursive iteration. For example:

for node in code.filter():
    parent = code.get_parent(node)
    if isinstance(parent, mwparserfromhell.nodes.tag.Tag) and not parent.contents.contains(node):
        print(f"parent of '{node}' is a tag and the node is not in the tag's contents, skipped")
        continue
    if isinstance(node, mwparserfromhell.nodes.tag.Tag):
        print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
    else:
        print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')
earwig commented 2 years ago

@lahwaacz is right, but maybe there's room here for improvement. We might want to distinguish between Text nodes that are "visible"/rendered and Text nodes that are purely internal like template parameter names and tag attributes.

It's basically the problem being solved by strip_code but generalized for other use cases, to allow a way to filter VisibleText nodes or something.

lahwaacz commented 2 years ago

@earwig, :+1: because the loop like

for node in code.filter():
    parent = code.get_parent(node)

that is needed for this is not efficient as it basically iterates again for each node, c.f. #195.