Open Eccenux opened 2 years ago
You are doing a recursive filter on the wikicode. Those text nodes are nested inside the tag nodes. The tree of your snippet looks correct:
>>> print(code.get_tree())
<
table
class
= wikitable
>
<
tr
>
<
th
style
= width:101px
>
heading1\n
</
th
>
<
th
style
= width:102px
>
heading2\n
</
th
>
</
tr
>
<
tr
data-test
= whatever
>
<
td
style
= width:201px
>
testing\n
</
td
>
</
tr
>
</
table
>
The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.
The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.
I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.
I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.
Then you need to adjust your recursive iteration. For example:
for node in code.filter():
parent = code.get_parent(node)
if isinstance(parent, mwparserfromhell.nodes.tag.Tag) and not parent.contents.contains(node):
print(f"parent of '{node}' is a tag and the node is not in the tag's contents, skipped")
continue
if isinstance(node, mwparserfromhell.nodes.tag.Tag):
print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
else:
print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')
@lahwaacz is right, but maybe there's room here for improvement. We might want to distinguish between Text nodes that are "visible"/rendered and Text nodes that are purely internal like template parameter names and tag attributes.
It's basically the problem being solved by strip_code
but generalized for other use cases, to allow a way to filter VisibleText nodes or something.
@earwig, :+1: because the loop like
for node in code.filter():
parent = code.get_parent(node)
that is needed for this is not efficient as it basically iterates again for each node, c.f. #195.
There seem to be something weird going on when parsing Tables. I get attributes as Text nodes.
Let's consider this example:
This how the text is rendered by MW:
So there are 3 cells here. In HTML textContent would be heading1, heading2, testing.
But in MWPFH I get Text nodes for attributes. For example:
I would expect those attributes wouldn't appear as Text nodes at all. They are already available
node.attributes
so that should be enough.So when traversing above cell, it should only show: