chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0

DOM Tree Manipulation and DOMNode #41

Closed. rosinality closed this issue 7 months ago

rosinality commented 7 months ago

Hello, thank you for the wonderful project!

I have a question about DOM manipulation and DOMNode. The documentation warns against using existing DOMNode instances after DOM tree manipulation:

Warning

A DOMNode object is valid only for as long as its parent tree has not been modified or deallocated. Thus, DO NOT use existing instances after any sort of DOM tree manipulation! Doing so may result in Python crashes or (worse) security vulnerabilities due to dangling pointers (use after free). This is a known Lexbor limitation for which there is no workaround at the moment.

I am currently working on an HTML extractor that involves many DOM manipulations and DOMNode accesses, for example:

sibling = next_sibling.next
p.append_child(next_sibling)
next_sibling = sibling

I think that if I have to look up each DOMNode again after every DOM manipulation, some kinds of work will become hard to do. Are there concrete examples of manipulations or accesses that are safe, or specific cases where accessing a node after a manipulation will cause an error or a segfault? Thank you!

phoerious commented 7 months ago

Appending children should be safe, but anything that would destroy nodes is not. Do not use references to existing nodes if any of their parent nodes has been deleted (either explicitly or implicitly by setting inner/outerHTML).
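The sibling-moving loop from the question can be tried out with Python's standard-library xml.dom.minidom as a stand-in. This is only a sketch of the order of operations (read the next sibling first, then move the node). minidom is a pure-Python DOM with no dangling-pointer hazard, its camelCase names (nextSibling, appendChild) are minidom's rather than resiliparse's, and the sample markup is made up for illustration.

```python
from xml.dom.minidom import parseString

# Made-up sample markup: move every sibling that follows <p> into <p>.
doc = parseString("<div><p>A</p><b>B</b><i>C</i></div>")
div = doc.documentElement
p = div.firstChild

next_sibling = p.nextSibling
while next_sibling is not None:
    # Read the following sibling BEFORE moving the node; after the move,
    # the node's sibling pointer refers to its new position inside <p>.
    sibling = next_sibling.nextSibling
    p.appendChild(next_sibling)  # moves the node, nothing is destroyed
    next_sibling = sibling

result = div.toxml()
print(result)
```

No node is destroyed in this loop; nodes are only relinked, which is the kind of manipulation described above as safe.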

rosinality commented 7 months ago

Thank you! If I only need to avoid accessing a child element after its parent has been removed, then I think this may not be much of a problem for these kinds of DOM manipulations.

phoerious commented 7 months ago

Be aware that the deallocation of any node affects not only its immediate descendants but the whole subtree. remove_node() is fine as long as you keep the reference around and insert it back into the tree, but once you lose that, the entire subtree is gone. Same goes for explicit deletion or setting of innerHTML, innerText, outerHTML, outerText as mentioned before.
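As a rough illustration of the "detach, keep the reference, re-insert" pattern, here is a sketch using the standard-library xml.dom.minidom instead of resiliparse. minidom's removeChild and appendChild are stand-ins for the calls discussed here, the markup is invented, and minidom never frees nodes behind your back, so this shows only the shape of the pattern, not the native-memory behaviour.

```python
from xml.dom.minidom import parseString

# Invented markup: <a> carries a small subtree that should survive the move.
doc = parseString("<root><a><b>text</b></a><target/></root>")
root = doc.documentElement
a = root.firstChild
target = root.lastChild

# Detach the node but hold on to the returned reference ...
detached = root.removeChild(a)

# ... and put it back into the tree while the reference is still held.
target.appendChild(detached)

result = root.toxml()
print(result)
```

The whole subtree under the detached node travels with it, which is why losing the only reference to a detached node would take the entire subtree along.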

rosinality commented 7 months ago

Thank you for the additional information! Regarding this problem, I commonly replace tags (for example, 'font' tags with 'span' tags) like this:

<font>
  A
  <font>
    B
    <font>
    C
    </font>
  </font>
  D
</font>
<font>
  1
  <font>
  2
  </font>
  3
</font>

def replace_node_tags(doc, nodes, new_tag):
    for node in nodes:
        new_node = doc.create_element(new_tag)
        for child in node.child_nodes:
            new_node.append_child(child)
        node.parent.replace_child(new_node, node)

replace_node_tags(doc, doc.document.get_elements_by_tag_name('font'), 'span')

In this case, on each iteration of for node in nodes, the children of node are appended to a newly created node, which then replaces the old node. Because the tags are nested, during the loop child nodes are either moved into newly created nodes (the parent font node of a child font node has already been replaced by a span node) or replaced by newly created nodes themselves (a child font node is replaced by a new span node). Would this be okay? Thank you!
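For anyone who wants to step through the nested case, the same loop can be approximated with the standard-library xml.dom.minidom. This is a sketch under assumptions: minidom stands in for resiliparse, its method names differ (createElement, replaceChild, getElementsByTagName), its childNodes list is live and is therefore copied before moving children, and the markup is a whitespace-free version of the sample above wrapped in an assumed div root.

```python
from xml.dom.minidom import parseString


def replace_node_tags(doc, nodes, new_tag):
    """Replace each node in `nodes` with a `new_tag` element holding its children."""
    for node in nodes:
        new_node = doc.createElement(new_tag)
        # Copy the live child list first, since moving children mutates it.
        for child in list(node.childNodes):
            new_node.appendChild(child)
        # By now an outer <font> may already have become a <span>, so the
        # current parent is looked up at replacement time.
        node.parentNode.replaceChild(new_node, node)


doc = parseString(
    "<div><font>A<font>B<font>C</font></font>D</font>"
    "<font>1<font>2</font>3</font></div>"
)
replace_node_tags(doc, doc.getElementsByTagName("font"), "span")

result = doc.documentElement.toxml()
print(result)
```

The element list is collected once, in document order, before the loop starts, so outer elements are replaced before the elements nested inside them.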

phoerious commented 7 months ago

Looks ok. You cannot use node or any of the elements in nodes afterwards, though in this case it shouldn't even be a problem. replace_child() should properly invalidate the reference. If you try accessing it, you should just get an error that the node is invalid. You would only run into trouble if you had obtained another independent reference to the same node beforehand, which wouldn't be invalidated automatically.
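The difference between the invalidated reference and an independent one can be pictured with a small toy model. Everything in it (NativeNode, NodeRef, this replace_child function) is hypothetical and written only to mirror the description above; it is not resiliparse's actual implementation.

```python
class NativeNode:
    """Hypothetical stand-in for a node owned by the native parser."""

    def __init__(self, tag):
        self.tag = tag
        self.freed = False


class NodeRef:
    """Hypothetical Python-side reference to a native node."""

    def __init__(self, native):
        self._native = native
        self._valid = True

    @property
    def tag(self):
        if not self._valid:
            raise ValueError("node is invalid")
        if self._native.freed:
            # The toy can notice this; a real binding would read freed memory.
            return "<dangling>"
        return self._native.tag


def replace_child(old_ref):
    """Toy replacement: free the native node, invalidate only `old_ref`."""
    old_ref._native.freed = True
    old_ref._valid = False


native = NativeNode("font")
ref_a = NodeRef(native)  # the reference handed to replace_child
ref_b = NodeRef(native)  # an independent reference to the same node

replace_child(ref_a)

try:
    ref_a.tag
    outcome_a = "no error"
except ValueError as exc:
    outcome_a = str(exc)

outcome_b = ref_b.tag  # not invalidated, silently points at a freed node
print(outcome_a, outcome_b)
```

In this model the reference passed to the call fails cleanly, while the second reference has no way of knowing that the node behind it is gone.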

rosinality commented 7 months ago

Thank you very much!