validate_xml doesn't handle HTML entities

sbarber2 commented 2 months ago

The XMLParser behind Python's xml.etree.ElementTree.fromstring() apparently has no (easy?) way to handle predefined XHTML/HTML entities (such as nbsp).

It would be nice if we could use such entities in the XHTML templates.

May or may not be worth the effort, but noting the issue.

The code in question is validate_xml() in tests/app_test.py.

Example is in templates/tos.html -- there's an img tag whose alt string wanted a couple nbsp characters. See comment in file.

Some possible solutions are to define our own XMLParser subclass (work!) or to consider switching to lxml or BeautifulSoup.
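For reference, a minimal reproduction of the behavior described in this report, assuming a stock CPython xml.etree.ElementTree. The sample markup and the helper name try_parse are made up for illustration and are not taken from templates/tos.html or tests/app_test.py:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not the real template.
SAMPLE = '<p><img src="x.png" alt="Terms&nbsp;of&nbsp;Service"/></p>'

def try_parse(text):
    """Return (True, root) if text parses as XML, else (False, error message)."""
    try:
        return True, ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)

# A named HTML entity such as nbsp is rejected.
ok, detail = try_parse(SAMPLE)
print(ok, detail)

# The XML-predefined entities and numeric character references are accepted.
ok2, root = try_parse('<p>&amp; &lt; &#160;</p>')
print(ok2, repr(root.text))
```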

simsong commented 2 months ago

Here is what ChatGPT recommends:

import xml.etree.ElementTree as ET
from xml.parsers import expat

# Define a dictionary for your entities
entities = {
    'nbsp': ' ',
    'lt': '<',
    'gt': '>',
    'amp': '&',
    'quot': '"',
    'apos': "'",
    # Add more entities as needed
}

# Create a custom XML parser with the entity resolver
def create_parser():
    parser = expat.ParserCreate()

    # Register the entity handler
    def entity_handler(entity_name):
        return entities.get(entity_name, '')

    parser.EntityDeclHandler = entity_handler
    return parser

# Parse the string using the custom parser
def parse_xhtml(xhtml_string):
    parser = create_parser()
    tree = ET.ElementTree(ET.XML(xhtml_string, parser=parser))
    return tree

# Example XHTML string with &nbsp;
xhtml_string = "<div>&nbsp;Hello, World!&nbsp;</div>"

# Parse the XHTML string
tree = parse_xhtml(xhtml_string)

# Print the result
print(ET.tostring(tree.getroot(), encoding='unicode'))

sbarber2 commented 2 months ago

From @simsong in #507, copying here as this is where we would do this work: the ChatGPT recommendation shown in the previous comment.

sbarber2 commented 2 months ago

Here is ChatGPT's response to the prompt below. I have not checked the response for correctness or completeness:

Generate a python dictionary whose keys are all the names of all the predefined HTML character entities and whose values are their translations as strings.

Certainly! Here's the Python dictionary with the HTML entity names as keys and their corresponding characters (translations) as values:

html_entity_names_to_characters = {
    "amp": "&",        # &
    "lt": "<",         # <
    "gt": ">",         # >
    "quot": '"',       # "
    "apos": "'",       # '
    "nbsp": "\u00A0",  # non-breaking space
    "copy": "\u00A9",  # ©
    "reg": "\u00AE",   # ®
    "euro": "\u20AC",  # €
    "cent": "\u00A2",  # ¢
    "pound": "\u00A3", # £
    "yen": "\u00A5",   # ¥
    "dollar": "$",     # $
    "frac14": "\u00BC",# ¼
    "frac12": "\u00BD",# ½
    "frac34": "\u00BE",# ¾
    "times": "\u00D7", # ×
    "divide": "\u00F7",# ÷
    "laquo": "\u00AB",  # «
    "raquo": "\u00BB",  # »
    "micro": "\u00B5",  # µ
    "para": "\u00B6",   # ¶
    "sect": "\u00A7",   # §
    "deg": "\u00B0",    # °
    "plusmn": "\u00B1", # ±
    "sup1": "\u00B9",   # ¹
    "sup2": "\u00B2",   # ²
    "sup3": "\u00B3",   # ³
    "acute": "\u00B4",  # ´
    "cedil": "\u00B8",  # ¸
    "ordf": "\u00AA",   # ª
    "ordm": "\u00BA",   # º
    "alpha": "\u03B1",  # α
    "beta": "\u03B2",   # β
    "gamma": "\u03B3",  # γ
    "delta": "\u03B4",  # δ
    "epsilon": "\u03B5",# ε
    "zeta": "\u03B6",   # ζ
    "eta": "\u03B7",    # η
    "theta": "\u03B8",  # θ
    "iota": "\u03B9",   # ι
    "kappa": "\u03BA",  # κ
    "lambda": "\u03BB", # λ
    "mu": "\u03BC",     # μ
    "nu": "\u03BD",     # ν
    "xi": "\u03BE",     # ξ
    "omicron": "\u03BF",# ο
    "pi": "\u03C0",     # π
    "rho": "\u03C1",    # ρ
    "sigmaf": "\u03C2", # ς
    "sigma": "\u03C3",  # σ
    "tau": "\u03C4",    # τ
    "upsilon": "\u03C5",# υ
    "phi": "\u03C6",    # φ
    "chi": "\u03C7",    # χ
    "psi": "\u03C8",    # ψ
    "omega": "\u03C9",  # ω
    "thetasym": "\u03D1",# ϑ
    "upsih": "\u03D2",  # ϒ
    "piv": "\u03D6",    # ϖ
    "bull": "\u2022",   # •
    "hellip": "\u2026", # …
    "prime": "\u2032",  # ′
    "Prime": "\u2033",  # ″
    "oline": "\u203E",  # ‾
    "frasl": "\u2044",  # ⁄
    "weierp": "\u2118", # ℘
    "image": "\u2111",  # ℑ
    "real": "\u211C",   # ℜ
    "trade": "\u2122",  # ™
    "alefsym": "\u2135",# ℵ
    "larr": "\u2190",   # ←
    "uarr": "\u2191",   # ↑
    "rarr": "\u2192",   # →
    "darr": "\u2193",   # ↓
    "harr": "\u2194",   # ↔
    "crarr": "\u21B5",  # ↵
    "lArr": "\u21D0",   # ⇐
    "uArr": "\u21D1",   # ⇑
    "rArr": "\u21D2",   # ⇒
    "dArr": "\u21D3",   # ⇓
    "hArr": "\u21D4",   # ⇔
    "forall": "\u2200", # ∀
    "part": "\u2202",   # ∂
    "exist": "\u2203",  # ∃
    "empty": "\u2205",  # ∅
    "nabla": "\u2207",  # ∇
    "isin": "\u2208",   # ∈
    "notin": "\u2209",  # ∉
    "ni": "\u220B",     # ∋
    "prod": "\u220F",   # ∏
    "sum": "\u2211",    # ∑
    "minus": "\u2212",  # −
    "lowast": "\u2217", # ∗
    "radic": "\u221A",  # √
    "prop": "\u221D",   # ∝
    "infin": "\u221E",  # ∞
    "ang": "\u2220",    # ∠
    "and": "\u2227",    # ∧
    "or": "\u2228",     # ∨
    "cap": "\u2229",    # ∩
    "cup": "\u222A",    # ∪
    "int": "\u222B",    # ∫
    "there4": "\u2234", # ∴
    "sim": "\u223C",    # ∼
    "cong": "\u2245",   # ≅
    "asymp": "\u2248",  # ≈
    "ne": "\u2260",     # ≠
    "equiv": "\u2261",  # ≡
    "le": "\u2264",     # ≤
    "ge": "\u2265",     # ≥
    "sub": "\u2282",    # ⊂
    "sup": "\u2283",    # ⊃
    "nsub": "\u2284",   # ⊄
    "sube": "\u2286",   # ⊆
    "supe": "\u2287",   # ⊇
    "oplus": "\u2295",  # ⊕
    "otimes": "\u2297", # ⊗
    "perp": "\u22A5",   # ⊥
    "sdot": "\u22C5",   # ⋅
    "lceil": "\u2308",  # ⌈
    "rceil": "\u2309",  # ⌉
    "lfloor": "\u230A", # ⌊
    "rfloor": "\u230B", # ⌋
    "lang": "\u2329",   # 〈
    "rang": "\u232A"    # 〉
}
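As a possible alternative to maintaining such a table by hand, the Python standard library ships html.entities: name2codepoint maps HTML 4 entity names to code points, and html5 maps HTML5 names to characters. A minimal sketch (the variable name html_entities is just for illustration):

```python
from html.entities import name2codepoint, html5

# Build a name -> character table from the standard library's HTML 4 list.
html_entities = {name: chr(cp) for name, cp in name2codepoint.items()}

print(len(html_entities))            # number of HTML 4 named entities
print(repr(html_entities["nbsp"]))   # the non-breaking space, U+00A0
print("dollar" in html_entities)     # "dollar" is not an HTML 4 entity name
print(html5.get("dollar;"))          # HTML5 names are keyed with a trailing ';'
```

Comparing a generated table against this one would be a cheap way to check it for extra or missing names.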

simsong commented 2 months ago

ChatGPT works best when you know what questions to ask and you can validate the responses.

sbarber2 commented 2 months ago

Oh, interesting: xml.etree.ElementTree's not accepting (externally) predefined character reference entities is considered a security feature.

See:

1) https://docs.python.org/3.11/library/xml.html#xml-vulnerabilities
2) Billion Laughs Attack

I learn something new every day.

simsong commented 2 months ago

Welcome to my world... https://simson.net/ref/2011/bulk_extractor.pdf p. 9:

sbarber2 commented 2 months ago

OK, this turns out not to be an obvious implementation, at least for me.

Creating a new parser object and setting its EntityDeclHandler looks attractive, but isn't working for me.

xml.etree.ElementTree.XMLParser is the class we want an instance of for a parser to be used with ElementTree.fromstring().

But EntityDeclHandler is not an attribute of XMLParser. It's supposedly an attribute of the parser objects created by xml.parsers.expat, which is a wrapper around the C expat library that supposedly does the actual XML parsing. The source code for XMLParser shows that an XMLParser has an attribute named parser that's an expat parser; if we could get at that, we should be able to provide an EntityDeclHandler. But at runtime, no such attribute exists. And there's some magic going on, because at runtime there's an XMLParser attribute named _parse_whole that seems to supersede the attributes actually in the XMLParser source code, and there's no parser attribute at all.
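One avenue that may be worth checking: XMLParser objects appear to carry an entity dict that is consulted for entity names expat does not know. As far as I can tell, expat only defers to it when the document has a DOCTYPE with an external identifier; without one, the "undefined entity" error is raised first. The sketch below is an experiment under those assumptions, not a tested fix for validate_xml(). The DOCTYPE, the sample markup, and the helper name make_parser are made up, and entities inside attribute values (such as the alt text in tos.html) may behave differently and would need separate checking.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Hypothetical XHTML snippet; note the DOCTYPE with public/system identifiers.
DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<div>&nbsp;Hello, World!&nbsp;</div>'
)

def make_parser():
    """Return an ElementTree XMLParser preloaded with HTML 4 entity names."""
    parser = ET.XMLParser()
    # 'entity' is a plain dict on the parser object; fill it in place.
    parser.entity.update({name: chr(cp) for name, cp in name2codepoint.items()})
    return parser

root = ET.fromstring(DOC, parser=make_parser())
print(repr(root.text))
```

Because the table only maps names to single characters, this should not reopen the entity-expansion problems mentioned earlier in the thread, though that is worth confirming before relying on it.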

After spending a couple hours scratching my head, I think I have tired of this exercise, at least for now.

sbarber2 commented 2 months ago

I'll also note that the ChatGPT-generated code, if we try to run it, gives this:

>>> tree = parse_xhtml(xhtml_string)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in parse_xhtml
  File "/opt/homebrew/Cellar/python@3.11/3.11.9_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/xml/etree/ElementTree.py", line 1350, in XML
    parser.feed(text)
    ^^^^^^^^^^^
AttributeError: 'pyexpat.xmlparser' object has no attribute 'feed'

which is absolutely consistent with what I was seeing in my efforts elsewhere. Passing an expat parser object into the ElementTree.XML() function is just wrong. Yay ChatGPT.
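The traceback suggests that ElementTree.XML() calls feed() on whatever is passed as parser, so it needs an ElementTree-style parser rather than a raw expat one. A minimal sketch of the difference, using a made-up input string:

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

TEXT = "<div>Hello, World!</div>"

# An ElementTree XMLParser provides feed() and close(), which ET.XML() uses.
root = ET.XML(TEXT, parser=ET.XMLParser())
print(root.tag, root.text)

# A raw expat parser offers Parse()/ParseFile() instead, hence the AttributeError.
try:
    ET.XML(TEXT, parser=expat.ParserCreate())
    raw_expat_error = None
except AttributeError as err:
    raw_expat_error = str(err)
print(raw_expat_error)
```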

sbarber2 commented 2 months ago

So, life is short, and I've decided to stop working on this for now and just use Unicode entities where it's an issue. Un-assigning myself from this issue, though it would be lovely to fix it someday.
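If "Unicode entities" here means numeric character references such as &#160; (an assumption on my part), the same substitution could in principle be done automatically before validation. A rough sketch follows; named_to_numeric is a hypothetical helper, not something in the repository, and it is a plain text substitution that does not understand CDATA sections or comments:

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself predefines; leave these for the parser.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def named_to_numeric(text):
    """Rewrite known HTML 4 named entities as numeric character references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # unknown names are left for the parser to reject
        return "&#{};".format(name2codepoint[name])
    return _ENTITY_RE.sub(repl, text)

# Made-up sample markup with an entity in an attribute and one in text.
sample = '<p><img alt="a&nbsp;b" src="x.png"/>&copy; 2024</p>'
root = ET.fromstring(named_to_numeric(sample))
print(repr(root.find("img").get("alt")))
```

Unknown entity names still fail validation, which keeps typos in templates visible.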

Plant-Tracer / webapp

validate_xml doesn't handle HTML entities #508