Exception handler tries to use nonexistent error class (html.HTMLParseError)

HainLuud commented 1 month ago

At line 349 of input.py the exception handler tries to access html.HTMLParseError, an error class that used to exist in CPython's html.parser module but was deprecated in Python 3.3 and removed in Python 3.5.

The genshi code in question is this:

        def _generate():
            if self.encoding:
                reader = codecs.getreader(self.encoding)
                source = reader(self.source)
            else:
                source = self.source
            try:
                bufsize = 4 * 1024 # 4K
                done = False
                while 1:
                    while not done and len(self._queue) == 0:
                        data = source.read(bufsize)
                        if not data: # end of data
                            self.close()
                            done = True
                        else:
                            if not isinstance(data, six.text_type):
                                raise UnicodeError("source returned bytes, but no encoding specified")
                            self.feed(data)
                    for kind, data, pos in self._queue:
                        yield kind, data, pos
                    self._queue = []
                    if done:
                        open_tags = self._open_tags
                        open_tags.reverse()
                        for tag in open_tags:
                            yield END, QName(tag), pos
                        break
            except html.HTMLParseError as e:
                msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
                raise ParseError(msg, self.filename, e.lineno, e.offset)
        return Stream(_generate()).filter(_coalesce)
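For anyone wondering why this goes unnoticed on the normal path: Python only evaluates the expression in an `except` clause when an exception actually reaches it. The sketch below is not Genshi code; `fake_html` and `run` are hypothetical stand-ins used to show that a handler naming a missing attribute is harmless until something is raised, at which point the original error is replaced by an AttributeError.

```python
import types

# Hypothetical stand-in for a module that no longer defines HTMLParseError.
fake_html = types.SimpleNamespace()

def run(fail):
    try:
        if fail:
            raise ValueError("parser failure")
        return "ok"
    except fake_html.HTMLParseError:
        # Never reached: looking up the attribute above fails first.
        return "handled"

# No exception is raised, so the except expression is never evaluated.
print(run(False))

# An exception is raised, the except expression is evaluated, and the
# attribute lookup fails, masking the original ValueError.
try:
    run(True)
except AttributeError as exc:
    print("masked by:", type(exc).__name__)
    print("original error:", type(exc.__context__).__name__)
```

The original exception is still reachable through `__context__`, but callers expecting a ParseError would see an AttributeError instead.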

FelixSchwarz commented 1 month ago

Could you share a minimal template that triggers this error? I am wondering why we did not notice this before.

HainLuud commented 1 month ago

It's an auto-generated fuzzing input, so there is not much logic to it, but an example input is 8zwhWz4= (base64-encoded):

import base64
from genshi import HTML
from genshi.filters import HTMLSanitizer

inp = base64.b64decode("8zwhWz4=")
markup = HTML(inp) | HTMLSanitizer()
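To make the payload easier to read, here is a small sketch that only decodes it, using the standard library and nothing from Genshi. It is five bytes: one non-ASCII byte followed by `<![>`. The `<![` sequence looks like the start of a marked section, which may be what trips the parser, though that is an assumption and nothing in this thread confirms it.

```python
import base64

# Decode the fuzzing input from the report above.
payload = base64.b64decode("8zwhWz4=")

print(payload)       # b'\xf3<![>'
print(len(payload))  # 5
```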

hodgestar commented 3 weeks ago

@HainLuud Thank you for reporting this. I've added a simple fix and a slightly simpler version of the test you wrote.

@FelixSchwarz I guess this never came up because triggering an exception here is quite hard. The Python HTMLParser (when not in strict mode) accepts almost anything as valid HTML as long as it's valid text.
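As a rough illustration of that leniency, the sketch below feeds malformed markup straight to the standard library's `html.parser.HTMLParser`, not to Genshi's wrapper. The `Collector` subclass is a hypothetical helper that just records events. On current Python 3 versions this input should be parsed without raising.

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    """Hypothetical helper that records the parser events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

parser = Collector()
# Unclosed tags, a mismatched end tag, and stray angle brackets.
parser.feed("<p>unclosed <b>tags </i> and <<< stray brackets")
parser.close()

for event in parser.events:
    print(event)
```

The mismatched `</i>` is reported as an ordinary end tag and the stray brackets come through as text, so an exception handler around this parser is rarely exercised.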

edgewall / genshi

Exception handler tries to use nonexistent error class (html.HTMLParseError) #85