Summary
HTMLDocument is the class that handles the core of HTML parsing. This is critical code because 8 of the 20 file-type partitioners end up using it (partition_html() plus 7 brokering partitioners such as EPUB, MD, and RST).
For historical reasons, HTMLDocument subclassed XMLDocument, which in turn subclassed Document. Neither base class is relevant any longer, and both unnecessarily complicate reasoning about HTMLDocument behavior.
Remove that inheritance and dependency, and drop the XMLDocument and Document modules, which become dead code once HTMLDocument no longer uses them.