Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.49k stars, 584 forks

rfctr(html): drop now dead XMLDocument and Document #3165

Closed by scanny 3 weeks ago

scanny commented 4 weeks ago

Summary

HTMLDocument is the class that handles the core of HTML parsing. This code is critical because 8 of the 20 file-type partitioners depend on it: partition_html() itself plus 7 brokering partitioners (such as EPUB, MD, and RST) that route through it.
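To illustrate what "brokering" means here, the following is a minimal, self-contained sketch rather than the library's actual code. The names `partition_html_sketch` and `partition_md_sketch` are hypothetical stand-ins, and the Markdown-to-HTML conversion is deliberately trivial. It only shows the shape of the dependency: a brokering partitioner converts its input to HTML and hands it to the shared HTML parsing core.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.texts: list[str] = []

    def handle_data(self, data: str) -> None:
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def partition_html_sketch(html: str) -> list[str]:
    """Hypothetical stand-in for the shared HTML parsing core."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def partition_md_sketch(markdown: str) -> list[str]:
    """Hypothetical brokering partitioner: convert to HTML, then delegate."""
    html_parts = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# "):
            # Toy conversion: a level-1 Markdown heading becomes <h1>.
            html_parts.append(f"<h1>{line[2:]}</h1>")
        else:
            # Everything else is treated as a paragraph.
            html_parts.append(f"<p>{line}</p>")
    # All real parsing work happens in the HTML core.
    return partition_html_sketch("".join(html_parts))
```

Because every brokering partitioner ends in the same call, any change to the HTML core affects all of them at once, which is why simplifying that core matters.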

For historical reasons, HTMLDocument subclassed XMLDocument, which in turn subclassed Document. Neither base class is relevant any longer, and both unnecessarily complicate reasoning about HTMLDocument behavior.

Remove that inheritance and dependency, and drop both the XMLDocument and Document modules, which become dead code once HTMLDocument no longer uses them.
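The shape of this refactor can be sketched as follows. This is an illustrative simplification, not the code in this PR: the class bodies, the `text` attribute, the `elements` property, and the `_parse()` hook are all assumptions made for the example. Only the inheritance chain (HTMLDocument, XMLDocument, Document) comes from the description above.

```python
import functools


# Before (hypothetical): behavior is spread across three classes.
class Document:
    """Assumed generic base that lazily caches parsed elements."""

    def __init__(self) -> None:
        self._elements = None

    @property
    def elements(self) -> list[str]:
        if self._elements is None:
            self._elements = self._parse()
        return self._elements

    def _parse(self) -> list[str]:
        raise NotImplementedError


class XMLDocument(Document):
    """Assumed middle layer that stores the raw markup."""

    def __init__(self, text: str) -> None:
        super().__init__()
        self.text = text


class HTMLDocumentBefore(XMLDocument):
    def _parse(self) -> list[str]:
        # Toy "parsing": one element per non-blank line.
        return [ln.strip() for ln in self.text.splitlines() if ln.strip()]


# After (hypothetical): one standalone class, no base classes to trace.
class HTMLDocumentAfter:
    def __init__(self, text: str) -> None:
        self.text = text

    @functools.cached_property
    def elements(self) -> list[str]:
        # Same toy parsing, now defined where it is used.
        return [ln.strip() for ln in self.text.splitlines() if ln.strip()]
```

In the flattened version, everything a reader needs to understand `elements` lives in a single class, and the two base classes have no remaining users, so they can be deleted.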