Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

rfctr(html): prepare for new html parser #3257

Closed scanny closed 1 week ago

scanny commented 1 week ago

Summary: Extract as much mechanical refactoring from the HTML parser change-over into this PR as possible. This leaves the next PR focused on installing the new parser and on the ingest-test impact.

Reviewers: Commits are well groomed and reviewing commit-by-commit is probably easier.

Additional Context: This PR introduces the rewritten HTML parser. Its general design is recursive, consistent with the recursive structure of HTML (a tree of elements). It also adds the unit tests for that parser but does not install the parser, so the behavior of partition_html() is unchanged by this PR. The next PR in this series will install the parser and handle the ingest and other unit-test changes required to reflect the dozen or so bug fixes the new parser provides.
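
For reviewers unfamiliar with the approach, here is a minimal sketch of what "recursive, consistent with the tree structure of HTML" can look like. It is not the code in this PR: it uses the standard library's xml.etree.ElementTree on a well-formed snippet, and the names BLOCK_TAGS and iter_blocks are hypothetical, chosen only for illustration. Each block element emits its own text and delegates nested block elements to a recursive call, while inline elements are folded into the enclosing block's text.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of block-level tags, for illustration only.
BLOCK_TAGS = {"html", "body", "div", "section", "p", "h1", "h2", "ul", "ol", "li"}


def iter_blocks(element):
    """Recursively yield (tag, text) pairs for block elements that contain text."""
    parts = []  # text fragments belonging directly to this block
    if element.text and element.text.strip():
        parts.append(element.text.strip())

    for child in element:
        if child.tag in BLOCK_TAGS:
            # A nested block ends the current run of text: emit it, then recurse.
            if parts:
                yield (element.tag, " ".join(parts))
                parts = []
            yield from iter_blocks(child)
        else:
            # An inline element (e.g. <b>, <a>): fold its text into this block.
            inline_text = "".join(child.itertext()).strip()
            if inline_text:
                parts.append(inline_text)
        # Text following the child still belongs to this element.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())

    if parts:
        yield (element.tag, " ".join(parts))


sample = (
    "<div><h1>Title</h1><p>Hello <b>bold</b> world</p>"
    "<ul><li>one</li><li>two</li></ul></div>"
)
blocks = list(iter_blocks(ET.fromstring(sample)))
```

The recursion depth follows the nesting depth of the document, so the shape of the function mirrors the shape of the tree it walks. A real parser would also need to handle malformed markup, which this sketch deliberately ignores.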

scanny commented 1 week ago

Thanks @Coniferish, I'll get those changes in before I merge :)