Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

rfctr(html): break coupling to DocumentLayout #3180

Closed: scanny closed 3 weeks ago

scanny commented 3 weeks ago

Summary: Remove use of partition.common.document_to_element_list() by HTMLDocument. The transitive coupling with layout-inference through this shared function has been a source of frustration and a drain on engineering time, and there's no compelling reason for the two to share this code.

Additional Context: partition_html() uses partition.common.document_to_element_list() to get finalized elements from HTMLDocument (pages). This gives rise to a very nasty coupling between DocumentLayout, used by unstructured_inference, and HTMLDocument. document_to_element_list() has evolved to work for both callers, but the two have very little in common.

This coupling is bad news for us and also, importantly, for the inference and page layout folks working on PDF and images.

Break that coupling so those inference-related functions can evolve whatever way they need to without being dragged down by legacy HTMLDocument connections.

The initial step is to extract a document_to_element_list() function of our own, getting rid of the coordinates and other DocumentLayout-related bits we don't need. As you'll see in the next few PRs, all of this document_to_element_list() code will end up either going away or being relocated closer to where it's used in HTMLDocument.
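For illustration only, here is a minimal sketch of what an HTML-only variant of that function might look like once the coordinate and DocumentLayout handling is gone. It is not the code in this PR: Element, Page, and html_document_to_element_list() are hypothetical stand-ins invented for the example, and it assumes the page number is the only per-page metadata worth carrying over.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a finalized document element."""

    text: str
    page_number: Optional[int] = None


@dataclass
class Page:
    """Hypothetical stand-in for one page of an HTML document."""

    number: int
    elements: List[Element] = field(default_factory=list)


def html_document_to_element_list(pages: List[Page]) -> List[Element]:
    """Flatten pages into a single element list for the HTML case only.

    There is no coordinate system, layout metadata, or image handling here,
    so nothing in this function depends on layout-inference types.
    """
    flattened: List[Element] = []
    for page in pages:
        for element in page.elements:
            # Copy rather than mutate, so the caller's page objects are untouched.
            flattened.append(replace(element, page_number=page.number))
    return flattened


pages = [
    Page(1, [Element("Title"), Element("Intro")]),
    Page(2, [Element("Body")]),
]
elements = html_document_to_element_list(pages)
print([(e.text, e.page_number) for e in elements])
# → [('Title', 1), ('Intro', 1), ('Body', 2)]
```

Because the sketch takes only HTML-side types, changes on the inference side cannot break it, which is the decoupling this PR is after.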

sentry-io[bot] commented 3 weeks ago

Suspect Issues

This pull request was deployed and Sentry observed the following issues:
