Closed scanny closed 3 weeks ago
This pull request was deployed and Sentry observed the following issues:
/general/v0/general
max_characters value, got 800 > 500 /general/v0/general
Summary
Remove use of `partition.common.document_to_element_list()` by `HTMLDocument`. The transitive coupling with layout-inference through this shared function has been a source of frustration and a drain on engineering time, and there's no compelling reason for the two to share this code.

Additional Context
`partition_html()` uses `partition.common.document_to_element_list()` to get finalized elements from `HTMLDocument` (pages). This gives rise to a very nasty coupling between `DocumentLayout`, used by `unstructured_inference`, and `HTMLDocument`. `document_to_element_list()` has evolved to work for both callers, but they share very few common characteristics with each other.

This coupling is bad news for us and also, importantly, for the inference and page layout folks working on PDF and images.
Break that coupling so those inference-related functions can evolve whatever way they need to without being dragged down by legacy `HTMLDocument` connections.

The initial step is to extract a `document_to_element_list()` function of our own, getting rid of the coordinates and other `DocumentLayout`-related bits we don't need. As you'll see in the next few PRs, all of this `document_to_element_list()` code will end up either going away or being relocated closer to where it's used in `HTMLDocument`.
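
For readers who want a picture of what "a `document_to_element_list()` of our own" might look like, here is a minimal sketch. It is not the code in this PR: `Element`, `Page`, and `HTMLDocumentSketch` are hypothetical stand-ins invented for illustration, and the real function in `unstructured` has a different signature and does more. The only point is that an HTML-only version can walk pages and elements without any coordinate or `DocumentLayout` handling.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a finalized document element."""
    text: str
    page_number: Optional[int] = None


@dataclass
class Page:
    """Hypothetical stand-in for one page of an HTML document."""
    number: int
    elements: List[Element] = field(default_factory=list)


@dataclass
class HTMLDocumentSketch:
    """Hypothetical stand-in for HTMLDocument; holds pages only."""
    pages: List[Page] = field(default_factory=list)


def document_to_element_list(document: HTMLDocumentSketch) -> List[Element]:
    """HTML-only variant: no coordinates, no DocumentLayout branches."""
    elements: List[Element] = []
    for page in document.pages:
        for element in page.elements:
            # Record which page the element came from, then collect it.
            element.page_number = page.number
            elements.append(element)
    return elements
```

Because a local copy like this has a single caller, it can later be trimmed or moved next to the code that uses it without touching the shared function that layout inference relies on.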