Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

rfctr(html): replace html parser #3218

Open scanny opened 2 weeks ago

scanny commented 2 weeks ago

Summary

Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster, although that's just a happy side-effect.

Additional Context

The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains lxml as the performant and reliable base library but uses lxml's custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors.
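As a rough illustration of the custom-element-class mechanism mentioned above, here is a minimal sketch using lxml's class-lookup API. The class names `Flow` and `Phrasing` echo names used later in this thread, but the tag set, the `is_phrasing` attribute as written here, and the lookup class are assumptions made for illustration, not the PR's actual implementation.

```python
from lxml import etree

# Hypothetical, deliberately incomplete set of inline (phrasing) tags.
PHRASING_TAGS = {"a", "b", "i", "em", "strong", "span", "code"}


class Flow(etree.ElementBase):
    """Block-item element (illustrative stand-in)."""

    is_phrasing = False


class Phrasing(etree.ElementBase):
    """Inline (phrasing) element (illustrative stand-in)."""

    is_phrasing = True


class HtmlElementLookup(etree.CustomElementClassLookup):
    """Pick an element class based on the tag name."""

    def lookup(self, node_type, document, namespace, name):
        if node_type != "element":
            return None  # comments etc. fall back to lxml's default classes
        return Phrasing if name in PHRASING_TAGS else Flow


parser = etree.HTMLParser()
parser.set_element_class_lookup(HtmlElementLookup())

root = etree.fromstring("<div>Hello <b>bold</b><p>para</p></div>", parser)
div = next(root.iter("div"))
print([(child.tag, child.is_phrasing) for child in div])
```

Because each element comes back from the parser as an instance of its assigned class, behavior such as "generate elements" or "generate text segments" can live on the class itself rather than in an external dispatch table.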

This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases.

Fixes #2325 Fixes #2562 Fixes #2675 Fixes #3168 Fixes #3227 Fixes #3228 Fixes #3230 Fixes #3237 Fixes #3245 Fixes #3247 Fixes #3255 Fixes #3309

BEHAVIOR DIFFERENCES

emphasized_text_tags encoding is changed:

<pre> text is preserved as it appears in the html

Except that a leading newline is removed if present (it has to be in position 0 of the text). Also, a trailing newline is stripped, but only if it appears in the very last position ([-1]) of the <pre> text. The old parser stripped all leading and trailing whitespace.

Result is that:

<pre>
foo
bar
baz
</pre>

parses to "foo\nbar\nbaz" which is the same result produced for:

<pre>foo
bar
baz</pre>

This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.
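The rule described above can be sketched in a few lines of plain Python. The function name `normalize_pre_text` is hypothetical and chosen only for this illustration; it is not taken from the PR.

```python
def normalize_pre_text(text: str) -> str:
    """Trim at most one leading and one trailing newline from <pre> text.

    All other whitespace, including indentation and interior newlines, is kept.
    """
    if text.startswith("\n"):  # newline must be in position 0
        text = text[1:]
    if text.endswith("\n"):  # newline must be in position -1
        text = text[:-1]
    return text


# Both forms of the <pre> example yield the same text.
print(repr(normalize_pre_text("\nfoo\nbar\nbaz\n")))
print(repr(normalize_pre_text("foo\nbar\nbaz")))
```

Note that only a single newline is removed at each end, so a second leading newline or any leading spaces would survive.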

Whitespace normalization

Leading and trailing whitespace is removed from element text, just as it is in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that \t, \n, and &nbsp; are replaced with a regular space character. All text derived from elements is whitespace normalized except the text within a <pre> tag. Any leading or trailing newline is trimmed from <pre> element text; all other whitespace is preserved just as it appeared in the HTML source.
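For non-<pre> text, the normalization described above behaves roughly like the following sketch. The function name `normalize_whitespace` is hypothetical, and the regular expression is one possible way to express the rule, not necessarily how the parser does it.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space, then trim both ends.

    In Python, `\\s` on a str pattern matches Unicode whitespace, which
    includes the no-break space (U+00A0) that &nbsp; decodes to.
    """
    return re.sub(r"\s+", " ", text).strip()


print(repr(normalize_whitespace("\n  foo\tbar\u00a0baz \n")))
```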

link_start_indexes metadata is no longer captured. Rationale:

<br/> element is replaced with a single newline ("\n")

However, that newline is usually replaced with a space in Element.text when the text is normalized. The newline is preserved within a <pre> element.

Empty h1..h6 elements are dropped.

HTML heading elements (<h1..h6>) are "skipped" (do not generate a Title element) when they contain no text or contain only whitespace.

heralight commented 1 week ago

Hi!

@scanny great improvement! I found a bug in class Flow: it is missing an implementation of iter_text_segments, which raises AttributeError: 'Flow' object has no attribute 'iter_text_segments'. The missing method is:

    def iter_text_segments(self, text: str) -> Iterator[TextSegment]:

when it is called from Phrasing:

    def _iter_child_text_segments(self, emphasis: str) -> Iterator[TextSegment]:
        """Generate zero-or-more text-segments for phrasing children of this element.

        All generated text segments will be annotated with `emphasis` when it is other than the
        empty string.
        """
        for child in self:
            yield from child.iter_text_segments(emphasis)

best regards,

scanny commented 1 week ago

Hi @heralight, thanks! :)

Can you provide a (real-life) HTML snippet that triggers a problem with that? An AttributeError: Flow has no attribute 'iter_text_segments' I suppose?

In general, phrasing elements cannot themselves contain block elements (so this method would not be called on a Flow element), but I may have missed one or two that should really be classified as block types or need a custom implementation because they can appear in both roles. So I'm keen for breaking examples that actually occur somewhere in the wild.

I don't need a whole document and don't care what the text is. A small fragment like this will do the trick:

          <div>
            Text of div <b>with <i>hierarchical</i>\nphrasing</b> content before first block item
            <p>Click <a href="http://blurb.io">here</a> to see the blurb for this block item. </p>
            tail of block item <b>with <i>hierarchical</i> phrasing </b> content
          </div>

scanny commented 1 week ago

@heralight btw, I expect shortly we will add that method to Flow, or implement some behavioral equivalent, to treat a block item child of a phrasing element as inline content, perhaps issuing a warning when that happens.

I wanted to leave it as an error at first here to try to flush out any legitimate examples where I've mis-classified a flow element as phrasing or missed an element that can take either block or inline display-role depending on its placement.

heralight commented 1 week ago

@scanny, a basic sample is:

<a>
  <div>
  </div>
</a>

I found it on a real mainstream website, in markup like:

<a>
  <div class="relative inline-flex">
    <span data-spark-component="badge" role="status"></span>
  </div><span class="text-caption">Checkout</span>
  <!-- a lot of other elements (around 30) -->
</a>

My current workaround is to add:

    def iter_text_segments(self, text: str) -> Iterator[TextSegment]:
        yield TextSegment(text, {})

to Flow

scanny commented 1 week ago

@heralight Ah, okay, this is a very helpful example. It occurred to me we might need to be more sophisticated in how we handle anchor (<a>) elements because of their potential dual role depending on where they appear.

Here's the solution I think makes sense for this case:

  1. Make Anchor.is_phrasing dynamic, True when .is_phrasing is True for all its children but False when it contains any block (Flow) child elements (.is_phrasing reports False for one or more of its children).
  2. Add Anchor.iter_elements() to handle the latter case.
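The first step above can be illustrated with a small toy model in plain Python. This is only a sketch of the idea: the `Node` base class, the `PHRASING_TAGS` set, and the constructor signatures are hypothetical stand-ins for the lxml-based classes in the PR, and `Anchor.iter_elements()` is not modeled here.

```python
# Hypothetical, deliberately incomplete set of inline (phrasing) tags.
PHRASING_TAGS = {"b", "i", "em", "strong", "span"}


class Node:
    """Toy element whose phrasing status is fixed by its tag."""

    def __init__(self, tag, *children):
        self.tag = tag
        self.children = list(children)

    @property
    def is_phrasing(self) -> bool:
        return self.tag in PHRASING_TAGS


class Anchor(Node):
    """Toy <a> element whose phrasing status depends on its children."""

    def __init__(self, *children):
        super().__init__("a", *children)

    @property
    def is_phrasing(self) -> bool:
        # Phrasing only when every child is phrasing; any block (Flow)
        # child makes the anchor behave as a block item instead.
        return all(child.is_phrasing for child in self.children)


inline_link = Anchor(Node("span"))
block_link = Anchor(Node("div"), Node("span"))
print(inline_link.is_phrasing, block_link.is_phrasing)
```

In this toy model an anchor with no child elements counts as phrasing, since `all()` of an empty sequence is true; whether the real implementation treats that case the same way is not stated in this thread.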

I'm thinking this is the first of a "dual-role" category of element-types. Not sure yet whether it's the only one but I expect there are others.

I've just spiked that out and it's a very small change, so I'll add tests for it and add it to this PR.

With that change, this code:

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<div>
  “O Deep Thought computer," he said,
  <a>
    <div>The task we have designed you to perform is this.</div>
    <p>We want you to tell us.... he paused,"</p>
  </a>
</div>
"""

elements = partition_html(text=html_text)
print(f"{elements_to_json(elements, indent=2)}")

produces these elements:

[
  {
    "element_id": "658df72fa2c379d60e265845e5ceb654",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "\u201cO Deep Thought computer,\" he said,",
    "type": "NarrativeText"
  },
  {
    "element_id": "4cfbb7b43a83e28c32c8186e4c804b58",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "The task we have designed you to perform is this.",
    "type": "NarrativeText"
  },
  {
    "element_id": "ed2fa97fc8314fb9b36c2177faf848d8",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "We want you to tell us.... he paused,\"",
    "type": "NarrativeText"
  }
]

heralight commented 1 week ago

@scanny makes sense! Thanks!