Open scanny opened 2 weeks ago
Hi!
@scanny great improvement! I found a bug in class Flow: it is missing an implementation of
    def iter_text_segments(self, text: str) -> Iterator[TextSegment]:
which raises AttributeError: 'Flow' object has no attribute 'iter_text_segments' when called in Phrasing:
    def _iter_child_text_segments(self, emphasis: str) -> Iterator[TextSegment]:
        """Generate zero-or-more text-segments for phrasing children of this element.

        All generated text segments will be annotated with `emphasis` when it is other than the
        empty string.
        """
        for child in self:
            yield from child.iter_text_segments(emphasis)
best regards,
Hi @heralight, thanks! :)
Can you provide a (real-life) HTML snippet that triggers a problem with that? An AttributeError: 'Flow' object has no attribute 'iter_text_segments', I suppose?
In general, phrasing elements cannot themselves contain block elements (so this method would not be called on a Flow element), but I may have missed one or two that should really be classified as block types or need a custom implementation because they can appear in both roles. So I'm keen for breaking examples that actually occur somewhere in the wild.
I don't need a whole document and don't care what the text is. A small fragment like this will do the trick:
<div>
Text of div <b>with <i>hierarchical</i>\nphrasing</b> content before first block item
<p>Click <a href="http://blurb.io">here</a> to see the blurb for this block item. </p>
tail of block item <b>with <i>hierarchical</i> phrasing </b> content
</div>
@heralight btw, I expect shortly we will add that method to Flow, or implement some behavioral equivalent, to treat a block item child of a phrasing element as inline content, perhaps issuing a warning when that happens.
I wanted to leave it as an error at first here to try to flush out any legitimate examples where I've mis-classified a flow element as phrasing or missed an element that can take either block or inline display-role depending on its placement.
@scanny, a basic sample is:
<a>
<div>
</div>
</a>
I found it in a real mainstream website like:
<a>
<div class="relative inline-flex">
<span data-spark-component="badge" role="status"></span>
</div><span class="text-caption">Checkout</span>
<!-- a lot of other elements (around 30) -->
</a>
my current workaround is to add:
    def iter_text_segments(self, text: str) -> Iterator[TextSegment]:
        yield TextSegment(text, {})
to Flow
@heralight Ah, okay, this is a very helpful example. It occurred to me we might need to be more sophisticated in how we handle anchor (<a>) elements because of their potential dual role depending on where they appear.
Here's the solution I think makes sense for this case:
- Make Anchor.is_phrasing dynamic: True when .is_phrasing is True for all its children, but False when it contains any block (Flow) child elements (.is_phrasing reports False for one or more of its children).
- Add Anchor.iter_elements() to handle the latter case.
I'm thinking this is the first of a "dual-role" category of element-types. Not sure yet whether it's the only one but I expect there are others.
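A minimal sketch of the dynamic is_phrasing idea, to make the rule concrete. The Flow, Phrasing, and Anchor classes below are simplified stand-ins written for this illustration (plain Python objects holding a children list), not the actual lxml-based classes in the PR, and the Anchor.iter_elements() part is left out.

```python
class Flow:
    """Hypothetical stand-in for a block-level element."""

    is_phrasing = False

    def __init__(self, *children):
        self.children = list(children)


class Phrasing:
    """Hypothetical stand-in for an inline (phrasing) element."""

    is_phrasing = True

    def __init__(self, *children):
        self.children = list(children)


class Anchor(Phrasing):
    """Dual-role element: inline unless it contains a block child."""

    @property
    def is_phrasing(self) -> bool:
        # Phrasing only when every child is itself phrasing; a single
        # block (Flow) child makes the anchor behave as a block item.
        return all(child.is_phrasing for child in self.children)


inline_anchor = Anchor(Phrasing(), Phrasing())
block_anchor = Anchor(Flow(), Phrasing())
print(inline_anchor.is_phrasing, block_anchor.is_phrasing)
```

In this sketch an anchor with no children counts as phrasing, because all() over an empty list is True; whether the real implementation treats that case the same way is an assumption.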
I've just spiked that out and it's a very small change, so I'll add tests for it and add it to this PR.
With that change, this code:
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """
<div>
“O Deep Thought computer," he said,
<a>
<div>The task we have designed you to perform is this.</div>
<p>We want you to tell us.... he paused,"</p>
</a>
</div>
"""
elements = partition_html(text=html_text)
print(f"{elements_to_json(elements, indent=2)}")
produces these elements:
[
{
"element_id": "658df72fa2c379d60e265845e5ceb654",
"metadata": {
"filetype": "text/html",
"languages": ["eng"]
},
"text": "\u201cO Deep Thought computer,\" he said,",
"type": "NarrativeText"
},
{
"element_id": "4cfbb7b43a83e28c32c8186e4c804b58",
"metadata": {
"filetype": "text/html",
"languages": ["eng"]
},
"text": "The task we have designed you to perform is this.",
"type": "NarrativeText"
},
{
"element_id": "ed2fa97fc8314fb9b36c2177faf848d8",
"metadata": {
"filetype": "text/html",
"languages": ["eng"]
},
"text": "We want you to tell us.... he paused,\"",
"type": "NarrativeText"
}
]
@scanny makes sense! Thanks!
Summary
Replace legacy HTML parser with recursive version that captures all content and provides flexibility to add new metadata. It's also substantially faster, although that's just a happy side-effect.
Additional Context
The prior HTML parsing algorithm that makes up the core of HTML partitioning was buggy and very difficult to reason about because it did not conform to the inherently recursive structure of HTML. The new version retains lxml as the performant and reliable base library but uses lxml's custom element classes to efficiently classify HTML elements by their behaviors (block-item and inline (phrasing) primarily) and give those elements the desired partitioning behaviors.
This solves a host of existing problems with content being skipped and elements (paragraphs) being divided improperly, but also provides a clear domain model for reasoning about its behavior and reliably adjusting it to suit our existing and future purposes.
The parser's operation is recursive, closely modeling the recursive structure of HTML itself. Its behaviors are based on the HTML Standard and reliably produce proper and explainable results even for novel cases.
Fixes #2325 Fixes #2562 Fixes #2675 Fixes #3168 Fixes #3227 Fixes #3228 Fixes #3230 Fixes #3237 Fixes #3245 Fixes #3247 Fixes #3255 Fixes #3309
BEHAVIOR DIFFERENCES
emphasized_text_tags encoding is changed:
- <strong> is encoded as "b" rather than "strong".
- <em> is encoded as "i" rather than "em".
- <span> is no longer recorded in emphasized_text_tags (because without the CSS we can't tell whether it's used for emphasis or, if so, what kind).
emphasized_text_contents is broken on emphasis-change boundaries, like:
produces:
whereas previously it would have produced:
<pre> text is preserved as it appears in the HTML, except that a leading newline is removed if present (it has to be in position 0 of the text). Also, a trailing newline is stripped, but only if it appears in the very last position ([-1]) of the <pre> text. The old parser stripped all leading and trailing whitespace. The result is that:
parses to "foo\nbar\nbaz", which is the same result produced for:
This equivalence is the same behavior exhibited by a browser, which is why we did the extra work to make it this way.
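The <pre> newline rule above can be sketched in a few lines. This is only an illustration of the stated rule; the helper name pre_text is hypothetical and is not taken from the parser's code.

```python
def pre_text(raw: str) -> str:
    """Trim at most one leading and one trailing newline from <pre> text.

    All other whitespace is kept exactly as it appears in the source.
    """
    # A leading newline is removed only when it is the very first character.
    if raw.startswith("\n"):
        raw = raw[1:]
    # A trailing newline is removed only when it is the very last character.
    if raw.endswith("\n"):
        raw = raw[:-1]
    return raw


with_newlines = pre_text("\nfoo\nbar\nbaz\n")
without_newlines = pre_text("foo\nbar\nbaz")
print(with_newlines == without_newlines)
```

Indentation and interior newlines pass through untouched, which is the difference from the old behavior of stripping all leading and trailing whitespace.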
Whitespace normalization
Leading and trailing whitespace are removed from element text, just as they are removed in the browser. Runs of whitespace within the element text are reduced to a single space character (like in the browser). Note this means that whitespace characters such as \t and \n are replaced with a regular space character. All text derived from elements is whitespace-normalized except the text within a <pre> tag. Any leading or trailing newline is trimmed from <pre> element text; all other whitespace is preserved just as it appeared in the HTML source.
link_start_indexes metadata is no longer captured. Rationale:
-1.
<br/> element is replaced with a single newline ("\n"), but that is usually replaced with a space in Element.text when it is normalized. The newline is preserved within a <pre> element.
<br/><br/>
Empty h1..h6 elements are dropped.
HTML heading elements (<h1..h6>) are "skipped" (do not generate a Title element) when they contain no text or contain only whitespace.
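As a rough illustration of the whitespace-normalization rule in this section, here is a short sketch using Python's re module. The function name normalize_whitespace is an assumption made for this example, and the parser's real implementation may differ.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Tabs, newlines, and runs of spaces all become a single space.
print(normalize_whitespace("  foo\t\nbar   baz "))

# A newline produced by <br/> ends up as a space after normalization.
print(normalize_whitespace("line one\nline two"))

# Whitespace-only text normalizes to the empty string, which is the
# condition under which a heading would produce no Title element.
print(repr(normalize_whitespace(" \n\t ")))
```

Text inside a <pre> element would bypass this function, since only its single leading and trailing newline are trimmed.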