Summary
Block items nested within an <li> element are squashed into a single ListItem element. Also, formatting whitespace is not normalized in the resulting text.
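To make "normalized" concrete, here is a minimal sketch of the whitespace handling one might expect for text extracted from HTML: each run of spaces, tabs, and newlines collapses to a single space, and the ends are trimmed. The helper name `normalize_whitespace` is hypothetical and uses only the standard library; it is not the library's implementation, nor necessarily what the fix does.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space and trim both ends.

    Hypothetical helper for illustration only; not part of unstructured.
    """
    return re.sub(r"\s+", " ", text).strip()


# Text as it might come out of indented HTML source, with the
# formatting newlines and indentation still attached.
raw = "\n    Hardest to understand\n    about humans was.\n  "
print(normalize_whitespace(raw))
# → Hardest to understand about humans was.
```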
To Reproduce
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """
<ul>
<li>
<p>One of the <b>things</b> Ford Prefect had always found.</p>
Hardest to <i>understand</i> about humans was.
<p>Their habit of continually <b>stating</b> and <b>repeating</b> the.</p>
very <i>very</i> obvious.
</li>
</ul>
"""
elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))
Expected:
Actual:
Additional context
Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.