A <br/> element within a block item (e.g. <div>, <p>, etc.) causes the element to be closed and a new one started, mid-paragraph.
According to the HTML Standard, a <br/> element is a phrasing (inline) element and indicates a line-break but does not divide the current block item in two.
To Reproduce
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """
<p>
Too old to begin<br/>
training of young Skywalker.<br/>
But teach him, I must.
</p>
"""
elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))
Expected:
[
  {
    "element_id": "44c358a19a9d7b9a1c3c9ac999b5bf93",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Too old to begin training of young Skywalker. But teach him, I must.",
    "type": "NarrativeText"
  }
]
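The expected behavior can also be sketched without unstructured, using only the standard library's html.parser. This is an illustration of the HTML Standard's treatment of <br/> as whitespace inside a block, not the fix that was made in the library; the names BLOCK_TAGS, BlockTextExtractor, and block_texts are hypothetical, and the handling of nested blocks is deliberately simplified.

```python
from html.parser import HTMLParser

# Hypothetical, illustration-only set of block tags; not taken from unstructured.
BLOCK_TAGS = {"p", "div"}


class BlockTextExtractor(HTMLParser):
    """Collect one text string per outermost block, treating <br> as whitespace."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished block texts
        self._parts = []   # text fragments of the block currently being read
        self._depth = 0    # nesting depth inside block elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1
        elif tag == "br" and self._depth:
            # A line break separates words but does not end the block.
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._flush()

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    def _flush(self):
        # Collapse runs of whitespace, including newlines, into single spaces.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []


def block_texts(html):
    """Return the text of each outermost block element in the given HTML."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


html_text = """
<p>
Too old to begin<br/>
training of young Skywalker.<br/>
But teach him, I must.
</p>
"""
print(block_texts(html_text))
```

With the input from the reproduction above, the sketch yields a single string for the whole paragraph, matching the "text" field of the expected output.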
Additional context
Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.