Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): <br/> element breaks paragraph/document-element #3227

Open scanny opened 1 week ago


A <br/> element within a block item (e.g. <div>, <p>, etc.) causes the element to be closed and a new one started mid-paragraph.

According to the HTML Standard, a <br/> element is a phrasing (inline) element and indicates a line-break but does not divide the current block item in two.
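To illustrate the intended behavior (not the partitioner's actual implementation), here is a minimal sketch using only the standard library's `html.parser`. The names `BLOCK_TAGS`, `BlockTextExtractor`, and `block_texts` are hypothetical, and the sketch assumes flat, non-nested block elements. It treats `<br>` as whitespace inside the current block rather than as a block boundary.

```python
from html.parser import HTMLParser

# Hypothetical subset of block-level tags, chosen for illustration only.
BLOCK_TAGS = {"p", "div"}


class BlockTextExtractor(HTMLParser):
    """Collect one text string per block element; <br> does not split a block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._parts = None  # text fragments of the block currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._parts = []
        elif tag == "br" and self._parts is not None:
            # A line break is phrasing content: keep the block open and
            # record the break as whitespace.
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def _flush(self):
        if self._parts is not None:
            # Collapse runs of whitespace, including newlines from the source.
            text = " ".join("".join(self._parts).split())
            if text:
                self.blocks.append(text)
        self._parts = None


def block_texts(html):
    """Return the normalized text of each block element in `html`."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

With this approach, a paragraph containing two `<br/>` elements yields a single text string instead of three.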

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<p>
  Too old to begin<br/>
  training of young Skywalker.<br/>
  But teach him, I must.
</p>
"""
elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[
  {
    "element_id": "44c358a19a9d7b9a1c3c9ac999b5bf93",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Too old to begin training of young Skywalker. But teach him, I must.",
    "type": "NarrativeText"
  }
]

Actual:

[
  {
    "element_id": "1db998dab696f74fde0d739e735f2000",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Too old to begin",
    "type": "NarrativeText"
  },
  {
    "element_id": "a4353319423f2724bc3f216847bf08fb",
    "metadata": {
      "category_depth": 0,
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "training of young Skywalker.",
    "type": "Title"
  },
  {
    "element_id": "1227d247a181c4f0aefb7992a1ee7ac4",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"],
      "parent_id": "a4353319423f2724bc3f216847bf08fb"
    },
    "text": "But teach him, I must.",
    "type": "NarrativeText"
  }
]

Additional context

Fixed by #3218. Recorded here to explain ingest test output changes and to inform the CHANGELOG.