Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): empty <li> element produces ListItem with no text #3237

Open scanny opened 1 week ago


Summary

partition_html() produces a ListItem element with no text for an empty <li> element or one that contains only whitespace.

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<ul>
  <li></li>
  <li>  \n  \t  </li>
</ul>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[]

Actual:

[
  {
    "element_id": "5336294a19f32ff03ef80066fbc3e0f7",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html"
    },
    "text": "",
    "type": "ListItem"
  },
  {
    "element_id": "c91476816a43e6f9216a68b58d92076a",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html"
    },
    "text": "",
    "type": "ListItem"
  }
]

Additional context

Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.
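Possible workaround

For anyone on a version that does not yet include the fix, one option is to filter the serialized output after partitioning. The sketch below is not part of the library: drop_empty_list_items is a hypothetical helper name, and it assumes the elements have already been converted to plain dicts shaped like the "Actual" output above (abbreviated here to the relevant keys). It uses only the standard library.

```python
import json


def drop_empty_list_items(element_dicts):
    """Return the element dicts, minus ListItem entries whose text is empty or whitespace-only.

    Hypothetical helper for illustration; other element types are left untouched.
    """
    kept = []
    for d in element_dicts:
        is_list_item = d.get("type") == "ListItem"
        has_text = bool((d.get("text") or "").strip())
        if is_list_item and not has_text:
            continue  # skip the empty ListItem
        kept.append(d)
    return kept


# Abbreviated copy of the "Actual" output from this issue (metadata omitted).
actual = json.loads("""
[
  {"element_id": "5336294a19f32ff03ef80066fbc3e0f7", "text": "", "type": "ListItem"},
  {"element_id": "c91476816a43e6f9216a68b58d92076a", "text": "", "type": "ListItem"}
]
""")

print(json.dumps(drop_empty_list_items(actual), indent=2))
```

With the two empty items from this issue as input, the filtered list is empty, which matches the "Expected" output. The filter is deliberately limited to ListItem so that other element types with legitimately empty text are not dropped.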