Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): distinct paragraphs within <li> are squashed into single element #3245

Open scanny opened 1 week ago

scanny commented 1 week ago

Summary

Block items nested within an <li> element are squashed into a single ListItem element. Also, formatting whitespace (the newlines and indentation from the HTML source) is not normalized in the resulting text.

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<ul>
  <li>
    <p>One of the <b>things</b> Ford Prefect had always found.</p>
    Hardest to <i>understand</i> about humans was.
    <p>Their habit of continually <b>stating</b> and <b>repeating</b> the.</p>
    very <i>very</i> obvious.
  </li>
</ul>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[
  {
    "element_id": "a0d3c85c1ac52e3097090f947dd0ba4f",
    "metadata": {
      "emphasized_text_contents": ["things"],
      "emphasized_text_tags": ["b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "One of the things Ford Prefect had always found.",
    "type": "NarrativeText"
  },
  {
    "element_id": "18da4b100dbb92e55b91f35fc27aa23c",
    "metadata": {
      "emphasized_text_contents": ["understand"],
      "emphasized_text_tags": ["i"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Hardest to understand about humans was.",
    "type": "NarrativeText"
  },
  {
    "element_id": "5bcd5901935365c374f3170479beffdf",
    "metadata": {
      "emphasized_text_contents": [
        "stating",
        "repeating"
      ],
      "emphasized_text_tags": ["b", "b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "Their habit of continually stating and repeating the.",
    "type": "NarrativeText"
  },
  {
    "element_id": "90c24bc473cb71ec531c5543af409270",
    "metadata": {
      "category_depth": 0,
      "emphasized_text_contents": ["very"],
      "emphasized_text_tags": ["i"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "very very obvious.",
    "type": "Title"
  }
]
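
The splitting expected above can be illustrated with a minimal sketch that uses only Python's standard-library `html.parser`. This is not how `unstructured` implements the fix in #3218; the class name `ListItemBlockSplitter` and the `BLOCK_TAGS` set are hypothetical and exist only for this illustration. The idea is that each block-level tag boundary ends the current run of text, so loose text between two `<p>` elements becomes its own block, while inline tags such as `<b>` and `<i>` do not interrupt a block.

```python
from html.parser import HTMLParser

# Hypothetical set of block-level tags, chosen for this illustration only.
BLOCK_TAGS = {"ul", "ol", "li", "p", "div"}


class ListItemBlockSplitter(HTMLParser):
    """Collect one text block per block-level boundary (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def _flush(self):
        # Join the buffered text and collapse formatting whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # A block-level start tag ends whatever loose text came before it.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        # A block-level end tag closes the block being collected.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> and <i> are ignored, so their text lands here.
        self._buffer.append(data)


html_text = """
<ul>
  <li>
    <p>One of the <b>things</b> Ford Prefect had always found.</p>
    Hardest to <i>understand</i> about humans was.
    <p>Their habit of continually <b>stating</b> and <b>repeating</b> the.</p>
    very <i>very</i> obvious.
  </li>
</ul>
"""

splitter = ListItemBlockSplitter()
splitter.feed(html_text)
splitter.close()
splitter._flush()  # pick up any trailing text after the last tag
blocks = splitter.blocks

for block in blocks:
    print(block)
```

The sketch yields four text blocks, one per expected element. It does not attempt element typing (NarrativeText vs. Title) or emphasis metadata.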

Actual:

[
  {
    "element_id": "cb8c2c5a83e9e6aa308d078d4510205b",
    "metadata": {
      "category_depth": 1,
      "emphasized_text_contents": [
        "things",
        "understand",
        "stating",
        "repeating",
        "very"
      ],
      "emphasized_text_tags": [
        "b",
        "i",
        "b",
        "b",
        "i"
      ],
      "filetype": "text/html",
      "languages": [
        "eng"
      ]
    },
    "text": "One of the things Ford Prefect had always found.\n    Hardest to understand about humans was.\n    Their habit of continually stating and repeating the.\n    very very obvious.",
    "type": "ListItem"
  }
]
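
The whitespace half of the problem is visible in the "text" value above: the `\n` and four-space indentation from the HTML source survive into the output. As a rough sketch of what normalization means here (the helper name `normalize_whitespace` is hypothetical and not part of the `unstructured` API), every run of whitespace is collapsed into a single space:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()


# The "text" value from the actual output above.
actual_text = (
    "One of the things Ford Prefect had always found.\n"
    "    Hardest to understand about humans was.\n"
    "    Their habit of continually stating and repeating the.\n"
    "    very very obvious."
)

normalized = normalize_whitespace(actual_text)
print(normalized)
```

Normalizing alone does not address the main bug: the four paragraphs would still be merged into one element, only without the stray newlines and indentation.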

Additional context

Fixed by #3218. Recorded here to explain ingest test output changes and to inform the CHANGELOG.