Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): <div> with both text and phrasing child breaks element at phrasing child #3228

Open scanny opened 1 week ago


Summary

A <div> element that contains both text and a phrasing (inline) child element causes a new element to be started at the position of that child, so the <div> is split into two elements instead of one.
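To make the intended behavior concrete, here is a minimal stdlib-only sketch in which phrasing tags such as <b> contribute their text to the enclosing block instead of starting a new one. It is not how unstructured implements HTML partitioning; `PHRASING_TAGS`, `BlockTextCollector`, and `block_texts` are hypothetical names, and the tag set is an assumed illustrative subset.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of HTML phrasing (inline) tags.
PHRASING_TAGS = {"a", "b", "code", "em", "i", "span", "strong"}


class BlockTextCollector(HTMLParser):
    """Collect one text string per block, keeping phrasing content inline."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished block texts
        self._parts = []  # text fragments of the block being built

    def flush(self):
        # Close the block being built and keep it if it has any text.
        text = "".join(self._parts).strip()
        if text:
            self.blocks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only a non-phrasing (block-level) tag ends the current block.
        if tag not in PHRASING_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in PHRASING_TAGS:
            self.flush()

    def handle_data(self, data):
        self._parts.append(data)


def block_texts(html):
    """Return the text of each block found in `html`."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


print(block_texts("<div>foo <b>bar</b></div>"))
```

With this approach the <b> child never triggers a flush, so the text of the <div> stays in a single block.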

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = "<div>foo <b>bar</b></div>"

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected

[
  {
    "element_id": "13576562f85267c48e44286353dbc991",
    "metadata": {
      "emphasized_text_contents": ["bar"],
      "emphasized_text_tags": ["b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "foo bar",
    "type": "UncategorizedText"
  }
]

Actual

[
  {
    "element_id": "0e9776f45d842a2f3a93ff9684d65810",
    "metadata": {
      "category_depth": 0,
      "emphasized_text_contents": ["bar"],
      "emphasized_text_tags": ["b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "foo ",
    "type": "Title"
  },
  {
    "element_id": "ee1d605475cd837168c9b4059c679c28",
    "metadata": {
      "category_depth": 0,
      "emphasized_text_contents": ["bar"],
      "emphasized_text_tags": ["b"],
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "bar",
    "type": "Title"
  }
]
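When reviewing changed ingest test output, it can help to reduce each element list to its type and text. The sketch below uses only the `json` module; `summarize` is a hypothetical helper (not part of unstructured), and the two JSON strings are abbreviated copies of the outputs above with only `text` and `type` kept.

```python
import json


def summarize(elements_json):
    """Return (type, text) pairs for each element in a JSON element list."""
    return [(e["type"], e["text"]) for e in json.loads(elements_json)]


# Abbreviated copies of the expected and actual outputs shown above.
expected = '[{"text": "foo bar", "type": "UncategorizedText"}]'
actual = '[{"text": "foo ", "type": "Title"}, {"text": "bar", "type": "Title"}]'

print(summarize(expected))
print(summarize(actual))

# The split preserves the characters but spreads them over two elements.
print("".join(text for _, text in summarize(actual)))
```

Comparing the two summaries shows both symptoms at once: the element count goes from one to two, and the type changes from UncategorizedText to Title.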

Additional context

Fixed by #3218. Recorded here to explain ingest test output changes and to inform the CHANGELOG.