Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): nested lists are squashed #3229

Open scanny opened 1 week ago

scanny commented 1 week ago

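Summary

When a list item contains a nested list, partition_html() squashes the nested items into a single ListItem whose text joins them. That element carries the parent's category_depth and no parent_id.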

To Reproduce

from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

html_text = """
<ul>
  <li>foo</li>
  <li>
    <ol>
      <li>first</li>
      <li>second</li>
    </ol>
  </li>
</ul>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))

Expected:

[
  {
    "element_id": "0b460a31b167710ce27995abb2dc4cbd",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "foo",
    "type": "ListItem"
  },
  {
    "element_id": "2a3077c93b2a754629ee52b0e4e8ff11",
    "metadata": {
      "category_depth": 2,
      "filetype": "text/html",
      "languages": ["eng"],
      "parent_id": "0b460a31b167710ce27995abb2dc4cbd"
    },
    "text": "first",
    "type": "ListItem"
  },
  {
    "element_id": "1b1e0e9be12f02351e2085308766a44f",
    "metadata": {
      "category_depth": 2,
      "filetype": "text/html",
      "languages": ["eng"],
      "parent_id": "0b460a31b167710ce27995abb2dc4cbd"
    },
    "text": "second",
    "type": "ListItem"
  }
]

Actual:

[
  {
    "element_id": "0b460a31b167710ce27995abb2dc4cbd",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "foo",
    "type": "ListItem"
  },
  {
    "element_id": "9076282f2333a371a3f2889f789e6641",
    "metadata": {
      "category_depth": 1,
      "filetype": "text/html",
      "languages": ["eng"]
    },
    "text": "first\n      second",
    "type": "ListItem"
  }
]
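
The expected output above amounts to one entry per list item, with a depth equal to the number of enclosing lists. A minimal standard-library sketch of that idea follows. It is not how unstructured partitions HTML; `ListItemCollector` and `list_items` are hypothetical names used only for illustration, and `parent_id` is not modelled.

```python
from html.parser import HTMLParser

LIST_TAGS = ("ul", "ol")


class ListItemCollector(HTMLParser):
    """Record one frame per <li>, keeping nested items separate (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0   # how many <ul>/<ol> elements are currently open
        self.open_items = []  # stack of frames for <li> elements still open
        self.frames = []      # every <li> frame seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag in LIST_TAGS:
            self.list_depth += 1
        elif tag == "li":
            frame = {"depth": self.list_depth, "parts": []}
            self.frames.append(frame)
            self.open_items.append(frame)

    def handle_endtag(self, tag):
        if tag in LIST_TAGS:
            self.list_depth = max(0, self.list_depth - 1)
        elif tag == "li" and self.open_items:
            self.open_items.pop()

    def handle_data(self, data):
        # Text belongs only to the innermost open <li>, never to its ancestors.
        if self.open_items:
            self.open_items[-1]["parts"].append(data)


def list_items(html):
    """Return (depth, text) pairs for every <li> that has text of its own."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    items = []
    for frame in parser.frames:
        text = " ".join("".join(frame["parts"]).split())
        if text:  # skip <li> wrappers that only hold a nested list
            items.append((frame["depth"], text))
    return items


html_text = """
<ul>
  <li>foo</li>
  <li>
    <ol>
      <li>first</li>
      <li>second</li>
    </ol>
  </li>
</ul>
"""

print(list_items(html_text))
# [(1, 'foo'), (2, 'first'), (2, 'second')]
```

Attaching text only to the innermost open `<li>` is what keeps "first" and "second" apart. Attributing all descendant text to the outer item would reproduce the squashed result shown under Actual.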

Additional context

Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.