Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug(html): source-formatting whitespace appears in link_texts metadata and Element text #3230

Open scanny opened 1 week ago


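Summary

Whitespace that exists only to format the HTML source (newlines and indentation) is carried into the link_texts metadata and the Element text instead of being normalized.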

To Reproduce

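from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json

# HTML indented for readability; the newlines and indentation are source formatting only.
html_text = """
<p>
  foo
  <a href="http://eie.io">
    bar
  </a>
</p>
"""

elements = partition_html(text=html_text)
print(elements_to_json(elements, indent=2))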

Expected:

[
  {
    "element_id": "13576562f85267c48e44286353dbc991",
    "metadata": {
      "filetype": "text/html",
      "languages": ["eng"],
      "link_texts": ["bar"],
      "link_urls": ["http://eie.io"]
    },
    "text": "foo bar",
    "type": "UncategorizedText"
  }
]

Actual:

[
  {
    "element_id": "cfa1474ad714b51d88959042ecce79dd",
    "metadata": {
      "category_depth": 0,
      "filetype": "text/html",
      "languages": ["eng"],
      "link_texts": ["\n    bar\n  "],
      "link_urls": ["http://eie.io"]
    },
    "text": "foo\n  \n    bar",
    "type": "Title"
  }
]
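
The difference between the two outputs comes down to whitespace normalization. As a minimal sketch of the behavior the expected output implies (not the actual change made in the library; the helper name normalize_html_whitespace is hypothetical), collapsing each run of whitespace to a single space and trimming the ends turns the "Actual" strings into the "Expected" ones:

```python
import re

# Any run of whitespace, including the newlines and indentation
# that come from source formatting.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_html_whitespace(text: str) -> str:
    """Hypothetical helper: collapse whitespace runs to one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


# Input strings copied from the "Actual" output above.
link_text = normalize_html_whitespace("\n    bar\n  ")
element_text = normalize_html_whitespace("foo\n  \n    bar")

print(repr(link_text))     # 'bar'
print(repr(element_text))  # 'foo bar'
```

This only illustrates the text values. It does not account for the other differences between the two outputs, such as the element type and category_depth.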

Additional context

Fixed by #3218. Recorded here to explain ingest test output changes and to inform CHANGELOG.