Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Add text as html to orig elements chunks #3779

Closed plutasnyy closed 1 day ago

plutasnyy commented 1 week ago

This is the simplest solution: it no longer drops the HTML from metadata when merging Elements that come from HTML input. Two questions are still open: how to handle nested elements, and whether we want LayoutElements in the metadata of Composite Elements. A unit test shows the current behavior. Note: metadata still contains orig_elements, which carries all of the original elements' metadata.
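To make the described behavior concrete, here is a minimal, self-contained sketch of the idea, not the code from this PR. The `Metadata` and `Element` classes and the `merge_into_chunk` helper are hypothetical stand-ins; only the field names `text_as_html` and `orig_elements` are taken from the library's metadata, and joining the HTML fragments by plain concatenation is an assumption about the "simplest" approach.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:
    # Hypothetical stand-in for the library's element metadata.
    text_as_html: Optional[str] = None
    orig_elements: List["Element"] = field(default_factory=list)


@dataclass
class Element:
    # Hypothetical stand-in for a parsed document element.
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def merge_into_chunk(elements: List[Element]) -> Element:
    """Merge elements into one composite chunk without dropping their HTML."""
    # Join the plain text of all source elements.
    text = "\n\n".join(e.text for e in elements)
    # Keep the HTML: concatenate the fragments of elements that have one.
    # Nested elements are NOT handled here (assumption: flat input).
    html_parts = [
        e.metadata.text_as_html for e in elements if e.metadata.text_as_html
    ]
    merged_html = "".join(html_parts) or None
    return Element(
        text=text,
        # The originals stay reachable, each with its own full metadata.
        metadata=Metadata(text_as_html=merged_html, orig_elements=list(elements)),
    )


if __name__ == "__main__":
    chunk = merge_into_chunk(
        [
            Element("Title", Metadata(text_as_html="<h1>Title</h1>")),
            Element("Body text", Metadata(text_as_html="<p>Body text</p>")),
        ]
    )
    print(chunk.metadata.text_as_html)  # <h1>Title</h1><p>Body text</p>
    print(len(chunk.metadata.orig_elements))  # 2
```

In this sketch the composite chunk carries the merged HTML directly, while each original element (and its own metadata) remains available through `orig_elements`.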