Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.8k stars, 626 forks

fix: set `resolve_entities=False` in `partition_xml` #3088

Closed by MthwRobinson 2 months ago

MthwRobinson commented 2 months ago

Summary

Closes #3078. Sets `resolve_entities=False` on the lxml parser used in `partition_xml`, so that entity references in the XML are not expanded and text cannot be dynamically injected into the document.
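For illustration only, here is a minimal sketch of what the flag changes when parsing with lxml. It is not the diff from this PR: the sample document and the `paragraph_text` helper are hypothetical, and only `etree.XMLParser(resolve_entities=...)` and `etree.fromstring` are real lxml calls.

```python
from lxml import etree

# Hypothetical sample: an internal DTD declares an entity that is then
# referenced inside the document body.
SAMPLE_XML = (
    b'<?xml version="1.0"?>'
    b'<!DOCTYPE doc [<!ENTITY injected "INJECTED TEXT">]>'
    b"<doc><p>Hello &injected;</p></doc>"
)


def paragraph_text(xml_bytes: bytes, resolve_entities: bool) -> str:
    """Return the leading text of the first <p> element (hypothetical helper)."""
    parser = etree.XMLParser(resolve_entities=resolve_entities)
    root = etree.fromstring(xml_bytes, parser)
    # .text holds only the text before the first child node, so an
    # unresolved entity reference is not included in it.
    return root.find("p").text or ""


# With resolution enabled, the entity's replacement text becomes part of
# the element text.
print(repr(paragraph_text(SAMPLE_XML, resolve_entities=True)))

# With resolution disabled, the reference stays an unexpanded entity node
# and the replacement text does not appear in the element text.
print(repr(paragraph_text(SAMPLE_XML, resolve_entities=False)))
```

How `partition_xml` builds its parser and walks the tree may differ from this sketch; the point is only that with `resolve_entities=False` the entity's replacement text is not merged into the parsed text.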

Testing

`pytest test_unstructured/partition/test_xml.py` continues to pass with the update.