Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Unstructured library is unable to extract CDATA from XML data #3075

Open PhaneendraGunda opened 3 months ago

PhaneendraGunda commented 3 months ago

Sample XML:

<GENERAL_INFO><TITLE><![CDATA[Mobile Apple Devices (iPhones, iPads, and Smartwatches)]]></TITLE><SUMMARY><![CDATA[<p>This article highlights the key benefits and specifications of Apple iPhones, iPads, and Smartwatches.</p>]]></SUMMARY></GENERAL_INFO>

Code to fetch data from the XML:

from unstructured.partition.html import partition_html

# _html_text holds the sample XML above as a string
_text = ' '.join(element.text for element in partition_html(text=_html_text))

Is there a flag or function that enables extracting the content of CDATA sections?

shreyanid commented 3 months ago

Thanks for the issue, @PhaneendraGunda! We'll discuss and follow up.
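
Until there is an answer on the library side, one possible workaround is to read the CDATA sections with Python's standard library before (or instead of) calling partition_html. The sketch below is not part of unstructured and is only an illustration: the helper names extract_cdata_text and strip_tags are hypothetical, and the sample assumes the SUMMARY CDATA section is properly closed with ]]>. It relies on xml.etree.ElementTree exposing CDATA content as ordinary element text, and on html.parser.HTMLParser to drop the embedded <p> tags.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores its tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment: str) -> str:
    # Hypothetical helper: return the fragment's text with HTML tags removed.
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts).strip()


def extract_cdata_text(xml_text: str) -> str:
    # Hypothetical helper: the XML parser hands CDATA content back as
    # plain element text, so no CDATA-specific handling is needed here.
    root = ET.fromstring(xml_text)
    pieces = [
        strip_tags(el.text)
        for el in root.iter()
        if el.text and el.text.strip()
    ]
    return " ".join(pieces)


# Sample from the issue, assuming the SUMMARY CDATA section ends with ]]>.
sample = (
    "<GENERAL_INFO>"
    "<TITLE><![CDATA[Mobile Apple Devices (iPhones, iPads, and Smartwatches)]]></TITLE>"
    "<SUMMARY><![CDATA[<p>This article highlights the key benefits and "
    "specifications of Apple iPhones, iPads, and Smartwatches.</p>]]></SUMMARY>"
    "</GENERAL_INFO>"
)
text = extract_cdata_text(sample)
print(text)
```

If unstructured elements are still needed downstream, the recovered HTML fragments (for example the SUMMARY text before strip_tags is applied) could be passed to partition_html(text=...) one at a time; whether that fits a given pipeline is an assumption to verify.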