Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/<short-name> Writing back the unstructured extracted partitions to the same file format #3699

Closed. SinaRanjkeshzade closed this 1 month ago

SinaRanjkeshzade commented 1 month ago

Is your feature request related to a problem? Please describe.
In some use cases, we need to read files with Unstructured, process the extracted text to generate new text, and write the result back. Since the input file formats can vary, a "write" capability would be very helpful. Specifically, if Unstructured could use the metadata of each partition to save the text in the original format, the library would be more useful. For example, if a piece of text is centered or was extracted from an image, writing it back in the same form would be beneficial.

Describe the solution you'd like
I would like functionality that writes the partitions back to the same file format while preserving the original structure of the content.

Describe alternatives you've considered
I don't have an alternative that preserves the structure, but it would be feasible to implement separate file writers, each one writing text to a specific file format.
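To make the per-format writer idea concrete, here is a minimal sketch of what I have in mind. It is not part of the unstructured library: the `Element` class, the `WRITERS` registry, and the `write_elements` function are all hypothetical names I made up for illustration, and the sketch only writes text plus a category, so it does not preserve layout such as centering.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List


@dataclass
class Element:
    """Hypothetical stand-in for an extracted partition: a category plus its text."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def write_txt(elements: List[Element], path: Path) -> None:
    # Plain text: one element per paragraph, no structure preserved.
    path.write_text("\n\n".join(e.text for e in elements), encoding="utf-8")


def write_md(elements: List[Element], path: Path) -> None:
    # Markdown: render titles as headings, everything else as paragraphs.
    lines = [f"# {e.text}" if e.category == "Title" else e.text for e in elements]
    path.write_text("\n\n".join(lines), encoding="utf-8")


# Hypothetical registry mapping a file extension to the writer for that format.
WRITERS: Dict[str, Callable[[List[Element], Path], None]] = {
    ".txt": write_txt,
    ".md": write_md,
}


def write_elements(elements: List[Element], path: Path) -> None:
    """Pick a writer based on the target file's extension and call it."""
    try:
        writer = WRITERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"no writer registered for {path.suffix!r}") from None
    writer(elements, path)
```

With this shape, supporting another format would mean adding one function and one registry entry. Formats with richer layout would need much more metadata than this sketch carries.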

scanny commented 1 month ago

Hi @SinaRanjkeshzade, such a feature set would be outside the scope of the unstructured library.

In general it is not possible to reconstruct an original document from the document elements we extract from it. The document elements are purposely focused solely on the content of interest to downstream NLP processes.

But mostly it's just not part of the purpose and intended use of the library.