Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

fix: disable table_as_cells output by default #3093

Closed badGarnet closed 2 months ago

badGarnet commented 2 months ago

This PR changes the output of table elements: by default, a table element's metadata.table_as_cells is now None. The field is populated only when the environment variable EXTRACT_TABLE_AS_CELLS is set to true.

table_as_cells was originally designed for evaluating table extraction performance. The format is less readable than the table_as_html metadata for human or RAG consumption, so this data is not needed by default.

Since this output is meant for evaluation use, this PR uses an environment variable to control whether it is present in the partitioned results. This approach avoids adding a parameter to the partition function call. A new parameter would make the partition interface more complex and raise maintenance cost, because it would have to be passed down a long chain of function calls to reach the place where it is needed.
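The gating described above could look roughly like the sketch below. This is not the code from the PR: the helper names table_as_cells_enabled and attach_table_metadata are hypothetical, the metadata is modeled as a plain dict, and the case-insensitive comparison against "true" is an assumption. Only the variable name EXTRACT_TABLE_AS_CELLS and the two metadata field names come from the description.

```python
import os

# Variable name taken from the PR description; everything else is illustrative.
ENV_VAR = "EXTRACT_TABLE_AS_CELLS"


def table_as_cells_enabled(environ=None):
    """Hypothetical helper: True only when the variable is set to "true".

    Assumption: the comparison ignores case and surrounding whitespace.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "false").strip().lower() == "true"


def attach_table_metadata(metadata, html, cells, environ=None):
    """Hypothetical helper: always keep the HTML form, gate the cell form."""
    metadata["table_as_html"] = html
    # Cell data is kept only when the environment variable opts in.
    metadata["table_as_cells"] = cells if table_as_cells_enabled(environ) else None
    return metadata
```

Because the flag is read from the environment at the point where the metadata is built, no caller along the way needs to know about it, which is the trade-off the paragraph above argues for.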

test

Run the following code snippet on main and on this PR:

from unstructured.partition.auto import partition

elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"]

On the main branch, table_cells contains cell-structured data; on this branch it is a list of None values.

However, if we first set the variable in the terminal:

export EXTRACT_TABLE_AS_CELLS=true

and then run the same code again with this PR, table_cells contains actual data, the same as on the main branch.