This PR changes the output of table elements: by default, a table element's metadata.table_as_cells is now None. The data is populated only when the environment variable EXTRACT_TABLE_AS_CELLS is set to true.
table_as_cells was originally designed for evaluating table extraction performance. The format is not as readable as the table_as_html metadata for human or RAG consumption, so this data is not needed by default.
Since this output is meant for evaluation, this PR uses an environment variable to control whether it is present in the partitioned results. This approach avoids adding a parameter to the partition function call. A new parameter would make the partition interface more complex and raise maintenance cost, because it would have to be passed down a long chain of function calls to reach the place where it is needed.
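For illustration, an environment-variable gate of this kind could look roughly like the sketch below. This is not the code from this PR: env_flag and build_table_metadata are hypothetical names, the metadata is modeled as a plain dict, and treating the value case-insensitively is an assumption.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Return True when the environment variable `name` is set to "true".

    Hypothetical helper; the case-insensitive comparison is an assumption.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() == "true"


def build_table_metadata(html: str, cells: list) -> dict:
    """Hypothetical stand-in for the code that fills in table metadata."""
    metadata = {"table_as_html": html, "table_as_cells": None}
    # Populate the evaluation-only field only when the flag is enabled.
    if env_flag("EXTRACT_TABLE_AS_CELLS"):
        metadata["table_as_cells"] = cells
    return metadata
```

Because the flag is read at the point where the metadata is built, nothing upstream in the call chain needs to know about it.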
test
Run the following code snippet on main and on this PR:
from unstructured.partition.auto import partition
elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[])
table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"]
On the main branch, table_cells contains cell-structured data, but on this branch it is a list of None.
However, if we first set the variable in the terminal:
export EXTRACT_TABLE_AS_CELLS=true
and then run the same code again with this PR, table_cells contains actual data, the same as on the main branch.