Closed rbiseck3 closed 2 months ago
FYI, in https://unstructured-io.github.io/unstructured/bricks/partition.html we have a table that shows all doc types with table support.
@rbiseck3 are we sure that's the behavior we want?
To me, infer_table_structure means "take extra time to use inference to detect tabular structures from images where no explicit tabular structure is available".
In the case of DOCX, PPTX, and HTML, that explicit table structure is immediately available and there is no perceptible time penalty for computing the .text_as_html once we have the text.
I'm inclined to think this behavior is already just how we would want it. If an end-user has no use for .text_as_html they can simply ignore it, as I expect they would most metadata for any given element.
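For illustration, ignoring the field downstream could look roughly like the sketch below. It assumes the elements have already been serialized to plain dicts with a metadata mapping; drop_text_as_html and the sample data are hypothetical and are not part of unstructured.

```python
from copy import deepcopy


def drop_text_as_html(elements):
    """Return copies of element dicts with metadata['text_as_html'] removed."""
    cleaned = []
    for element in elements:
        element = deepcopy(element)  # leave the caller's data untouched
        element.get("metadata", {}).pop("text_as_html", None)
        cleaned.append(element)
    return cleaned


# Hypothetical sample data standing in for serialized partition output.
elements = [
    {"type": "Title", "text": "Results", "metadata": {"filename": "deck.pptx"}},
    {
        "type": "Table",
        "text": "a b",
        "metadata": {
            "filename": "deck.pptx",
            "text_as_html": "<table><tr><td>a</td><td>b</td></tr></table>",
        },
    },
]
cleaned = drop_text_as_html(elements)
```

The table text is kept either way; only the HTML rendering is discarded, which is why leaving the field populated costs a consumer very little.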
Closing, we did a refactor on skip_infer_table_types and infer_table_structure in the spring.
Describe the bug
Currently the infer_table_structure and skip_infer_table_types parameters of the auto() partition method only get applied to image and PDF doc types. This should be taken into account for any other partitioners that extract table content.

To Reproduce
Run a pptx doc (i.e. example-docs/layout-parser-paper-with-table.jpg) through the partition method. No matter what the parameters are set to, text_as_html is still populated with the table content.

Expected behavior
Running the partitioner over any doc type with the right parameter combination will omit table data.
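
The expected behavior amounts to one gate applied the same way by every partitioner that extracts tables. A minimal sketch of that gate follows, assuming file types are compared as lowercase extensions without the leading dot; should_populate_table_html is a hypothetical name and this is not how unstructured implements these parameters.

```python
def should_populate_table_html(file_type, infer_table_structure=False,
                               skip_infer_table_types=()):
    """Hypothetical gate: emit text_as_html only when table inference is
    requested and the document type has not been explicitly skipped."""
    if not infer_table_structure:
        return False
    # Normalize so "PDF", ".pdf", and "pdf" are treated as the same type.
    skipped = {t.lower().lstrip(".") for t in skip_infer_table_types}
    return file_type.lower().lstrip(".") not in skipped


# With pptx in the skip list, table HTML would be omitted for a pptx doc.
pptx_skipped = should_populate_table_html("pptx", True, ["pptx"])
pptx_allowed = should_populate_table_html("pptx", True, ["pdf", "jpg"])
```

Under this sketch a pptx document would carry text_as_html only when infer_table_structure is set and pptx is absent from skip_infer_table_types.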