Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.92k stars 733 forks

bug/infer_table_structure usage not used by all partitioners #1710

Closed rbiseck3 closed 2 months ago

rbiseck3 commented 1 year ago

Describe the bug: Currently the infer_table_structure and skip_infer_table_types parameters of the auto() partition method are only applied to image and PDF doc types. They should also be taken into account by any other partitioner that extracts table content.

To Reproduce: Run a pptx doc (e.g. example-docs/layout-parser-paper-with-table.jpg) through the partition method. No matter what the parameters are set to, text_as_html is still populated with the table content.
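A minimal sketch of how this check could be scripted. The keyword name infer_table_structure is taken from the wording of this issue and may differ between library versions. The helpers tables_with_html and reproduce are hypothetical names used only for illustration, and the document path is left as a parameter.

```python
from types import SimpleNamespace


def tables_with_html(elements):
    """Return the elements whose metadata carries a non-empty text_as_html."""
    found = []
    for el in elements:
        metadata = getattr(el, "metadata", None)
        html = getattr(metadata, "text_as_html", None)
        if html:
            found.append(el)
    return found


def reproduce(path):
    """Partition `path` with table inference off and report leftover HTML.

    Requires the `unstructured` package. The keyword argument below is an
    assumption based on the parameter name used in this issue.
    """
    from unstructured.partition.auto import partition  # imported lazily

    elements = partition(filename=path, infer_table_structure=False)
    return tables_with_html(elements)


# Stand-in elements (not real library objects) to show what the helper reports.
demo = [
    SimpleNamespace(metadata=SimpleNamespace(text_as_html="<table></table>")),
    SimpleNamespace(metadata=SimpleNamespace(text_as_html=None)),
]
leftover = tables_with_html(demo)
```

If reproduce() returns a non-empty list for a non-image, non-PDF document, the behavior described above is present.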

Expected behavior: Running the partitioner over any doc type with the right parameter combination should omit table data.

yuming-long commented 1 year ago

FYI, in https://unstructured-io.github.io/unstructured/bricks/partition.html we have a table that shows all doc types with table support.

scanny commented 11 months ago

@rbiseck3 are we sure that's the behavior we want?

To me, infer_table_structure means "take extra time to use inference to detect tabular structures from images where no explicit tabular structure is available".

In the case of DOCX, PPTX, and HTML, that explicit table structure is immediately available and there is no perceptible time penalty for computing the .text_as_html once we have the text.

I'm inclined to think this behavior is already just how we would want it. If an end-user has no use for .text_as_html they can simply ignore it, as I expect they would most of the metadata for any given element.
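For anyone who would rather drop the field than ignore it, a post-processing step is enough. This is a minimal sketch, assuming each element has been serialized to a plain dict with a nested "metadata" dict; drop_text_as_html is a hypothetical helper, not part of the library.

```python
import copy


def drop_text_as_html(element_dicts):
    """Return a copy of the element dicts with metadata.text_as_html removed."""
    cleaned = copy.deepcopy(element_dicts)  # leave the caller's data untouched
    for el in cleaned:
        # Elements without metadata, or without the key, pass through unchanged.
        el.get("metadata", {}).pop("text_as_html", None)
    return cleaned


# Hand-written sample in the assumed shape, not real partitioner output.
sample = [
    {"type": "Table", "text": "a b", "metadata": {"text_as_html": "<table></table>"}},
    {"type": "Title", "text": "Heading", "metadata": {}},
]
cleaned_sample = drop_text_as_html(sample)
```

The table text itself is kept; only the HTML rendering in the metadata is removed.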

MthwRobinson commented 2 months ago

Closing, we did a refactor on skip_infer_table_types and infer_table_structure in the spring.