Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.62k stars, 704 forks

feat/include_location_for_xlsx_partition #3236

Open marctorsoc opened 3 months ago

marctorsoc commented 3 months ago

Is your feature request related to a problem? Please describe.

I started using eparse, a wrapper for unstructured. eparse records the row-column (RC) position of the top-left cell of each element and stores it in element.metadata.data_source.record_locator.
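To make the request concrete, here is a minimal sketch of what a top-left cell locator could look like. The dictionary keys and the helper names `column_letters` and `make_record_locator` are hypothetical, chosen for illustration; they are not eparse's or unstructured's actual format.

```python
def column_letters(col: int) -> str:
    """Convert a zero-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    col += 1  # switch to one-based for the base-26 conversion
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def make_record_locator(sheet: str, row: int, col: int) -> dict:
    """Build a hypothetical locator for an element's top-left cell.

    `row` and `col` are zero-based; the "a1" entry is the familiar
    spreadsheet-style reference for the same cell.
    """
    return {
        "sheet": sheet,
        "row": row,
        "col": col,
        "a1": f"{column_letters(col)}{row + 1}",
    }


if __name__ == "__main__":
    # An element whose top-left cell is the second row, third column.
    print(make_record_locator("Sheet1", row=1, col=2))
```

Storing both the numeric indices and the A1-style reference is a design choice for the sketch: the indices are easier to compare in code, while the A1 form is easier to check against the workbook by eye.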

Describe the solution you'd like

I'd like unstructured's XLSX partitioning to include the same location information in the element metadata.

Describe alternatives you've considered

I prefer unstructured over eparse, because for tables like

table name |
header 1   | header 2 | header 3 |
value11    | value12  | value13  |
value21    | value22  | value23  |

eparse gives me only the first column, whereas unstructured gives me a title and a table. So unstructured works better for my use case.

Additional context

In addition to the better partitioning, I'd like to have each element's location so that I can assign the title as the name of the table that follows it.
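As a rough sketch of that use case, assuming each element carried the zero-based row and column of its top-left cell, a table could be named after the closest title above it in the same column. The `LocatedElement` class and `title_for_table` function below are hypothetical stand-ins, not part of unstructured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocatedElement:
    """Hypothetical stand-in for an element plus its top-left cell location."""
    category: str  # "Title" or "Table"
    text: str
    row: int  # zero-based row of the top-left cell
    col: int  # zero-based column of the top-left cell


def title_for_table(
    table: LocatedElement, elements: list[LocatedElement]
) -> Optional[str]:
    """Return the text of the nearest title above `table` in the same column."""
    candidates = [
        e
        for e in elements
        if e.category == "Title" and e.col == table.col and e.row < table.row
    ]
    if not candidates:
        return None
    # The closest title is the one with the largest row index above the table.
    return max(candidates, key=lambda e: e.row).text


if __name__ == "__main__":
    elements = [
        LocatedElement("Title", "table name", row=0, col=0),
        LocatedElement("Table", "header 1 header 2 header 3 ...", row=1, col=0),
    ]
    print(title_for_table(elements[1], elements))
```

Matching on the same column is only one possible heuristic; a real sheet might need a tolerance on the column or a limit on how many rows may separate the title from the table.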

I'm happy to contribute this myself, but first I wanted to check whether the maintainers would welcome the contribution, and whether the library already has something for this that I missed.