Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars 572 forks

Feat: Add-rc-locator-to-partition-excel #3258

Open marctorsoc opened 1 week ago

marctorsoc commented 1 week ago

As described in https://github.com/Unstructured-IO/unstructured/issues/3236, I wanted to add the RC (row/column) coordinate to the metadata of the returned elements. I'm adding both rc and excel_rc: the former is 0-based, the latter 1-based (as in Excel).
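To make the relationship between the two indexing schemes concrete, here is a minimal sketch. The helper names `to_excel_rc` and `to_a1` are hypothetical and not part of this PR, and the tuple shape of the metadata values is an assumption; the only thing taken from the description is that rc is 0-based and excel_rc is 1-based.

```python
from typing import Tuple


def to_excel_rc(rc: Tuple[int, int]) -> Tuple[int, int]:
    """Convert a 0-based (row, col) pair into a 1-based pair, as Excel counts.

    Hypothetical helper: the (row, col) tuple shape is an assumption.
    """
    row, col = rc
    if row < 0 or col < 0:
        raise ValueError("row and column must be non-negative")
    return (row + 1, col + 1)


def to_a1(rc: Tuple[int, int]) -> str:
    """Render a 0-based (row, col) pair in Excel's A1 notation.

    Shown only for illustration; the PR does not say it emits A1 strings.
    """
    row, col = rc
    if row < 0 or col < 0:
        raise ValueError("row and column must be non-negative")
    letters = ""
    n = col + 1  # column letters use a 1-based, bijective base-26 scheme
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{row + 1}"


rc = (0, 0)                  # top-left cell, 0-based
excel_rc = to_excel_rc(rc)   # same cell, 1-based
```

Under these assumptions, the top-left cell has rc `(0, 0)` and excel_rc `(1, 1)`, which corresponds to cell A1 in a spreadsheet.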

In addition, I'm changing the behaviour of starting_page_number so that sheets can be skipped if required. I don't think the previous behaviour was very useful, but I'm happy to be convinced otherwise. Still, it'd be nice to have a way to specify which sheets we care about.
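One possible reading of the skipping behaviour is sketched below. It assumes sheets are numbered from 1 in workbook order and that any sheet numbered below starting_page_number is skipped; the function `select_sheets` is hypothetical and is not the implementation in this PR.

```python
from typing import List, Sequence, Tuple


def select_sheets(
    sheet_names: Sequence[str], starting_page_number: int = 1
) -> List[Tuple[int, str]]:
    """Return (page_number, sheet_name) pairs for the sheets that are kept.

    Assumption: page numbers are 1-based positions in the workbook, and
    sheets before `starting_page_number` are skipped rather than renumbered.
    """
    if starting_page_number < 1:
        raise ValueError("starting_page_number must be >= 1")
    return [
        (page_number, name)
        for page_number, name in enumerate(sheet_names, start=1)
        if page_number >= starting_page_number
    ]


# Example workbook with three sheets; start from the second one.
kept = select_sheets(["Summary", "Data", "Notes"], starting_page_number=2)
```

With this interpretation the kept sheets retain their original page numbers, so element metadata still points at the right sheet. Selecting an arbitrary subset of sheets, as suggested above, would need a separate parameter.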