Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Unable to parse xlsx files #2326

Open sundaraa-deshaw opened 6 months ago

sundaraa-deshaw commented 6 months ago

Describe the bug: Reading an Excel file (.xlsx) with the unstructured library fails during the partition step. This happens both when calling the library through LangChain and when calling it directly.

To Reproduce:

    from unstructured.partition.xlsx import partition_xlsx
    partition_xlsx("./new1.xlsx")
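For anyone trying to confirm this, the reproduction can be wrapped so that the traceback is captured as text rather than as a screenshot. This is only a sketch: `capture_failure` and `reproduce` are hypothetical helper names, not part of unstructured, and the path `./new1.xlsx` is the file from this report.

```python
import traceback


def capture_failure(func, *args, **kwargs):
    """Call func; return (result, None) on success or (None, traceback text) if it raises."""
    try:
        return func(*args, **kwargs), None
    except Exception:
        return None, traceback.format_exc()


def reproduce(path="./new1.xlsx"):
    """Run the failing call from this report and print the traceback as plain text."""
    # Imported lazily so capture_failure stays usable without unstructured installed.
    from unstructured.partition.xlsx import partition_xlsx

    elements, error = capture_failure(partition_xlsx, filename=path)
    if error is not None:
        print(error)  # paste this text into the issue
    else:
        print(f"partitioned into {len(elements)} elements")
    return elements, error
```

Calling `reproduce()` in the affected environment prints either the full traceback or the number of elements returned.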

Expected behavior: The file is partitioned successfully.

Screenshots: This is the file content. It is a simple xlsx file. [image: screenshot of the file content]

Environment Info: I am using unstructured 0.11.6, which is installed as a transitive dependency of LangChain.
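Because unstructured arrives here as a transitive dependency, it may help to list the versions actually installed. The sketch below uses only the standard library. `installed_versions` is a hypothetical helper, and the package list is an assumption about which packages are relevant to xlsx parsing, so adjust it as needed.

```python
from importlib import metadata


def installed_versions(packages=("unstructured", "langchain", "openpyxl", "pandas")):
    """Map each package name to its installed version, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```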

Additional context: Error trace: [image: screenshot of the error trace]

lfc07 commented 5 months ago

I have the same problem.