Unstructured-IO / unstructured-api

Apache License 2.0

The table appears to be less satisfactory in Excel format compared to PDF. Is there any room for further improvement? #421

Closed henjrchen closed 3 months ago

henjrchen commented 3 months ago

Describe the bug I originally expected the output shown below to be a single content, but it turned out to be two. Is this an issue, or is there another way to solve it? Thanks very much.

image

To Reproduce

curl -X 'POST' 'https://api.unstructured.io/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -H 'unstructured-api-key: xxxx' -F 'files=@table.xlsx' | jq -C . | less -R

table6.pdf table6.xlsx

tbs17 commented 3 months ago

Hi @henjrchen, I used your command with a small modification (see below) to write the response to a JSON file (attached below), and I see two elements returned: 1. an element of type 'title', and 2. an element of type 'table'. When you say 2 contents, are you referring to these same two elements? output.json

curl -X 'POST' 'https://api.unstructured.io/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -H 'unstructured-api-key: xxx' -F 'files=@table6.xlsx' -o output.json
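
For anyone who wants to check which element types came back without reading the raw JSON, here is a minimal Python sketch. It assumes the saved response is a JSON array of objects that each carry a "type" key; the helper name count_element_types and the sample data are hypothetical and are not the actual contents of output.json.

```python
import json
from collections import Counter


def count_element_types(elements):
    """Count how many elements of each type appear in a parsed response.

    Assumption: `elements` is a list of dicts, each with a "type" key.
    Elements without that key are counted under "unknown".
    """
    return Counter(el.get("type", "unknown") for el in elements)


# Illustrative sample only; not the real contents of output.json.
sample = [
    {"type": "Title", "element_id": "a1", "text": "Table 6", "metadata": {}},
    {"type": "Table", "element_id": "b2", "text": "cell text",
     "metadata": {"parent_id": "a1"}},
]

counts = count_element_types(sample)
for element_type, n in sorted(counts.items()):
    print(f"{element_type}: {n}")
```

To run it against a real response, the sample list could be replaced with json.load(open("output.json")), provided the file has the shape assumed above.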

henjrchen commented 3 months ago

Hi @tbs17, thank you for your quick reply. Yes, that’s what I meant. With the PDF, the API returns a single table, whereas with the Excel file it returns two elements, which I did not expect. However, I found the ‘parent_id’ information, which allows me to link these two elements together. Thanks!
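
The parent_id linking mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: each element is assumed to have an "element_id" key and to store its parent reference at metadata["parent_id"]; the function name link_by_parent and the sample data are hypothetical.

```python
def link_by_parent(elements):
    """Group child elements under the element their parent_id points to.

    Assumptions: every element has an "element_id", and a child stores
    its parent's id under metadata["parent_id"]. Children whose parent
    is not present in `elements` are skipped.
    """
    by_id = {el["element_id"]: el for el in elements}
    groups = {}
    for el in elements:
        parent_id = el.get("metadata", {}).get("parent_id")
        if parent_id in by_id:
            groups.setdefault(parent_id, []).append(el)
    return [
        {"parent": by_id[pid], "children": kids}
        for pid, kids in groups.items()
    ]


# Illustrative sample only; not the real API output for table6.xlsx.
sample = [
    {"type": "Title", "element_id": "a1", "text": "Table 6", "metadata": {}},
    {"type": "Table", "element_id": "b2", "text": "cell text",
     "metadata": {"parent_id": "a1"}},
]

for group in link_by_parent(sample):
    child_types = [child["type"] for child in group["children"]]
    print(group["parent"]["text"], "->", child_types)
```

Grouping by parent keeps the title and the table as separate elements but lets them be treated as one unit, which is close to the single-table result the PDF produced.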