Closed: wpm closed this issue 10 months ago
@wpm can you post a snippet of the response containing the coordinates, i.e. what you're testing `response.json()[0]["coordinates"]` on?
The first thing that jumps out to me is that `response.json()[0]` is, I believe, an element, whereas `elements[0].metadata` is an `ElementMetadata` object (held in the `.metadata` field of an `Element` object).
@wpm when I run (a reasonable facsimile of) your code against the public API I get this for the first element (different doc of course):
{
  "type": "Title",
  "element_id": "3f20b5ce0bd588812210532be7bdb0c4",
  "metadata": {
    "coordinates": {
      "points": [
        [
          572.2222222222222,
          195.45391845703125
        ],
        [
          572.2222222222222,
          285.3270568847656
        ],
        [
          1081.5888888888887,
          285.3270568847656
        ],
        [
          1081.5888888888887,
          195.45391845703125
        ]
      ],
      "system": "PixelSpace",
      "layout_width": 1653,
      "layout_height": 2339
    },
    "filename": "file-sample_150kB.pdf",
    "filetype": "application/pdf",
    "page_number": 1
  },
  "text": "Lorem ipsum"
},
Which is what I would expect to see.
Note that `assert response.json()[0]["coordinates"]` fails with `KeyError: 'coordinates'`, as I would also expect.
I'm going to close this as could-not-reproduce, but if you think I've missed something feel free to reopen :)
The first assert should have been `assert response.json()[0]["metadata"]["coordinates"]`.
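For anyone landing here later, the difference between the two asserts can be seen without calling the API at all. The sketch below uses a trimmed copy of the element posted earlier in this thread as a plain dict; `response_json` is just a stand-in name for whatever `response.json()` returns, not something from the original code.

```python
# Trimmed copy of the first element from the API response posted earlier.
element = {
    "type": "Title",
    "element_id": "3f20b5ce0bd588812210532be7bdb0c4",
    "metadata": {
        "coordinates": {
            "points": [
                [572.2222222222222, 195.45391845703125],
                [572.2222222222222, 285.3270568847656],
                [1081.5888888888887, 285.3270568847656],
                [1081.5888888888887, 195.45391845703125],
            ],
            "system": "PixelSpace",
            "layout_width": 1653,
            "layout_height": 2339,
        },
        "page_number": 1,
    },
    "text": "Lorem ipsum",
}
response_json = [element]  # stand-in for response.json()

# Top-level lookup fails: "coordinates" is not a key of the element itself.
try:
    response_json[0]["coordinates"]
    top_level_error = None
except KeyError as exc:
    top_level_error = repr(exc)

# The coordinates live one level down, under "metadata".
coordinates = response_json[0]["metadata"]["coordinates"]

print(top_level_error)
print(coordinates["system"], len(coordinates["points"]))
```

So a test against the raw JSON needs to go through `["metadata"]` first, which mirrors `Element.metadata.coordinates` on the Python objects.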
I pulled a later Unstructured API image: `downloads.unstructured.io/unstructured-io/unstructured-api:0.0.57`, SHA `af94830c`. It has the following environment:
OS version: Linux-6.4.16-linuxkit-x86_64-with-glibc2.34
Python version: 3.10.13
unstructured version: 0.10.29
unstructured-inference version: 0.7.11
pytesseract version: 0.3.10
Torch version: 2.1.0
Detectron2 is not installed
Everything works as expected now. It looks like I was running an outdated version of Unstructured.
Ah, oh yes, I should have looked more closely at the versions you provided :) Anyway, glad you got it working @wpm :)
Describe the bug
The `unstructured.staging.base.elements_from_json` function does not copy coordinate information from JSON into `Element` objects.
To Reproduce
Expected behavior
I expect `Element.metadata.coordinates` to contain the coordinates from the JSON.
Environment Info
Additional context
I am running the Docker image `quay.io/unstructured-io/unstructured-api:latest`, SHA `55bc0476`.
The problem appears to be that `unstructured.staging.base.isd_to_elements` never reads the coordinates from the JSON.
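To make the suspected failure mode concrete, here is a toy sketch of what "never reads the coordinates" would look like. This is not the code of `isd_to_elements` or any other part of `unstructured`; `ToyElement`, `ToyMetadata`, `toy_from_dict_lossy` and `toy_from_dict` are hypothetical names made up purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyMetadata:
    # Hypothetical stand-in for a metadata object; not the library's class.
    filename: Optional[str] = None
    page_number: Optional[int] = None
    coordinates: Optional[dict] = None


@dataclass
class ToyElement:
    # Hypothetical stand-in for an element; not the library's class.
    text: str
    metadata: ToyMetadata = field(default_factory=ToyMetadata)


def toy_from_dict_lossy(item: dict) -> ToyElement:
    """Copies some metadata fields but never looks at "coordinates"."""
    meta = item.get("metadata", {})
    return ToyElement(
        text=item["text"],
        metadata=ToyMetadata(
            filename=meta.get("filename"),
            page_number=meta.get("page_number"),
        ),
    )


def toy_from_dict(item: dict) -> ToyElement:
    """Same conversion, but also carries the coordinates across."""
    meta = item.get("metadata", {})
    return ToyElement(
        text=item["text"],
        metadata=ToyMetadata(
            filename=meta.get("filename"),
            page_number=meta.get("page_number"),
            coordinates=meta.get("coordinates"),
        ),
    )


item = {
    "text": "Lorem ipsum",
    "metadata": {
        "coordinates": {"system": "PixelSpace", "points": [[572.2, 195.4]]},
        "filename": "file-sample_150kB.pdf",
        "page_number": 1,
    },
}

lossy = toy_from_dict_lossy(item)
fixed = toy_from_dict(item)
print(lossy.metadata.coordinates)  # coordinates were silently dropped
print(fixed.metadata.coordinates["system"])
```

The check I would expect to pass after a round trip through JSON is the second one: the coordinates that went in under `metadata` come back out on `metadata.coordinates`.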