Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/elements_from_json drops coordinates #2058

Closed wpm closed 10 months ago

wpm commented 10 months ago

Describe the bug

The unstructured.staging.base.elements_from_json function does not copy coordinate information from the JSON into the resulting Element objects.

To Reproduce

import requests
from unstructured.staging.base import elements_from_json

url = "http://localhost:8000/general/v0/general"

headers = {"accept": "application/json"}
data = {"strategy": "hi_res", "coordinates": True}

file_path = "my.pdf"

# Open the file in a context manager so it is closed even if the request fails.
with open(file_path, "rb") as f:
    response = requests.post(url, headers=headers, data=data, files={"files": f})

# The REST API returns coordinates...
assert response.json()[0]["coordinates"]
# ...but the coordinates are not copied into the Element objects.
assert elements_from_json(text=response.text)[0].metadata.coordinates is None

Expected behavior

I expect Element.metadata.coordinates to contain the coordinates from the JSON.

Environment Info

OS version:  Linux-6.4.16-linuxkit-x86_64-with-glibc2.17
Python version:  3.8.15
unstructured version:  0.6.6
unstructured-inference version:  0.4.4
pytesseract version:  0.3.10
Torch version:  2.0.1
Detectron2 version:  0.6

[notice] A new release of pip available: 22.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip

PaddleOCR is not installed
Traceback (most recent call last):
  File "collect_env.py", line 242, in <module>
    main()
  File "collect_env.py", line 224, in main
    libmagic_version = get_libmagic_version()
  File "collect_env.py", line 146, in get_libmagic_version
    result = subprocess.run(
  File "/usr/local/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/local/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/lib/python3.8/subprocess.py", line 1704, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)

Additional context

I am running the Docker image quay.io/unstructured-io/unstructured-api:latest (SHA 55bc0476).

The problem appears to be that unstructured.staging.base.isd_to_elements never reads the coordinates from the JSON.
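To illustrate the suspected failure mode, here is a minimal sketch. Everything in it is hypothetical: SketchMetadata, SketchElement, and both converter functions are stand-ins invented for illustration, not the library's code, and the sample assumes the coordinates are nested under each element's "metadata" key. A converter that never looks up "coordinates" silently yields None, while one that reads the key preserves it.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class SketchMetadata:
    # Hypothetical stand-in for ElementMetadata; not the library's class.
    filename: Optional[str] = None
    coordinates: Optional[Dict[str, Any]] = None


@dataclass
class SketchElement:
    # Hypothetical stand-in for an Element.
    text: str
    metadata: SketchMetadata


def convert_dropping_coordinates(isd: List[Dict[str, Any]]) -> List[SketchElement]:
    """Mimic the suspected bug: the "coordinates" key is never read."""
    elements = []
    for item in isd:
        meta = item.get("metadata", {})
        elements.append(
            SketchElement(
                text=item.get("text", ""),
                metadata=SketchMetadata(filename=meta.get("filename")),
            )
        )
    return elements


def convert_keeping_coordinates(isd: List[Dict[str, Any]]) -> List[SketchElement]:
    """Same conversion, but the "coordinates" key is copied across."""
    elements = []
    for item in isd:
        meta = item.get("metadata", {})
        elements.append(
            SketchElement(
                text=item.get("text", ""),
                metadata=SketchMetadata(
                    filename=meta.get("filename"),
                    coordinates=meta.get("coordinates"),
                ),
            )
        )
    return elements


# Assumed input shape: coordinates nested under "metadata".
sample = [
    {
        "text": "Lorem ipsum",
        "metadata": {
            "filename": "my.pdf",
            "coordinates": {
                "points": [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]],
                "system": "PixelSpace",
            },
        },
    }
]

dropped = convert_dropping_coordinates(sample)[0].metadata.coordinates
kept = convert_keeping_coordinates(sample)[0].metadata.coordinates
```

Neither converter raises an error, which is why a dropped field like this is easy to miss until something downstream needs it.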

scanny commented 10 months ago

@wpm can you post a snippet of the response containing the coordinates, i.e. the part you're testing response.json()[0]["coordinates"] on?

The first thing that jumps out at me is that response.json()[0] is, I believe, an element, whereas elements[0].metadata is an ElementMetadata object (held in the .metadata field of an Element object).

scanny commented 10 months ago

@wpm when I run (a reasonable facsimile of) your code against the public API I get this for the first element (different doc of course):

    {
        "type": "Title",
        "element_id": "3f20b5ce0bd588812210532be7bdb0c4",
        "metadata": {
            "coordinates": {
                "points": [
                    [
                        572.2222222222222,
                        195.45391845703125
                    ],
                    [
                        572.2222222222222,
                        285.3270568847656
                    ],
                    [
                        1081.5888888888887,
                        285.3270568847656
                    ],
                    [
                        1081.5888888888887,
                        195.45391845703125
                    ]
                ],
                "system": "PixelSpace",
                "layout_width": 1653,
                "layout_height": 2339
            },
            "filename": "file-sample_150kB.pdf",
            "filetype": "application/pdf",
            "page_number": 1
        },
        "text": "Lorem ipsum"
    },

Which is what I would expect to see.

Note that assert response.json()[0]["coordinates"] fails with KeyError: 'coordinates', as I would also expect.
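The difference between the two lookups can be checked offline with the standard library alone. The sketch below uses a trimmed copy of the element shown above. The get_coordinates helper is a hypothetical name introduced here for illustration, not part of unstructured.

```python
import json
from typing import Any, Dict, Optional

# Trimmed copy of the element above, wrapped in a list like the API response.
raw = """
[
  {
    "type": "Title",
    "element_id": "3f20b5ce0bd588812210532be7bdb0c4",
    "metadata": {
      "coordinates": {
        "points": [
          [572.2222222222222, 195.45391845703125],
          [572.2222222222222, 285.3270568847656],
          [1081.5888888888887, 285.3270568847656],
          [1081.5888888888887, 195.45391845703125]
        ],
        "system": "PixelSpace",
        "layout_width": 1653,
        "layout_height": 2339
      },
      "page_number": 1
    },
    "text": "Lorem ipsum"
  }
]
"""


def get_coordinates(element: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return the nested coordinates, or None if absent."""
    return element.get("metadata", {}).get("coordinates")


first = json.loads(raw)[0]

# The top-level lookup used in the original report raises KeyError.
try:
    first["coordinates"]
    top_level_present = True
except KeyError:
    top_level_present = False

# The nested lookup finds the coordinates.
coords = get_coordinates(first)
```

Using .get() at both levels means the helper returns None rather than raising for elements that carry no coordinates.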

I'm going to close this as could-not-reproduce, but if you think I've missed something feel free to reopen :)

wpm commented 10 months ago

The first assert should have been assert response.json()[0]["metadata"]["coordinates"].

I pulled a later Unstructured API image: downloads.unstructured.io/unstructured-io/unstructured-api:0.0.57 SHA af94830c. It has the following environment:

OS version:  Linux-6.4.16-linuxkit-x86_64-with-glibc2.34
Python version:  3.10.13
unstructured version:  0.10.29
unstructured-inference version:  0.7.11
pytesseract version:  0.3.10
Torch version:  2.1.0
Detectron2 is not installed

Everything works as expected now. It looks like I was running an outdated version of Unstructured.
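For anyone hitting the same symptom, a quick check of the installed version may save some time. This is a minimal sketch using only the standard library. parse_version and installed_version are hypothetical helper names, and 0.10.29 is simply the version reported as working in this thread, not a verified minimum.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse leading numeric components, e.g. "0.10.29" -> (0, 10, 29).

    Parsing stops at the first non-numeric component, so suffixes such as
    "dev1" are ignored. This is a simplification, not full version parsing.
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Version reported as working in this thread; not a verified minimum.
KNOWN_GOOD = "0.10.29"

found = installed_version("unstructured")
if found is None:
    print("unstructured is not installed in this environment")
elif parse_version(found) < parse_version(KNOWN_GOOD):
    print(f"unstructured {found} is older than {KNOWN_GOOD}; consider upgrading")
else:
    print(f"unstructured {found} is at least {KNOWN_GOOD}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.6.6" sorts after "0.10.29", which would hide exactly the kind of outdated install described above.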

scanny commented 10 months ago

Ah, oh yes, I should have looked more closely at the versions you provided :) Anyway, glad you got it working @wpm :)