Unstructured-IO / unstructured-api

Apache License 2.0

Parsing JSON document for chunks produces empty array #366

Closed: 9876691 closed this issue 4 months ago

9876691 commented 5 months ago

Describe the bug

Parsing this file produces no results: https://drive.google.com/drive/folders/1hHxQ5bkk8ozjxzd03JYUpI0YSLzdUIlt?usp=sharing

To Reproduce

docker run -it -p 8000:8000 downloads.unstructured.io/unstructured-io/unstructured-api:4ffd8bc
curl -X 'POST' \
'http://localhost:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@100_properties_no_duplicates.json' \
| jq -C .

Looks like unstructured returns an empty array for this.

Also fails with image 561b9d8

Environment:

awalker4 commented 5 months ago

Hi there, the issue here is that partitioning of arbitrary JSON files is not actually implemented yet (see this issue). When Unstructured receives JSON, it expects our internal structured format so that it can do additional processing. Other JSON files will typically return this error:

{
    "detail": "Json schema does not match the Unstructured schema"
}

The address list here happens to match our basic structure, so it accepts it, but it doesn't find any of the expected Element objects.
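
To make the three outcomes concrete, here is a rough sketch of the kind of check being described. It is not the actual Unstructured validation code: the function name `classify_json` and the `ELEMENT_KEYS` set are hypothetical, and the assumption that an element is a dict carrying at least `type` and `text` is taken only from the element output shown later in this thread.

```python
import json

# Assumption: an "element" is a dict with at least these keys, based on the
# element output shown in this thread. The real schema may require more.
ELEMENT_KEYS = {"type", "text"}


def classify_json(raw: str) -> str:
    """Hypothetical helper: guess how an uploaded JSON document would be treated."""
    data = json.loads(raw)

    # The "basic structure" is assumed here to be a top-level list of objects.
    if not (isinstance(data, list) and all(isinstance(item, dict) for item in data)):
        return "schema mismatch"

    # Keep only the objects that look like elements.
    elements = [item for item in data if ELEMENT_KEYS <= item.keys()]
    return "elements found" if elements else "accepted, but no elements"


address_list = '[{"ADDRESS": "201 MARINA EAST DRIVE SINGAPORE 029997", "BLK_NO": "201"}]'
element_list = '[{"type": "UncategorizedText", "text": "hello", "metadata": {}}]'

print(classify_json(address_list))
print(classify_json(element_list))
print(classify_json('{"a": 1}'))
```

Under these assumptions, a list of address records passes the shape check but yields no elements, which would be consistent with the empty array reported above.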

One workaround is to pass this file as a .txt instead. It's an ugly solution for now, and it doesn't understand structure, but it will give you a result:

curl 'localhost:8000/general/v0/general' \
--header 'Accept: application/json' \
--form 'files=@100_properties_no_duplicates.txt'
[
    {
        "type": "UncategorizedText",
        "element_id": "245843abef9e72e7efac30138a994bf6",
        "text": "[",
        "metadata": {
            "filename": "100_properties_no_duplicates.txt",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    {
        "type": "UncategorizedText",
        "element_id": "021fb596db81e6d02bf3d2586ee3981f",
        "text": "{",
        "metadata": {
            "filename": "100_properties_no_duplicates.txt",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    {
        "type": "UncategorizedText",
        "element_id": "e5e6aee083eb467899adf36056325649",
        "text": "\"ADDRESS\": \"201 MARINA EAST DRIVE SINGAPORE 029997\",",
        "metadata": {
            "filename": "100_properties_no_duplicates.txt",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    {
        "type": "UncategorizedText",
        "element_id": "a03412c72aad13b287e20a6767090ca8",
        "text": "\"BLK_NO\": \"201\",",
        "metadata": {
            "filename": "100_properties_no_duplicates.txt",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    {
        "type": "UncategorizedText",
        "element_id": "219275a589adaad6ea4e06ba1d5b5b24",
        "text": "\"BUILDING\": \"NIL\",",
        "metadata": {
            "filename": "100_properties_no_duplicates.txt",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    {
        "type": "UncategorizedText",
        "element_id": "6279819b52d4cf7472c2feb8b8370f23",
        "text": "\"LATITUDE\": \"1.2835489039816599\",",
        "metadata": {
            "filename": "100_properties_no_duplicates.txt",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },