Closed 9876691 closed 4 months ago
Hi there, the issue here is that partitioning of arbitrary JSON files is not actually implemented yet (see this issue). When Unstructured receives JSON, it expects our internal structured format so that it can do additional processing. Other JSON files typically return this error:
{
"detail": "Json schema does not match the Unstructured schema"
}
The address list here happens to match our basic structure, so it is accepted, but it contains none of the expected Element objects, which is why the result comes back empty.
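To make the three cases concrete (rejected with a schema error, accepted but empty, accepted with elements), here is a rough pre-check you could run locally before uploading. It is only a sketch, not the validation Unstructured actually performs: the function name `classify_json_payload` is hypothetical, and the assumption that "basic structure" means a list of objects carrying `type`, `element_id`, `text`, and `metadata` is taken from the element output quoted in this thread rather than from the library's schema.

```python
import json
from typing import Any

# Keys seen on each element in the API output quoted in this thread.
# Treating them as the marker of an "Element" is an assumption for illustration.
ELEMENT_KEYS = {"type", "element_id", "text", "metadata"}


def classify_json_payload(raw: str) -> str:
    """Roughly classify a JSON document before sending it to the API.

    Returns one of:
      "schema_mismatch" - not a list of objects, so likely rejected outright
      "no_elements"     - a list of objects, but none look like elements
      "elements"        - at least one object carries the element keys
    """
    data: Any = json.loads(raw)
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        return "schema_mismatch"
    if any(ELEMENT_KEYS.issubset(item) for item in data):
        return "elements"
    return "no_elements"
```

Under these assumptions, an address list like the one in this issue would land in the `"no_elements"` bucket, which matches the empty array reported.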
One workaround is to pass this file as a .txt instead. It's an ugly solution for now, and it doesn't understand structure, but it will give you a result:
curl 'localhost:8000/general/v0/general' \
--header 'Accept: application/json' \
--form 'files=@100_properties_no_duplicates.txt'
[
{
"type": "UncategorizedText",
"element_id": "245843abef9e72e7efac30138a994bf6",
"text": "[",
"metadata": {
"filename": "100_properties_no_duplicates.txt",
"languages": [
"eng"
],
"filetype": "text/plain"
}
},
{
"type": "UncategorizedText",
"element_id": "021fb596db81e6d02bf3d2586ee3981f",
"text": "{",
"metadata": {
"filename": "100_properties_no_duplicates.txt",
"languages": [
"eng"
],
"filetype": "text/plain"
}
},
{
"type": "UncategorizedText",
"element_id": "e5e6aee083eb467899adf36056325649",
"text": "\"ADDRESS\": \"201 MARINA EAST DRIVE SINGAPORE 029997\",",
"metadata": {
"filename": "100_properties_no_duplicates.txt",
"languages": [
"eng"
],
"filetype": "text/plain"
}
},
{
"type": "UncategorizedText",
"element_id": "a03412c72aad13b287e20a6767090ca8",
"text": "\"BLK_NO\": \"201\",",
"metadata": {
"filename": "100_properties_no_duplicates.txt",
"languages": [
"eng"
],
"filetype": "text/plain"
}
},
{
"type": "UncategorizedText",
"element_id": "219275a589adaad6ea4e06ba1d5b5b24",
"text": "\"BUILDING\": \"NIL\",",
"metadata": {
"filename": "100_properties_no_duplicates.txt",
"languages": [
"eng"
],
"filetype": "text/plain"
}
},
{
"type": "UncategorizedText",
"element_id": "6279819b52d4cf7472c2feb8b8370f23",
"text": "\"LATITUDE\": \"1.2835489039816599\",",
"metadata": {
"filename": "100_properties_no_duplicates.txt",
"languages": [
"eng"
],
"filetype": "text/plain"
}
},
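If you want to script the .txt workaround rather than rename files by hand, a minimal sketch using only the Python standard library might look like this. The helper name `copy_as_txt` is hypothetical; it only creates the renamed copy, and the upload itself would still go through the curl command shown earlier.

```python
import shutil
from pathlib import Path


def copy_as_txt(json_path: str) -> Path:
    """Copy a JSON file next to itself with a .txt suffix and return the new path."""
    src = Path(json_path)
    dst = src.with_suffix(".txt")
    shutil.copyfile(src, dst)  # contents are unchanged; only the extension differs
    return dst
```

For example, `copy_as_txt("100_properties_no_duplicates.json")` would produce `100_properties_no_duplicates.txt` in the same directory, assuming the original file carries a `.json` extension.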
Describe the bug
Parsing this file produces no results. https://drive.google.com/drive/folders/1hHxQ5bkk8ozjxzd03JYUpI0YSLzdUIlt?usp=sharing
To Reproduce
Looks like unstructured returns an empty array for this.
Also fails with image 561b9d8
Environment: