Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.44k stars · 692 forks

feat/ enable partition_json to partition any json #1038

Open Coniferish opened 1 year ago

Coniferish commented 1 year ago

Currently partition_json is intended only for deserializing the unstructured JSON outputs/elements and is not included as a file format we accept for partitioning (see here).

The goal of this issue is to make partition_json work for any JSON file (probably similar to how partition_xml works).
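For concreteness, one way a schema-free JSON partitioner could behave is to flatten arbitrary JSON into (path, text) pairs, one per string leaf. This is only a hypothetical sketch (iter_json_text is not an Unstructured function), but it illustrates the kind of output such a partitioner might produce:

```python
import json
from typing import Any, Iterator


def iter_json_text(node: Any, path: str = "$") -> Iterator[tuple[str, str]]:
    """Yield a (json-path, text) pair for every string leaf in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_json_text(value, f"{path}.{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from iter_json_text(value, f"{path}[{i}]")
    elif isinstance(node, str):
        yield path, node


doc = json.loads('{"title": "Act I", "sections": [{"text": "Scene 1"}]}')
elements = list(iter_json_text(doc))
# elements == [("$.title", "Act I"), ("$.sections[0].text", "Scene 1")]
```

Each pair could then be mapped onto an element type (Title, NarrativeText, ...) by whatever heuristics the real partitioner settles on.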

lironrisk commented 8 months ago

any news?

Coniferish commented 8 months ago

> any news?

Hey @lironrisk ! Apologies for the delay. We've been really busy with the launch of the paid API. @scanny has some preliminary code for this, but it is lower priority than the chunking improvements we have coming up. It has been added to our roadmap for Q2.

orlandounstructured commented 7 months ago

Over 180 days old but keeping open due to addition to roadmap, as mentioned by @Coniferish

adrianruchti commented 1 month ago

Hello Unstructured team. I would be interested in the generic partition_json as well.

apmavrin commented 1 month ago

So far, we have implemented a helper function that parses the JSON, wraps it in a list, and feeds it to Unstructured.IO, so it can be parsed with the current version.

The utility function:

from typing import Dict, Optional

from google.protobuf.struct_pb2 import Struct


def struct_to_dict(struct: Struct, out: Optional[Dict] = None) -> Dict:
    """Recursively convert a protobuf Struct into a plain dict."""
    if out is None:
        out = {}
    for key, value in struct.items():
        if isinstance(value, Struct):
            # Nested structs get their own fresh dict.
            out[key] = struct_to_dict(value)
        else:
            out[key] = value
    return out

The main function, where we receive the JSON and feed it to Unstructured:

import json

# TL;DR: Unstructured can't handle a single JSON object; it needs a list of JSON objects
parsed_list = [struct_to_dict(json_content_from_api)]
content_for_unstructured = json.dumps(parsed_list)
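For readers without protobuf in the loop, the same list-wrapping step works on a plain dict; json_content_from_api below is a made-up stand-in for the API payload, not a real value from either library:

```python
import json

# Hypothetical stand-in for the payload returned by an API; in the
# pipeline above this would be a protobuf Struct run through struct_to_dict.
json_content_from_api = {"title": "Example", "body": "Some narrative text."}

# Wrap the single object in a one-element list before serializing,
# since (per the comment above) a bare object is not accepted.
parsed_list = [json_content_from_api]
content_for_unstructured = json.dumps(parsed_list)
```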
adrianruchti commented 1 month ago

Thank you. And can content_for_unstructured then be parsed by the unstructured partition function in the unstructured ingest pipeline?

runner = LocalRunner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="local-ingest-output",
        num_processes=2,
    ),
    read_config=ReadConfig(),
    partition_config=PartitionConfig(
        partition_by_api=False,
    ),
    connector_config=SimpleLocalConfig(
        input_path="data",
        recursive=True,
    ),
    chunking_config=ChunkingConfig(
        chunk_elements=True,
    ),
    # writer=writer,
    # writer_kwargs={},
)

# Run the runner
runner.run()
scanny commented 1 month ago

@apmavrin @adrianruchti can you say a little more about your use case for this? It's not clear to me yet how a useful JSON partitioner would behave.

In particular:

Any help you can give characterizing the use cases will help in developing a spec for something like this. Grateful for whatever help you can provide :)

adrianruchti commented 1 month ago

@scanny the use case: 60 MB law documents in XML format. I tried the unstructured XML parser and Azure Document Intelligence. The extraction and recognition of the different element types was not satisfying with the XML parser, so I parsed the XML with lxml etree and saved it as JSON. This worked well with Azure Document Intelligence, which can partition and chunk JSON. It would be nice to get the same from Unstructured, as that would be a cheaper solution.

By the way: Unstructured converts documents to JSON after partitioning, so why not accept this format in your partitioner?

Do you have any platform where I could share the big XML and the converted JSON file with you in private, Steve? (If you are interested.)
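The XML-to-JSON conversion step described above can be sketched with the stdlib xml.etree for illustration (adrianruchti used lxml; element_to_dict is a hypothetical helper, not part of either library):

```python
import json
import xml.etree.ElementTree as ET
from typing import Any


def element_to_dict(el: ET.Element) -> dict[str, Any]:
    """Recursively convert an XML element into a plain, JSON-serializable dict."""
    node: dict[str, Any] = {"tag": el.tag, "attrib": dict(el.attrib)}
    text = (el.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in el]
    if children:
        node["children"] = children
    return node


root = ET.fromstring("<law id='1'><title>Sample Act</title><p>Section text.</p></law>")
as_json = json.dumps(element_to_dict(root))
```

The resulting JSON preserves tag names and attributes, which is roughly the structure a JSON partitioner would then need to make sense of.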

scanny commented 1 month ago

Hi @adrianruchti you can DM me on the Unstructured Slack channel or reach me at the email on my GitHub profile. No need for a huge document just yet but a modest sized one might be helpful.

The big question for me is schema information: should an XML partitioner take some schema descriptors to determine what to partition and how, or should it do the best it can without any schema information (like partitioning all the text it finds), or possibly both?
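A schema-free pass of that sort might look like the following (a sketch using the stdlib xml.etree, not Unstructured's implementation): keep every non-empty text node, tagged with the element it came from, and let downstream heuristics decide what is metadata and what is narrative text.

```python
import xml.etree.ElementTree as ET

XML = """<doc>
  <meta><author>Jane</author></meta>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</doc>"""

root = ET.fromstring(XML)
# Best-effort, schema-free pass: collect every non-empty text node
# along with the tag of the element that contains it.
texts = [
    (el.tag, el.text.strip())
    for el in root.iter()
    if el.text and el.text.strip()
]
# texts == [("author", "Jane"), ("p", "First paragraph."), ("p", "Second paragraph.")]
```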

Just glancing at some LegalXML documents online, it looks like there is a lot of metadata and only a little narrative text. So if you can help me understand what a successful outcome in your case would be maybe that will help us noodle this a bit further.

Also if you can give a sense of the diversity of XML-vocabularies/schemas you need to partition, that would help in reasoning about it.

One approach that occurs to me is applying a legal-document-type-specific XSLT transform to produce a "standardized" XML document that partitions in a well-known way, including adding whatever "extra" metadata you might want on the partitioned elements. Not sure how that fits in, but thought I'd mention it in the interest of brainstorming :)