Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/custom-metadata #3079

streamnsight opened this issue 6 months ago

streamnsight commented 6 months ago

A question, and a feature request if there is no easy answer. I want to add custom metadata to my documents, for example a uuid for each doc from some custom mapping function.

Not sure what the "right" solution is, but it seems like having a "metadata phase" in the pipeline, where we get access to the doc and some mapping function to generate and inject metadata, would do the trick.

Describe alternatives you've considered
I considered pulling from a data source like OpenSearch (I see I can get metadata and the doc id from OpenSearch by default), but I have binary full docs to be parsed, and the OpenSearch connector seems to expect only text documents to partition/chunk. I would want the OpenSearch connector to let me retrieve the binary doc and then parse it according to its type (docx, pptx, pdf, html...).

Maybe there are solutions I just haven't found in the docs.
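
One workaround at the library level, rather than at the connector level: pull the encoded document out of OpenSearch yourself and hand the raw bytes to partition(), which dispatches on file type. A minimal sketch, assuming a hypothetical full-docs index whose documents carry a base64-encoded content field plus a filename (the index layout and field names are illustrative, not part of unstructured):

    import base64
    import io

    from opensearchpy import OpenSearch  # standard OpenSearch Python client
    from unstructured.partition.auto import partition

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    # Fetch one stored document; "full-docs", "content", and "filename" are assumed names.
    doc = client.get(index="full-docs", id="some-doc-uuid")["_source"]
    raw_bytes = base64.b64decode(doc["content"])

    # partition() detects the file type (docx, pptx, pdf, html, ...) from the bytes;
    # metadata_filename aids detection and is carried into element metadata.
    elements = partition(file=io.BytesIO(raw_bytes), metadata_filename=doc["filename"])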

scanny commented 6 months ago

How are you using unstructured? Like directly via calls to partition_{doctype}() in Python? Or maybe using the ingest CLI or the API?

streamnsight commented 6 months ago

@scanny I am using the Python methods, with a runner, partitioner, etc., to process batches of docs. I have docx, pptx, html, and pdf docs, which I want to parse, OCR when there are images, and chunk, but I also need the full docs and/or a page-by-page index.

For context, I'm doing RAG over the docs with hybrid search. In hybrid search we want to do semantic search on embedded chunks, but it is not effective to do lexical search on those same chunks: lexical search works best on larger documents, because TF-IDF can boost the weight of uncommon terms, which it can't do well on small chunks. So, in the end, I want/need multiple indices for each doc:

scanny commented 6 months ago

@streamnsight it sounds like you would like access to interim pipeline work-product, for example by registering an event-handler function like after_partitioning(elements: list[Element]) -> list[Element] that the pipeline calls once for each document after it has been partitioned, so that you could enhance the metadata of each element before it proceeds to the next step.

Nothing special about the partitioning step, of course; there could be "after" events and maybe "before" events for each step in the pipeline, at document granularity I suppose.

Is that the sort of thing you're proposing?
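
For concreteness, a user-supplied handler under that scheme might look something like the sketch below. The hook mechanism is hypothetical and does not exist in unstructured today, and the ad-hoc metadata field name is illustrative.

    import uuid

    from unstructured.documents.elements import Element


    def after_partitioning(elements: list[Element]) -> list[Element]:
        """Hypothetical hook: called once per document, right after partitioning."""
        doc_id = str(uuid.uuid4())  # one custom id per source document
        for element in elements:
            # Stamp the id onto every element so it travels through chunking/embedding.
            element.metadata.custom_doc_id = doc_id  # illustrative ad-hoc metadata field
        return elements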

scanny commented 6 months ago

@streamnsight can you provide a sketch of the code you're using? Not all the details, just the top level; 10-15 lines at most, with ellipses for lower-level details.

I have another idea that might work with the current capability: break the pipeline into two sections, where you partition the batch to disk, make changes to that copy, then feed the modified documents into the "tail-end" of the pipeline.
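
A rough sketch of that two-section idea, assuming the first section has already written partitioned elements as JSON files to disk (e.g. from a partition-only run): edit the serialized metadata in place, then let the tail-end run pick the files up. The doc_uuid key and the output directory name are illustrative, and the downstream steps would need to carry the extra key through.

    import json
    import uuid
    from pathlib import Path

    PARTITIONED_DIR = Path("partitioned-output")  # wherever the first section wrote its JSON

    for json_path in PARTITIONED_DIR.glob("*.json"):
        elements = json.loads(json_path.read_text())  # a list of serialized element dicts
        doc_id = str(uuid.uuid4())  # one custom id per source document
        for element in elements:
            element.setdefault("metadata", {})["doc_uuid"] = doc_id  # illustrative key
        json_path.write_text(json.dumps(elements, indent=2))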

streamnsight commented 6 months ago

@scanny It's really very simple:

        runner = LocalRunner(  # opening line assumed; the original snippet starts mid-call
            processor_config=ProcessorConfig(...),
            connector_config=SimpleLocalConfig(
                input_path=folder,
                recursive=True,
                file_glob=['*.pdf', '*.docx', '*.pptx']
            ),
            read_config=ReadConfig(),
            partition_config=PartitionConfig(...),
            chunking_config=ChunkingConfig(...),
            embedding_config=EmbeddingConfig(...),
            writer=OpenSearchWriter(...),
            writer_kwargs={},
        )
        runner.run()  # assumed; the run call is not shown in the original snippet

This is for the semantic-search chunks. For the other indices I just load and dump into OpenSearch directly, with an extra step using PDFReader to extract the pages / page text.
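
For that page-level step, something like the following works outside unstructured entirely. A minimal sketch with pypdf and opensearch-py; the 'pages' index name and field layout are assumptions based on the description above.

    import uuid
    from pathlib import Path

    from opensearchpy import OpenSearch
    from pypdf import PdfReader

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
    pdf_path = Path("docs/report.pdf")  # illustrative path
    doc_id = str(uuid.uuid4())  # shared id tying pages back to the full doc

    reader = PdfReader(str(pdf_path))
    for page_number, page in enumerate(reader.pages, start=1):
        client.index(
            index="pages",  # assumed index name
            body={
                "doc_uuid": doc_id,
                "filename": pdf_path.name,
                "page_number": page_number,
                "text": page.extract_text() or "",  # field used for lexical search
            },
        )

(If the binary page itself needs to be stored, a base64-encoded single-page PDF could be added to the body using pypdf's PdfWriter.)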

Ideally, I'd want a system that can do:

Read from file ─┬─> Parse -> chunk -> embed -> Write to 'chunks' (for semantic search)
                ├─> Parse -> split on page -> extract page text -> Write binary encoded page + text / meta to 'pages' (for lexical search)
                └─> binary encode -> Write to index 'full-docs'

Or alternatively, I could do:

Read from file -> binary encode -> Write to OpenSearch index 'full-docs' (with meta uuid ...)

then

Read from 'full-docs' -> get object + meta -> parse binary encoded object -> chunk -> embed -> write to 'chunks'
Read from 'full-docs' -> get object + meta -> parse binary encoded object -> split on page -> write to 'pages'

so that each pipeline is linear (as I understand it, the unstructured runners do not run a DAG, just a linear sequence of stages).

Anyway, from what I understand, for this kind of flexibility I should just import the raw methods and build my own DAG / pipeline.
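
Going the raw-methods route, the 'chunks' branch is not much code. A minimal sketch, leaving the embedding step as a placeholder (embed_texts is not an unstructured function) and assuming a 'chunks' index already exists with a suitable mapping:

    import uuid
    from pathlib import Path

    from opensearchpy import OpenSearch
    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.auto import partition

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


    def embed_texts(texts: list[str]) -> list[list[float]]:
        """Placeholder: swap in whatever embedding model/service you actually use."""
        return [[0.0] * 384 for _ in texts]  # dummy vectors; dimension is illustrative


    for path in Path("docs").glob("*"):
        if path.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html"}:
            continue
        doc_id = str(uuid.uuid4())  # custom per-document metadata
        elements = partition(filename=str(path))  # dispatches on file type
        chunks = chunk_by_title(elements)  # pass max_characters=... to tune chunk size
        vectors = embed_texts([chunk.text for chunk in chunks])
        for chunk, vector in zip(chunks, vectors):
            client.index(
                index="chunks",  # assumed index name
                body={
                    "doc_uuid": doc_id,
                    "filename": path.name,
                    "text": chunk.text,
                    "embedding": vector,  # requires a knn_vector field mapping
                    "metadata": chunk.metadata.to_dict(),
                },
            )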

scanny commented 6 months ago

I think you'll need a certain amount of your own "pipeline" code; it's just a question of how much leverage you can get from what's already there.

I'm not an expert on the ingest code, but from what I can tell there are 3 + N steps: roughly a source (read) step, a partition step, 0..N reformatter steps, and a destination (write) step.

So the flexibility afforded is in the source and destination chosen and the 0..N reformatter steps (chunking, embedding, etc.) specified, but also in the possibility of using "null" steps.

So by combining "partial" runs or creating custom reformatter steps, perhaps you can get the results you're after.