Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/custom-metadata #3079

streamnsight opened this issue 2 months ago

streamnsight commented 2 months ago

A question, and a feature request if there is no easy answer: I want to add custom metadata to my documents, for example a UUID for each doc generated by some custom mapping function.

Not sure what the 'right solution' is, but it seems like a 'metadata phase' in the pipeline, where we get access to the doc and some mapping function to generate and inject metadata, would do the trick.

Describe alternatives you've considered

I considered pulling from a data source like OpenSearch (I see I can get meta and doc id from OpenSearch by default), but I have binary full docs to be parsed, while the OpenSearch connector seems to expect only text documents to partition/chunk. I would want the OpenSearch connector to let me retrieve the binary doc and then parse it according to its type (docx, pptx, pdf, html...).

Maybe there are existing solutions I just haven't found in the docs.
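For illustration, this is roughly what I'd like to be able to express when calling partition() directly (the mapping function, filename, and doc_uuid field are just placeholders; I'm assuming element metadata tolerates an ad-hoc field like this, otherwise the id would have to travel alongside the serialized elements):

    import uuid

    from unstructured.partition.auto import partition

    def my_doc_uuid(filename: str) -> str:
        # placeholder mapping function: derive a stable UUID from the filename
        return str(uuid.uuid5(uuid.NAMESPACE_URL, filename))

    elements = partition(filename="report.docx")
    doc_uuid = my_doc_uuid("report.docx")
    for element in elements:
        # inject the custom metadata before chunking / embedding / writing
        element.metadata.doc_uuid = doc_uuid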

scanny commented 2 months ago

How are you using unstructured? Like directly via calls to partition_{doctype}() in Python? Or maybe using the ingest CLI or the API?

streamnsight commented 2 months ago

@scanny I am using the Python methods, with a runner, partitioner, etc., to process batches of docs. I have docx, pptx, html and pdf docs, which I want to parse, OCR when they contain images, and chunk, but I also need the full docs and a page-by-page index.

For context, I'm doing RAG over the docs with hybrid search. In hybrid search we want to do semantic search on embedded chunks, but it is not effective to do lexical search on those same chunks, because lexical search gives the best results on larger documents (TF-IDF can boost the weight of uncommon words, which it can't do well on small chunks). So, in the end, I want/need multiple indices for each doc.

scanny commented 2 months ago

@streamnsight it sounds like you would like access to interim pipeline work-product, like being able to register an event handler, say an after_partitioning(elements: list[Element]) -> list[Element] function, that gets called by the pipeline once for each document after it has been partitioned, such that you could enhance the metadata for each element before it proceeds to the next step. Does that sound about right?

Nothing special about the partitioning step, of course; there could be "after" events, and maybe "before" events, for each step in the pipeline, with document granularity, I suppose.

Is that the sort of thing you're proposing?
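Purely as an illustration of the proposal (none of this API exists today; the names are made up):

    from unstructured.documents.elements import Element

    def after_partitioning(elements: list[Element]) -> list[Element]:
        # hypothetical hook: called once per document, right after partitioning
        for element in elements:
            # e.g. inject a custom id computed by your own mapping function
            element.metadata.doc_uuid = "..."
        return elements

    # pipeline.register("after_partitioning", after_partitioning)  # hypothetical registration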

scanny commented 2 months ago

@streamnsight can you provide a sketch of the code you're using? Not all the details, just the top level: 10-15 lines at most, with ellipses for the lower-level details.

I have another idea that might work with the current capability: basically breaking the pipeline into two sections, where you partition the batch to disk, make changes to that copy, then feed it into the "tail-end" pipeline.
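For example, sketching the middle section with plain JSON files on disk (paths and field names are illustrative):

    import json
    import pathlib
    import uuid

    # 1) a "head-end" run partitions the batch and writes element JSON to disk
    work_dir = pathlib.Path("partitioned-output")  # illustrative location

    # 2) enrich that copy with custom metadata
    for json_path in work_dir.glob("*.json"):
        elements = json.loads(json_path.read_text())
        doc_uuid = str(uuid.uuid5(uuid.NAMESPACE_URL, json_path.name))
        for element in elements:
            element.setdefault("metadata", {})["doc_uuid"] = doc_uuid
        json_path.write_text(json.dumps(elements, indent=2))

    # 3) feed work_dir into the "tail-end" pipeline (chunk -> embed -> write)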

streamnsight commented 2 months ago

@scanny It's really very simple:

    runner = LocalRunner(
        processor_config=ProcessorConfig(...),
        connector_config=SimpleLocalConfig(
            input_path=folder,
            recursive=True,
            file_glob=['*.pdf', '*.docx', '*.pptx'],
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(...),
        chunking_config=ChunkingConfig(...),
        embedding_config=EmbeddingConfig(...),
        writer=OpenSearchWriter(...),
        writer_kwargs={},
    )

This is for the semantic-search chunks. For the other indices I actually just load and dump into OpenSearch, with an extra step using PDFReader to extract the pages / page text.
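That page step is roughly this, here sketched with pypdf's PdfReader (the field names are just what I happen to dump into the index):

    from pypdf import PdfReader

    reader = PdfReader("report.pdf")
    pages = [
        {
            "doc": "report.pdf",
            "page_number": i + 1,
            "text": page.extract_text() or "",
        }
        for i, page in enumerate(reader.pages)
    ]
    # 'pages' then gets bulk-indexed into the OpenSearch 'pages' index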

Ideally, I'd want a system that can do:

Read from file ─┬─> Parse -> chunk -> embed -> Write to 'chunks' (for semantic search)
                ├─> Parse -> split on page -> extract page text -> Write binary encoded page + text / meta to 'pages' (for lexical search)
                └─> binary encode -> Write to index 'full-docs'

or alternatively I could do

Read from file -> binary encode -> Write to OpenSearch index 'full-docs' (with meta uuid ...)

then

Read from 'full-docs' -> get object + meta -> parse binary encoded object -> chunk -> embed -> write to 'chunks'
Read from 'full-docs' -> get object + meta -> parse binary encoded object -> split on page -> write to 'pages'

so each pipeline is linear (as I understand it, the unstructured runners do not run a DAG but a linear set of stages).
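The first stage would be something like this sketch, using the opensearch-py client directly (hosts, index and field names are illustrative):

    import base64
    import pathlib
    import uuid

    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    for path in pathlib.Path("docs").glob("*"):
        if not path.is_file():
            continue
        doc_uuid = str(uuid.uuid5(uuid.NAMESPACE_URL, path.name))
        client.index(
            index="full-docs",
            id=doc_uuid,
            body={
                "filename": path.name,
                "data": base64.b64encode(path.read_bytes()).decode(),
            },
        )
    # the 'chunks' and 'pages' pipelines would then read the binary doc + meta
    # back from 'full-docs' and parse it per its type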

Anyway, from what I understand, for this kind of flexibility I should just import the raw methods and build my own DAG / pipeline.
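i.e. something like this sketch built from the raw methods (embedding and the OpenSearch writes are elided):

    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.auto import partition

    def process(filename: str):
        elements = partition(filename=filename)  # parse docx / pptx / pdf / html
        # inject custom metadata here, e.g. a doc uuid
        chunks = chunk_by_title(elements)  # chunk for semantic search
        # ... embed the chunks and write them to 'chunks'; build the 'pages'
        # and 'full-docs' indices in separate passes
        return chunks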

scanny commented 2 months ago

I think you'll need a certain amount of your own "pipeline" code; it's just a question of how much leverage you can get from what's already there.

I'm not an expert on the ingest code, but from what I can tell there are 3 + N steps: a source (reader) step, a partitioning step, 0..N "reformatter" steps (chunking, embedding, etc.), and a destination (writer) step.

So the flexibility afforded is in the source and destination chosen and in the 0..N reformatter steps specified (which include chunking, embedding, etc.), but also in the possibility of using "null" steps.

So by combining "partial" runs or creating custom reformatter steps perhaps you can get the results you're after.