deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Use of meta data of documents inside components #8483

Open rmrbytes opened 1 month ago

rmrbytes commented 1 month ago

Problem Description
When using a set of documents with a component like DocumentSplitter, especially in a pipeline, the current behaviour is that the same component parameters (split_by, split_length, etc.) are applied to all documents. That is not always desirable, and it is not what my use case needs.

Suggested Solution
The suggestion is to use the meta properties of a Document as a way for the developer to pass dynamic, per-document parameters. If the meta data contains a key with the same name as a component parameter (e.g. "split_by"), then that value would be used for that document. Since components already read the "content" field of each document during processing, this could be extended to read the meta fields when they exist.
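To make it concrete, a sketch of what the proposed usage could look like (this is the requested behaviour, not current behaviour; the meta keys simply mirror the DocumentSplitter init parameters):

    from haystack import Document

    # each document carries its own splitting parameters in meta
    docs = [
        Document(content="...", meta={"split_by": "passage", "split_length": 1}),
        Document(content="...", meta={"split_by": "word", "split_length": 200, "split_overlap": 20}),
    ]
    # proposed: DocumentSplitter.run(documents=docs) would use each document's
    # meta values where present, falling back to the component's own parameters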

Current Alternative Solution
My use case is a typical RAG pipeline, but because each file may need to be split with a different strategy, I am constrained to treat each document and its pre- and post-processing as its own batch and loop through the documents. So instead of a batch of documents in one pipeline, I end up with a batch of pipelines with one document each.

Additional context
I was told by some data scientists that choosing a splitting strategy based on the contents of the document is, in their opinion, standard practice.

Thanks.

julian-risch commented 1 week ago

@rmrbytes I would suggest using a conditional router component that routes each Document to different pre- and post-processing based on its content. An alternative could be a custom splitter component that decides on a splitting strategy based on the document's content.
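A rough sketch of the custom splitter idea, assuming a simple length heuristic stands in for whatever content-based rule applies (the threshold and strategies below are only illustrative):

    from typing import List
    from haystack import Document, component
    from haystack.components.preprocessors import DocumentSplitter

    @component
    class ContentAwareSplitter:
        """Sketch: picks a splitting strategy per document from its content."""

        @component.output_types(documents=List[Document])
        def run(self, documents: List[Document]):
            split_docs = []
            for doc in documents:
                # illustrative heuristic: long documents are split into passages,
                # shorter ones into fixed-size word windows
                if len(doc.content or "") > 5000:
                    splitter = DocumentSplitter(split_by="passage", split_length=1)
                else:
                    splitter = DocumentSplitter(split_by="word", split_length=200, split_overlap=20)
                split_docs.extend(splitter.run(documents=[doc])["documents"])
            return {"documents": split_docs}

Such a component can then be placed in the pipeline like any other splitter, keeping the documents together as one batch.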

rmrbytes commented 1 week ago

@julian-risch: Thanks for your suggestions. When I mentioned that the splitting depends on the document content, I did not mean that we apply logic to determine it automatically. I meant that the user would use that know-how to set the split type per document of a RAG pipeline.

Hence any solution like the one I am currently using involves looping through the data set, where one element is split_by and the others are the file name and so on. Something like below.

    from haystack.components.preprocessors import DocumentSplitter

    split_docs = []
    for file in files:
        ...
        # define the splitter based on the document's meta
        document_splitter = DocumentSplitter(
            split_by=file['meta']['split_by'],
            split_length=file['meta']['split_length'],
            split_overlap=file['meta']['split_overlap'],
            split_threshold=file['meta']['split_threshold'],
        )
        # split the cleaned document (cleaner_res comes from the earlier cleaner step)
        splitter_res = document_splitter.run(documents=[cleaner_res['documents'][0]])
        # add the splits to the overall list
        split_docs.extend(splitter_res['documents'])

... continue with the pipeline processing

I wanted to avoid such looping. The purpose of raising it here is for the team to decide whether the suggestion has value and, if so, to implement it.
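For reference, the loop could also be wrapped in a custom component so that the outer pipeline stays a single batch (it still loops internally). A sketch, assuming the splitting parameters are stored in each document's meta with the component defaults as fallback:

    from typing import List
    from haystack import Document, component
    from haystack.components.preprocessors import DocumentSplitter

    @component
    class MetaDrivenSplitter:
        """Sketch: reads per-document splitting parameters from Document.meta."""

        @component.output_types(documents=List[Document])
        def run(self, documents: List[Document]):
            split_docs = []
            for doc in documents:
                splitter = DocumentSplitter(
                    split_by=doc.meta.get("split_by", "word"),
                    split_length=doc.meta.get("split_length", 200),
                    split_overlap=doc.meta.get("split_overlap", 0),
                    split_threshold=doc.meta.get("split_threshold", 0),
                )
                split_docs.extend(splitter.run(documents=[doc])["documents"])
            return {"documents": split_docs}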