PineconeDocumentStore raises error to the metadata produced by DocumentSplitter

bilgeyucel commented 2 months ago

Describe the bug

PineconeDocumentStore raises an error when I try to index a document that was split by DocumentSplitter. Error message 👇

PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Tue, 23 Jul 2024 12:46:03 GMT', 'Content-Type': 'application/json', 'Content-Length': '160', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '903', 'x-pinecone-request-id': '2298458388900737762', 'x-envoy-upstream-service-time': '37', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Metadata value must be a string, number, boolean or list of strings, got '[{\"doc_id\":\"22e0...' for field '_split_overlap'","details":[]}

The Document object that raises the error is below. "_split_overlap" seems to be a list of dicts, which is not one of the value types listed in the error message.

Document(id=37fa03ca409f457046696a3bec987d5cb627f655cbcf0c019f7334bc170da4b8, content: 'Vegan Persimmon Flan

Recipe  by Tilde Thurium

This makes 2 servings. Why did I write a recipe that...', meta: {'file_path': '/content/recipe_files/vegan_flan_recipe.md', 'source_id': 'a01a0ae2f396930e9cd3475986ae716cb26c554f6b49d4c61dfeb473ddeb7ced', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0, '_split_overlap': [{'doc_id': '0520d3c17150c5fd057a19bdc796e9f9c3a632f1d9acf730154d888ee3fc86be', 'range': (0, 305)}]})
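The type rule quoted in the error body (string, number, boolean or list of strings) can be checked locally before writing. The snippet below is a minimal sketch of such a check, not part of Haystack or the Pinecone client: `is_supported_value` and `find_unsupported_fields` are hypothetical helper names, and the `doc_id` is a shortened placeholder.

```python
from typing import Any, Dict, List


def is_supported_value(value: Any) -> bool:
    """Return True if value is a string, number, boolean or list of strings."""
    if isinstance(value, (str, int, float, bool)):
        return True
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def find_unsupported_fields(meta: Dict[str, Any]) -> List[str]:
    """Return the metadata keys whose values fall outside the accepted types."""
    return [key for key, value in meta.items() if not is_supported_value(value)]


# Metadata shaped like the Document above (doc_id shortened, placeholder value).
meta = {
    "file_path": "/content/recipe_files/vegan_flan_recipe.md",
    "page_number": 1,
    "split_id": 0,
    "split_idx_start": 0,
    "_split_overlap": [{"doc_id": "0520d3c1", "range": (0, 305)}],
}

print(find_unsupported_fields(meta))  # prints ['_split_overlap']
```

Only `_split_overlap` is flagged, because its list items are dicts and not strings.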

To Reproduce

import os

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore

# The API key is read from the environment when the store is created.
os.environ["PINECONE_API_KEY"] = "PINECONE-KEY"

document_store = PineconeDocumentStore(
    index="<ENTER_PINECONE_INDEX_NAME>",
    namespace="<ENTER_PINECONE-PROJECT-NAME>",
    dimension=1536,
    spec={"serverless": {"region": "us-east-1", "cloud": "aws"}},
)

source_docs = [Document(content="""
Vegan Persimmon Flan
Recipe by Tilde Thurium
This makes 2 servings. Why did I write a recipe that only makes 2 servings? It was the height of COVID, okay, don't judge me.
Tools:
2 ramekins
Blender
Ingredients:
½ cup persimmon pulp, strained. This takes 2 average sized fuyu persimmons. If they have seeds, remove them.
1 tbsp cornstarch
½ tsp agar agar
1 tbsp agave nectar, or to taste
2 tbsp granulated sugar
¼ cup coconut creme
½ cup almond milk
½ tsp vanilla
Steps
I tried making caramel with the [Full Of Plants](https://www.google.com/url?q=https%3A%2F%2Ffullofplants.com%2Feasy-vegan-caramel-sauce%2F) method but it was a pain in the ass and I burned myself.
For this recipe, just put the sugar at the bottom of the cup and it somehow magically turns into sauce. Lifehack!
Combine the cornstarch with the almond milk and stir it in.
whisk persimmon pulp, milk/cornstarch, agar agar, coconut creme, and agave in a saucepan. Bring to a boil.
The persimmon pulp got a little congealed, so I mixed it with an immersion blender. But you do you, boo.
Let the persimmon mixture cool a bit, for maybe 5 minutes. Stir in the vanilla. Pour it in to your ramekins or what have you.
Don’t forget and let it cool to room temperature. Agar agar waits for no man.
Refrigerate for at least 4 hours, or overnight.
To remove from ramekin, try the hot water bath method (didn’t work for me, maybe the water wasn’t hot enough.) Or just run a knife along the edges of the ramekin and jiggle it out.""")]

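# Split into 40-word chunks with a 10-word overlap between consecutive chunks.
document_splitter = DocumentSplitter(split_by="word", split_length=40, split_overlap=10)
split_docs = document_splitter.run(documents=source_docs)

# This call raises the PineconeApiException shown above.
document_store.write_documents(documents=split_docs["documents"])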

Describe your environment (please complete the following information):

OS: Colab
Haystack version: 2.3
Integration version: 1.2.1

anakin87 commented 2 months ago

To fix this, we can follow an approach similar to https://github.com/deepset-ai/haystack-core-integrations/pull/907

But at this point, I also have doubts about the format produced by the DocumentSplitter, which seems not to be compatible with several Document Stores.

bilgeyucel commented 2 months ago

IMO, fixing DocumentSplitter is a better solution. #907 seems more like a workaround.

anakin87 commented 2 months ago

I think that for Document Stores that greatly limit the types of metadata values allowed, discarding invalid metadata and warning the user may be a good approach. E.g., Chroma only supports str, int, float, bool. How can we store this structured information?
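A minimal sketch of the "discard and warn" idea, assuming the Chroma-style restriction to str, int, float and bool. The function name `drop_unsupported_meta` and the logging setup are hypothetical; this is not the code from #907 or from any integration.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Assumption: the store accepts only these scalar types, as in the Chroma example.
ALLOWED_TYPES = (str, int, float, bool)


def drop_unsupported_meta(meta: Dict[str, Any], doc_id: str) -> Dict[str, Any]:
    """Return a copy of meta without unsupported values, warning for each dropped key."""
    cleaned = {}
    for key, value in meta.items():
        if isinstance(value, ALLOWED_TYPES):
            cleaned[key] = value
        else:
            logger.warning(
                "Document %s: metadata field '%s' has unsupported type %s and was discarded.",
                doc_id,
                key,
                type(value).__name__,
            )
    return cleaned


cleaned = drop_unsupported_meta(
    {"page_number": 1, "_split_overlap": [{"doc_id": "0520d3c1", "range": (0, 305)}]},
    doc_id="37fa03ca",  # shortened placeholder id
)
```

The document is still indexed, but the overlap information is lost for that store, which is the trade-off discussed here.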

However, I agree with you that we should think about a better type for _split_overlap.
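One possible shape, purely as an illustration and not something decided in this thread: store each overlap entry as a JSON string, so the field becomes a list of strings, which the Pinecone error message lists as an accepted type. `encode_overlap` and `decode_overlap` are hypothetical helpers and the `doc_id` is a shortened placeholder. A store limited to scalar values, like Chroma as described above, would still need a different encoding.

```python
import json
from typing import Any, Dict, List


def encode_overlap(overlap: List[Dict[str, Any]]) -> List[str]:
    """Serialize each overlap entry to a JSON string."""
    return [json.dumps(entry, sort_keys=True) for entry in overlap]


def decode_overlap(encoded: List[str]) -> List[Dict[str, Any]]:
    """Parse the JSON strings back into dicts (tuples come back as lists)."""
    return [json.loads(item) for item in encoded]


overlap = [{"doc_id": "0520d3c1", "range": (0, 305)}]
encoded = encode_overlap(overlap)
decoded = decode_overlap(encoded)
```

Note that the round trip turns the `range` tuple into a list, so consumers of the decoded value would have to accept either form.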

deepset-ai / haystack-core-integrations

PineconeDocumentStore raises error to the metadata produced by DocumentSplitter #919