deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
16.94k stars 1.85k forks source link

fix: `SentenceWindowRetriever` convert metadata fields back to int #8304

Closed davidsbatista closed 1 month ago

davidsbatista commented 1 month ago

Proposed Changes:

Pinecone converts all meta numbers in the meta field to float (https://docs.pinecone.io/guides/data/filter-with-metadata). This causes the SentenceWindowRetriever to crash completely.

This PR checks if the metadata values are floats and converts them back to integers making PineCone supported by the SentenceWindowRetriever

How did you test it?

Checklist

coveralls commented 1 month ago

Pull Request Test Coverage Report for Build 10594089336

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details


Totals Coverage Status
Change from base Build 10592675152: 0.2%
Covered Lines: 7021
Relevant Lines: 7770

💛 - Coveralls
anakin87 commented 1 month ago

Since this change is related to how Pinecone handles metadata, I think it would be simpler and more appropriate to intervene on the Pinecone side.

Here, for example: https://github.com/deepset-ai/haystack-core-integrations/blob/50352b95e0caeedeb5779eeec3b6a7bd79542d80/integrations/pinecone/src/haystack_integrations/document_stores/pinecone/document_store.py#L270

WDYT?

davidsbatista commented 1 month ago

It's a good suggestion, but we can't just blindly convert everything back to integers.

To make this generic, we need to know which type the values were originally, i.e.: before being stored in Pinecone; and I don't know where to store that information.

anakin87 commented 1 month ago

Yes, I understand.

I'm simply suggesting that we move the _convert_to_int method to Pinecone with the same keys used here.

Unrelated: I have the impression that the fact that our DocumentSplitter creates a page_number meta field poses risks of overriding user-provided information.

davidsbatista commented 1 month ago

``

Yes, I understand.

I'm simply suggesting that we move the _convert_to_int method to Pinecone with the same keys used here.

ah ok, now I understand what you suggested - the only thing is that for the tests we will need to import Pinecone/add them to the CI

anakin87 commented 1 month ago

ah ok, now I understand what you suggested - the only thing is that for the tests we will need to import Pinecone/add them to the CI

No, I suggest modifying the Pinecone Document Store in core-integrations.

Here: https://github.com/deepset-ai/haystack-core-integrations/blob/50352b95e0caeedeb5779eeec3b6a7bd79542d80/integrations/pinecone/src/haystack_integrations/document_stores/pinecone/document_store.py#L270

davidsbatista commented 1 month ago

moving to core-integrations