deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0
118 stars 118 forks

_original_id should not be required in weaviate with Haystack 2.6 #1171

Open bwbw723 opened 6 days ago

bwbw723 commented 6 days ago

I am using the WeaviateEmbeddingRetriever to query my data. It works fine with the default class in Weaviate. Once I switch to a class I created myself with a customized schema, I get the error below:

  File "/root/TS_ph3/00_WeaviateEmbeddingRetriever.py", line 70, in <module>
    result = query_pipeline.run({"text_embedder": {"text": query}})
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack/core/pipeline/pipeline.py", line 229, in run
    res: Dict[str, Any] = self._run_component(name, components_inputs[name])
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack/core/pipeline/pipeline.py", line 67, in _run_component
    res: Dict[str, Any] = instance.run(**inputs)
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/components/retrievers/weaviate/embedding_retriever.py", line 138, in run
    documents = self._document_store._embedding_retrieval(
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 538, in _embedding_retrieval
    return [self._to_document(doc) for doc in result.objects]
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 538, in <listcomp>
    return [self._to_document(doc) for doc in result.objects]
  File "/root/.cache/pypoetry/virtualenvs/search-infra-7HLB3Aeo-py3.10/lib/python3.10/site-packages/haystack_integrations/document_stores/weaviate/document_store.py", line 306, in _to_document
    document_data["id"] = document_data.pop("_original_id")
KeyError: '_original_id'

I checked the code and found that _to_document reads _original_id and sets it as the Document ID. I updated document_store.py locally to set document_data["id"] to a generated UUID if the data does not have one, and with that change I get the expected results. I do not think data in Weaviate should be forced to have an _original_id property, but with the current code an error is raised whenever _original_id is missing. I would prefer an if statement that handles both cases. Please correct me if I have misunderstood anything.
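The if statement described above could look roughly like the following. This is only a minimal sketch, not the integration's actual code: resolve_document_id is a hypothetical helper, and the fallback to the object's own Weaviate UUID (passed in as weaviate_uuid) is an assumption about what a reasonable default would be.

```python
import uuid
from typing import Any, Dict, Optional


def resolve_document_id(
    properties: Dict[str, Any], weaviate_uuid: Optional[str] = None
) -> str:
    """Hypothetical helper: pick a Document ID without requiring _original_id."""
    # Remove _original_id if present; it is not part of the Document dataclass.
    original_id = properties.pop("_original_id", None)
    if original_id is not None:
        return str(original_id)
    # Assumption: fall back to the UUID Weaviate assigned to the object, if known.
    if weaviate_uuid is not None:
        return str(weaviate_uuid)
    # Last resort: generate a fresh UUID, as in the local patch described above.
    return str(uuid.uuid4())


# Collection written by Haystack: _original_id is present and wins.
props = {"_original_id": "abc123", "content": "hello"}
doc_id = resolve_document_id(props)

# Custom collection without _original_id: no KeyError, a fallback ID is used.
custom_props = {"content": "hello"}
fallback_id = resolve_document_id(custom_props, weaviate_uuid="0f0e0d0c-0000-4000-8000-000000000001")
```

Inside _to_document, the line that currently raises would then become something like document_data["id"] = resolve_document_id(document_data, ...), with the second argument depending on what the Weaviate client exposes for the object.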

The packages I am using are:

    haystack-ai = "2.6.1"
    fastembed-haystack = "1.3.0"
    weaviate-client = "^4.9.0"
    weaviate-haystack = "^4.0.0"

    def _to_document(self, data: DataObject[Dict[str, Any], None]) -> Document:
        """
        Converts a data object read from Weaviate into a Document.
        """
        document_data = data.properties
        # The KeyError is raised on the next line. My local fix sets document_data["id"]
        # to a generated UUID when the data has no "_original_id".
        document_data["id"] = document_data.pop("_original_id")
        if isinstance(data.vector, List):
            document_data["embedding"] = data.vector
        elif isinstance(data.vector, Dict):
            document_data["embedding"] = data.vector.get("default")
        else:
            document_data["embedding"] = None

        if (blob_data := document_data.get("blob_data")) is not None:
            document_data["blob"] = {
                "data": base64.b64decode(blob_data),
                "mime_type": document_data.get("blob_mime_type"),
            }

        # We always delete these fields as they're not part of the Document dataclass

anakin87 commented 2 days ago

The rationale behind this field is explained here: https://github.com/deepset-ai/haystack-core-integrations/blob/67e08d0b7e5a7f51f52bb0d40fe40b0ff2caf43a/integrations/weaviate/src/haystack_integrations/document_stores/weaviate/document_store.py#L276-L278

This is done to provide a robust default to users who don't need serious customization.

For simplicity, you can include this field in your collection configuration: https://github.com/deepset-ai/haystack-core-integrations/blob/67e08d0b7e5a7f51f52bb0d40fe40b0ff2caf43a/integrations/weaviate/src/haystack_integrations/document_stores/weaviate/document_store.py#L40

Does this create problems?
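
As a rough illustration of the suggestion to include the field in the collection configuration, the sketch below adds an _original_id text property to a custom schema if it is missing. The settings layout (a "class" name plus a "properties" list of name/dataType entries) is assumed to match what the document store accepts, and with_original_id, MyArticles and author are hypothetical names used only for this example.

```python
from typing import Any, Dict

# Assumed shape of the property the document store expects to find.
ORIGINAL_ID_PROPERTY = {"name": "_original_id", "dataType": ["text"]}


def with_original_id(collection_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of the settings that declares _original_id."""
    settings = dict(collection_settings)
    properties = list(settings.get("properties", []))
    # Only add the property if the custom schema does not already declare it.
    if not any(prop.get("name") == "_original_id" for prop in properties):
        properties.insert(0, dict(ORIGINAL_ID_PROPERTY))
    settings["properties"] = properties
    return settings


# Example custom schema (illustrative names only).
custom_settings = {
    "class": "MyArticles",
    "properties": [
        {"name": "content", "dataType": ["text"]},
        {"name": "author", "dataType": ["text"]},
    ],
}

settings = with_original_id(custom_settings)
```

The resulting dictionary would then be used as the collection configuration when creating the document store. Objects that were written to the collection before the property existed may still lack a value for it, so whether this alone removes the KeyError for existing data is not verified here.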