deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0

QdrantDocumentStore issue #642

Closed marygriffus closed 6 months ago

marygriffus commented 7 months ago

Describe the bug

When a QdrantDocumentStore is created against a preexisting Qdrant instance that already holds data, and the params do not match precisely, recreate_index destroys the old index and creates a new one. This blows away the old data, and incoming data then no longer matches the collection. Even after turning this off and updating the settings to match our Qdrant params, I ran into an issue that looks like a mismatch between the QdrantDocumentStore and the QdrantClient.

To Reproduce

Bring up a Qdrant instance and create a collection with this config:

{
  "params": {
    "vectors": {
      "fast-all-minilm-l6-v2": {
        "size": 384,
        "distance": "Cosine"
      }
    },
    "shard_number": 1,
    "replication_factor": 1,
    "write_consistency_factor": 1,
    "on_disk_payload": true
  },
  "hnsw_config": {
    "m": 16,
    "ef_construct": 100,
    "full_scan_threshold": 10000,
    "max_indexing_threads": 0,
    "on_disk": false
  },
  "optimizer_config": {
    ...
  },
  "wal_config": {
    ...
  },
  "quantization_config": null
}

Then instantiate a QdrantDocumentStore like below and use it in a pipeline.

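from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# qdrant_host, qdrant_port and fusion_payload come from the surrounding application.
document_store = QdrantDocumentStore(
    url=qdrant_host,
    port=qdrant_port,
    index=fusion_payload.datastore,
    embedding_dim=384,
    similarity="cosine",
    recreate_index=False,  # keep the existing collection and its data
    hnsw_config={"m": 16, "ef_construct": 100},
)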

At first I had recreate_index=True and left embedding_dim and hnsw_config unset, which blew my collection away, and any new data then failed to be added. Ideally, I think the document store would default to the settings discovered through the Qdrant client. However, when I switched to recreate_index=False, I continued to have issues.

The first error I ran into with this setup was this:

  File "/src/app/.venv/lib/python3.11/site-packages/haystack_integrations/document_stores/qdrant/document_store.py", line 138, in __init__
    self._set_up_collection(index, embedding_dim, recreate_index, similarity, on_disk, payload_fields_to_index)
  File "/src/app/.venv/lib/python3.11/site-packages/haystack_integrations/document_stores/qdrant/document_store.py", line 389, in _set_up_collection
    current_distance = collection_info.config.params.vectors.distance
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'distance'
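
The traceback above is consistent with `vectors` being a plain dict when the collection uses named vectors, as the config in this report does, rather than a single params object. Below is a minimal sketch of reading size and distance from either shape. `VectorParams` is a stand-in dataclass and `resolve_vector_params` is a hypothetical helper; neither is the integration's actual code.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class VectorParams:
    """Stand-in for the per-vector params object (assumption, not the client's class)."""
    size: int
    distance: str


def resolve_vector_params(vectors: Any, vector_name: Optional[str] = None) -> Any:
    """Return the params for one vector, whether `vectors` is named or unnamed."""
    if isinstance(vectors, dict):
        # Named vectors: a mapping of vector name -> params object.
        if vector_name is None:
            if len(vectors) != 1:
                raise ValueError(
                    f"Collection defines several named vectors {sorted(vectors)}; "
                    "pass vector_name explicitly."
                )
            vector_name = next(iter(vectors))
        if vector_name not in vectors:
            raise KeyError(f"No vector named {vector_name!r} in {sorted(vectors)}")
        return vectors[vector_name]
    # Single unnamed vector: `vectors` is already the params object.
    return vectors


# Shape matching the collection config in this report (one named vector).
named = {"fast-all-minilm-l6-v2": VectorParams(size=384, distance="Cosine")}
params = resolve_vector_params(named)
print(params.size, params.distance)
```

With this shape, `.size` and `.distance` are read from the resolved object, so the lookup no longer assumes an unnamed vector.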

I overrode those lines in _set_up_collection with these:

current_distance = collection_info.config.params.vectors["fast-all-minilm-l6-v2"].distance
current_vector_size = collection_info.config.params.vectors["fast-all-minilm-l6-v2"].size

and I then instead got this error:

...
  File "/src/app/.venv/lib/python3.11/site-packages/haystack_integrations/document_stores/qdrant/document_store.py", line 311, in query_by_embedding
    points = self.client.search(
             ^^^^^^^^^^^^^^^^^^^
  File "/src/app/.venv/lib/python3.11/site-packages/qdrant_client/qdrant_client.py", line 336, in search
    return self._client.search(
           ^^^^^^^^^^^^^^^^^^^^
  File "/src/app/.venv/lib/python3.11/site-packages/qdrant_client/qdrant_remote.py", line 497, in search
    search_result = self.http.points_api.search_points(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/app/.venv/lib/python3.11/site-packages/qdrant_client/http/api/points_api.py", line 1388, in search_points
    return self._build_for_search_points(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/app/.venv/lib/python3.11/site-packages/qdrant_client/http/api/points_api.py", line 636, in _build_for_search_points
    return self.api_client.request(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/app/.venv/lib/python3.11/site-packages/qdrant_client/http/api_client.py", line 76, in request
    return self.send(request, type_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/src/app/.venv/lib/python3.11/site-packages/qdrant_client/http/api_client.py", line 99, in send
    raise UnexpectedResponse.for_response(response)
qdrant_client.http.exceptions.UnexpectedResponse: Unexpected Response: 400 (Bad Request)
Raw response content:
b'{"status":{"error":"Wrong input: Vector params for  are not specified in config"},"time":0.007150167}'

From the context I would expect this to be an issue with the embedding model, but there is no method to add the embedding model to the document store, so I might be misunderstanding.
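
One possible reading of the 400 above, not confirmed anywhere in this thread: the empty name in "Vector params for  are not specified in config" suggests the search was sent against the default unnamed vector, while the collection only defines the named vector fast-all-minilm-l6-v2. The sketch below builds a search request body in both forms, assuming Qdrant's REST search accepts either a bare vector or a `{"name": ..., "vector": ...}` object. `build_search_body` is a hypothetical helper, not part of Haystack or qdrant_client.

```python
import json
from typing import Optional, Sequence


def build_search_body(
    embedding: Sequence[float],
    limit: int = 10,
    vector_name: Optional[str] = None,
) -> dict:
    """Build a points-search request body (assumed REST shape, see lead-in)."""
    vector = list(embedding)
    if vector_name:
        # Named vector: tell the server which vector field to search.
        query = {"name": vector_name, "vector": vector}
    else:
        # Unnamed vector: the server searches the default (empty-name) vector.
        query = vector
    return {"vector": query, "limit": limit, "with_payload": True}


# Toy 3-dimensional embedding for illustration; the real collection uses 384 dimensions.
body = build_search_body([0.1, 0.2, 0.3], limit=5, vector_name="fast-all-minilm-l6-v2")
print(json.dumps(body, indent=2))
```

If this reading is right, a request built without `vector_name` would target a vector the collection does not have, which would match the error text.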

Describe your environment (please complete the following information):

anakin87 commented 7 months ago

Hey @marygriffus, QdrantDocumentStore creates an opinionated Qdrant collection, which is meant to work well with Haystack.

The best way to use it is to create a new Document Store and then continue using it via Haystack. If you already have a Qdrant collection, you will probably need to migrate it manually.

Resources:

anakin87 commented 6 months ago

I'm closing this issue. Feel free to reopen it if something is unclear or does not work.