deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0
17.34k stars 1.89k forks

Differences in DocumentStores #2641

Closed bogdankostic closed 10 months ago

bogdankostic commented 2 years ago

We currently support 10 different DocumentStores in Haystack. In the following overview, I try to list the public methods that are available in only some of the DocumentStores.

get_metadata_values_by_key

This method lets the user get the distinct values of a specific metadata field in a DocumentStore, together with their counts. Currently, it is only available for DocumentStores inheriting from ElasticsearchDocumentStore. I'd guess that this method is not used much and would therefore not give high priority to adding it to the other DocumentStores.
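To make the expected behavior concrete, here is a minimal sketch of what such a method computes, using plain dictionaries instead of a real DocumentStore. The helper name `count_meta_values`, the document layout and the output shape (a list of `{"value", "count"}` dicts) are assumptions for illustration, not Haystack's implementation.

```python
from collections import Counter
from typing import Any, Dict, List


def count_meta_values(documents: List[Dict[str, Any]], key: str) -> List[Dict[str, Any]]:
    """Return the distinct values of one metadata field with their counts.

    Hypothetical helper: documents are plain dicts carrying a "meta" dict.
    Documents that lack the key are skipped.
    """
    counts = Counter(
        doc["meta"][key] for doc in documents if key in doc.get("meta", {})
    )
    # most_common() orders the values by descending count.
    return [{"value": value, "count": count} for value, count in counts.most_common()]


docs = [
    {"content": "a", "meta": {"lang": "en"}},
    {"content": "b", "meta": {"lang": "de"}},
    {"content": "c", "meta": {"lang": "en"}},
    {"content": "d", "meta": {}},  # no "lang" field, so it is ignored
]
result = count_meta_values(docs, "lang")
```

A store without native aggregation support could in principle fall back to something like this over all of its documents, at the cost of loading every document's metadata.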

update_document_meta

This method allows the user to update the meta fields of a Document by providing the Document ID and a meta dictionary. It is available in almost every DocumentStore; only InMemoryDocumentStore does not support it. However, the implementations seem to differ: for most DocumentStores, only the specified fields seem to be overwritten while the unspecified ones remain unchanged, whereas for PineconeDocumentStore it seems that all of the Document's metadata is replaced by the specified meta dictionary.
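The difference between the two behaviors can be sketched with plain dictionaries. Both function names are hypothetical and only illustrate the semantics described above; they are not taken from any DocumentStore.

```python
from typing import Any, Dict


def update_meta_merge(current: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Overwrite only the specified fields and keep the unspecified ones."""
    return {**current, **new}


def update_meta_replace(current: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Discard the existing metadata and keep only the specified fields."""
    return dict(new)


meta = {"author": "alice", "year": 2021}

merged = update_meta_merge(meta, {"year": 2022})      # "author" survives
replaced = update_meta_replace(meta, {"year": 2022})  # "author" is lost
```

With the same call, a user would keep `author` in one case and silently lose it in the other, which is why the behavior should probably be consistent across DocumentStores.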

query / query_batch

These methods are supported in DocumentStores inheriting from KeywordDocumentStore, i.e. DocumentStores that inherently support keyword-based search. It probably would not make sense to add them to the other DocumentStores, as those only support querying by embedding. Also, I think we should remove this method from WeaviateDocumentStore: there, the query param is not even supported, and it behaves more like get_all_documents with filters.

query_by_embedding / update_embeddings

These methods enable querying the DocumentStores using dense methods and are currently supported in all DocumentStores except SQLDocumentStore. Maybe it would make sense to implement this functionality there using something like pgvector?

Also, filtering support for FAISSDocumentStore and Milvus2DocumentStore seems to be missing here. While FAISS still does not support filtering, filtering should be possible with Milvus2.

train_index

This method is needed for some of the supported index types in FAISS. I had a quick look at Milvus, Weaviate and Pinecone documentation and it seems that with these DocumentStores, training is not required.

save / load

Saving and loading a DocumentStore is currently only supported in FAISSDocumentStore. For the other DocumentStores, these methods might not be needed, as they already use persistent storage, i.e. indexed Documents don't get lost. Only InMemoryDocumentStore would need such functionality, but here we (or even the user) might simply use pickle.
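A rough sketch of the pickle idea, using a hypothetical `TinyInMemoryStore` class as a stand-in for InMemoryDocumentStore (the class, its attributes and its method signatures are assumptions made for this example). It pickles only the plain document dictionary rather than the whole object, so the file does not depend on the class definition.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Iterable


class TinyInMemoryStore:
    """Hypothetical stand-in for an in-memory document store."""

    def __init__(self) -> None:
        self.documents: Dict[str, dict] = {}

    def write_documents(self, docs: Iterable[dict]) -> None:
        for doc in docs:
            self.documents[doc["id"]] = doc

    def save(self, path: Path) -> None:
        # Persist only the plain dict of documents.
        with open(path, "wb") as f:
            pickle.dump(self.documents, f)

    @classmethod
    def load(cls, path: Path) -> "TinyInMemoryStore":
        # Only unpickle files from a trusted source: pickle can execute code.
        store = cls()
        with open(path, "rb") as f:
            store.documents = pickle.load(f)
        return store


store = TinyInMemoryStore()
store.write_documents([{"id": "1", "content": "hello", "meta": {"lang": "en"}}])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "store.pkl"
    store.save(path)
    restored = TinyInMemoryStore.load(path)
```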

get_documents_by_vector_ids / update_vector_ids / reset_vector_ids

These methods are used to map the vectors that are indexed in the different vector databases to the Documents that are indexed in the SQL database, and they are supported by DocumentStores inheriting from SQLDocumentStore. I don't think that these methods need to be public; I would rather make them private.

get_scores / get_scores_numpy / get_scores_torch

These methods are used to calculate the scores for dense retrieval in InMemoryDocumentStore. IMO, these methods don't need to be public either.
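For context, the kind of computation such scoring helpers perform can be sketched in pure Python. The function names and the choice of dot product and cosine similarity are assumptions for illustration; the actual methods work on NumPy arrays or torch tensors and may differ in detail.

```python
import math
from typing import List, Sequence


def dot_product_scores(query: Sequence[float], embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Score each document embedding by its dot product with the query embedding."""
    return [sum(q * d for q, d in zip(query, emb)) for emb in embeddings]


def cosine_scores(query: Sequence[float], embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Score each document embedding by its cosine similarity to the query embedding."""
    query_norm = math.sqrt(sum(q * q for q in query))
    scores = []
    for emb, dot in zip(embeddings, dot_product_scores(query, embeddings)):
        emb_norm = math.sqrt(sum(d * d for d in emb))
        # Zero vectors have no direction, so score them as 0.0.
        scores.append(dot / (query_norm * emb_norm) if query_norm and emb_norm else 0.0)
    return scores


query_emb = [1.0, 0.0]
doc_embs = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]]

dot_scores = dot_product_scores(query_emb, doc_embs)
cos_scores = cosine_scores(query_emb, doc_embs)
```

Since helpers like these only serve query_by_embedding internally, nothing in the sketch needs to be part of the public interface.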

tstadel commented 2 years ago

I think there are additional differences regarding labels, though they are a bit hidden. E.g. SQLDocumentStore, and thus Pinecone, Milvus and FAISS too, doesn't support metadata and filters in labels, and others (I think it's just Weaviate) don't support labels at all.

masci commented 10 months ago

Closing as outdated