Closed bogdankostic closed 10 months ago
I think there are additional differences regarding labels, although they are a bit hidden. E.g. SQLDocumentStore (and thus Pinecone, Milvus and FAISS too) doesn't support metadata and filters in labels, and others (I think it's just Weaviate) don't support labels at all.
Closing as outdated
We currently support 10 different DocumentStores in Haystack. In the following overview, I try to list the public methods that are available in only some of the DocumentStores.
get_metadata_values_by_key
Supported: ElasticsearchDocumentStore, OpenDistroElasticsearchDocumentStore, OpenSearchDocumentStore
Not supported: InMemoryDocumentStore, MilvusDocumentStore, PineconeDocumentStore, SQLDocumentStore, WeaviateDocumentStore
This method returns the distinct values of a specific metadata field inside a DocumentStore, together with their counts. Currently, it is only available for DocumentStores inheriting from ElasticsearchDocumentStore. I'd guess that this is not a heavily used method and would therefore not give high priority to adding it to the other DocumentStores.
update_document_meta
Supported: ElasticsearchDocumentStore, OpenDistroElasticsearchDocumentStore, OpenSearchDocumentStore, MilvusDocumentStore, PineconeDocumentStore, SQLDocumentStore, WeaviateDocumentStore
Not supported: InMemoryDocumentStore
This method allows the user to update the meta fields of a Document by providing the Document ID and a meta dictionary. It is available in almost every DocumentStore; only InMemoryDocumentStore does not support it. However, the implementations seem to differ: in most DocumentStores, only the specified fields seem to be overwritten and the unspecified ones remain unchanged, whereas in PineconeDocumentStore all of the Document's metadata seems to be replaced by the specified meta dictionary.
query / query_batch
Supported: ElasticsearchDocumentStore, OpenDistroElasticsearchDocumentStore, OpenSearchDocumentStore, WeaviateDocumentStore
These methods are supported in DocumentStores inheriting from KeywordDocumentStore, i.e. DocumentStores that inherently support keyword-based search. It probably would not make sense to add them to the other DocumentStores, as those only support querying by embedding. Also, I think we should remove this method from WeaviateDocumentStore: there, the query param is not even supported, and the method rather has the functionality of get_all_documents with filters.
query_by_embedding / update_embeddings
Not supported: SQLDocumentStore
This method allows querying the DocumentStores using dense methods and is currently supported in all DocumentStores except for SQLDocumentStore. It might make sense to implement this functionality using something like pgvector. Also, filtering support for FAISSDocumentStore and Milvus2DocumentStore seems to be missing here. While FAISS still does not support filtering, filtering should be possible with Milvus2.
train_index
Supported: FAISSDocumentStore
This method is needed for some of the supported index types in FAISS. I had a quick look at Milvus, Weaviate and Pinecone documentation and it seems that with these DocumentStores, training is not required.
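To illustrate why some index types need a training step at all, here is a toy sketch in plain Python. It assumes a clustering-based index in the spirit of FAISS's IVF family, where coarse centroids are learned from sample vectors before any vector can be assigned to a bucket. The class ToyIVFIndex and all of its methods are hypothetical and only mimic the idea; this is neither FAISS nor Haystack code.

```python
import math
import random


class ToyIVFIndex:
    """Toy inverted-file index: vectors are bucketed by their nearest centroid.

    Hypothetical illustration only; this is not FAISS or Haystack code.
    """

    def __init__(self, n_lists):
        self.n_lists = n_lists
        self.centroids = []   # learned during train()
        self.buckets = {}     # centroid index -> list of vectors
        self.is_trained = False

    def _nearest(self, vec):
        # Index of the centroid closest to vec (Euclidean distance).
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(vec, self.centroids[i]))

    def train(self, vectors, n_iter=10, seed=0):
        # A few rounds of plain k-means to learn the coarse centroids.
        rng = random.Random(seed)
        self.centroids = [list(v) for v in rng.sample(vectors, self.n_lists)]
        for _ in range(n_iter):
            groups = [[] for _ in range(self.n_lists)]
            for vec in vectors:
                groups[self._nearest(vec)].append(vec)
            for i, group in enumerate(groups):
                if group:  # keep the old centroid if a cluster ends up empty
                    self.centroids[i] = [sum(col) / len(group) for col in zip(*group)]
        self.is_trained = True

    def add(self, vec):
        # Without centroids there is nothing to assign the vector to.
        if not self.is_trained:
            raise RuntimeError("index must be trained before vectors are added")
        self.buckets.setdefault(self._nearest(vec), []).append(vec)
```

An index type that simply stores every vector has no such prerequisite, which fits the observation that only some of the FAISS index types need this method.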
save / load
Supported: FAISSDocumentStore
Not supported: InMemoryDocumentStore
Saving a DocumentStore and loading it again is currently only supported in FAISSDocumentStore. For the other DocumentStores, these methods might not be needed, as they already use persistent storage, i.e. indexed Documents don't get lost. Only InMemoryDocumentStore would need such functionality, but here we (or even the user) might simply use pickle.
get_documents_by_vector_ids / update_vector_ids / reset_vector_ids
Supported: SQLDocumentStore, FAISSDocumentStore, PineconeDocumentStore, MilvusDocumentStore
These methods are used to map the vectors that are indexed in the different vector databases to the Documents that are indexed in the SQL database, and they are supported by DocumentStores inheriting from SQLDocumentStore. I don't think that these methods need to be public; I would rather make them private.
get_scores / get_scores_numpy / get_scores_torch
Supported: InMemoryDocumentStore
These methods are used to calculate the scores for dense retrieval in InMemoryDocumentStore. IMO, these methods don't need to be public either.
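For context, a minimal sketch of what score calculation for dense retrieval typically looks like, written in plain Python so that it has no dependencies. The function name mirrors get_scores, but the signature, the similarity parameter and its two values are assumptions made for illustration; the actual NumPy and PyTorch variants presumably do the same with vectorized operations.

```python
import math


def get_scores(query_emb, doc_embs, similarity="dot_product"):
    """Score each document embedding against the query embedding.

    Illustrative only: the signature and the `similarity` values are assumptions.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = []
    for emb in doc_embs:
        score = dot(query_emb, emb)
        if similarity == "cosine":
            # Normalize by both vector lengths; guard against zero vectors.
            norm = math.sqrt(dot(query_emb, query_emb)) * math.sqrt(dot(emb, emb))
            score = score / norm if norm else 0.0
        scores.append(score)
    return scores
```

A helper like this is only ever called from the store's own query_by_embedding, which is consistent with the suggestion that it does not need to be part of the public interface.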