Compare vector DB options using performance and usage criteria

n-shamsi commented 10 months ago

Description:

Related to https://github.com/freelawproject/foresight/issues/8 and https://github.com/freelawproject/courtlistener/issues/3490. Compare a set of vector DB options (e.g., Pinecone, Weaviate, Qdrant, Milvus, Chroma, Llama, LangChain, pgvector, Elasticsearch, etc.) along the following performance and usage criteria:

Suitability for Data Type and Size
Query Performance
Accuracy and Relevance for a Given Task
Machine Learning Integration
Resource Efficiency
Flexibility and Customization
Ease of Use and Maintenance
Security and Compliance
Community and Support
Deployment Options
Update and Indexing Capabilities
Interoperability

Initial Overview (thanks ChatGPT!):

Data Type and Size

Pinecone, Weaviate, Qdrant, Milvus, Chroma: These are generally well-suited for handling large-scale, complex datasets and are optimized for high-performance vector search.
pgVector: Being an extension of PostgreSQL, it is more versatile for general-purpose data but may not be as optimized for large-scale vector data as the others.
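Llama, LangChain: LangChain is an application framework that connects to vector stores rather than a vector database itself, and "Llama" likely refers to LlamaIndex, which plays a similar role. Suitability for a given data type and size therefore depends on the store behind them.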
Elasticsearch: Highly scalable for diverse data types, including text, numerical, and structured data, but vector search capabilities might be less optimized compared to specialized vector databases.

Query Performance

Pinecone, Milvus, Qdrant: Known for high query performance, especially in large-scale environments.
Weaviate, Chroma: Offer good performance with additional features like semantic search.
pgVector, Llama: Might not match the performance of more specialized vector databases in high-load scenarios.
Langchain: Its performance is more specialized for language-related tasks.
Elasticsearch: Excellent overall performance, with some limitations in vector search compared to specialized vector databases.
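
The query performance claims above are qualitative, so one way to check them against our own data would be to time the same queries against each candidate. The following is a minimal sketch, not a full benchmark: `search_fn` is a hypothetical wrapper around whichever client call is being tested, and `fake_search` is a stand-in so the code runs without any database.

```python
import math
import time
from typing import Callable, Dict, List, Sequence


def percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure_latency(
    search_fn: Callable[[List[float]], object],
    queries: Sequence[List[float]],
) -> Dict[str, float]:
    """Time each call to search_fn and report p50/p95/max in milliseconds."""
    timings_ms = []
    for query in queries:
        start = time.perf_counter()
        search_fn(query)  # hypothetical: one vector search request
        timings_ms.append((time.perf_counter() - start) * 1000)
    timings_ms.sort()
    return {
        "p50_ms": percentile(timings_ms, 50),
        "p95_ms": percentile(timings_ms, 95),
        "max_ms": timings_ms[-1],
    }


def fake_search(query: List[float]) -> List[int]:
    """Stand-in for a real client call: indices of the 3 largest components."""
    return sorted(range(len(query)), key=lambda i: -query[i])[:3]


stats = measure_latency(fake_search, [[0.1, 0.9, 0.4, 0.7]] * 20)
print(stats)
```

Reporting tail latency (p95) alongside the median matters here because high-load behavior is exactly where the overview expects the options to differ.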

Accuracy and Relevance

Pinecone, Weaviate, Milvus, Qdrant, Chroma: Generally provide high accuracy in search results, with capabilities for tuning and customizing relevance algorithms.
pgVector, Llama, Langchain: Accuracy may vary based on the specific use case and implementation.
Elasticsearch: Good accuracy, with extensive features for customizing search relevance.
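
For approximate vector search, "accuracy" is commonly measured as recall@k: the share of the true nearest neighbors (found by exact brute-force search) that the database actually returns. The sketch below shows that calculation on a toy corpus. The vectors and the `approx` result list are made up for illustration and do not come from any of the databases listed.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Brute-force ground truth: ids of the k most similar vectors."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate result also contains."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
truth = exact_top_k([1.0, 0.0], corpus, k=2)
approx = [0, 3]  # made-up ids standing in for a vector DB's answer
score = recall_at_k(approx, truth)
print(truth, score)
```

Running the same comparison on a sample of real opinions would give a per-database number to set against the latency figures, since index settings usually trade one for the other.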

Machine Learning Integration

Pinecone, Weaviate, Milvus, Qdrant, Chroma: Offer strong integration with ML models and frameworks.
pgVector: As a PostgreSQL extension, integration depends on PostgreSQL's capabilities.
Llama, Langchain: Might offer specialized integration for certain types of ML models.
Elasticsearch: Supports ML integration but may not be as seamless as some vector-specific databases.

Resource Efficiency

Pinecone, Milvus, Qdrant, Chroma: Designed for efficiency in resource-intensive environments.
Weaviate: Efficient but also focuses on providing additional semantic capabilities.
pgVector, Llama, Langchain: Efficiency can vary greatly depending on the use case and setup.
Elasticsearch: Resource-efficient but might require more tuning for vector search.

Flexibility and Customization

Pinecone, Weaviate, Milvus, Qdrant, Chroma: Generally offer good flexibility and customization options.
pgVector: Flexibility is tied to PostgreSQL’s capabilities.
Llama, Langchain: May offer specialized customization for particular applications.
Elasticsearch: Highly flexible and customizable, with a wide range of plugins and configurations.

Ease of Use and Maintenance

Pinecone, Milvus, Qdrant, Chroma: Aim to balance performance with user-friendliness.
Weaviate: Known for its ease of use, particularly in semantic search contexts.
pgVector: As an extension of PostgreSQL, it inherits its ease of use and maintenance.
Llama, Langchain: Ease of use may depend on the specific application.
Elasticsearch: Widely used with extensive documentation, but can be complex to configure and maintain.

Security and Compliance

Pinecone, Weaviate, Milvus, Qdrant, Chroma: Generally provide robust security features, but specific compliance capabilities may vary.
pgVector: Inherits PostgreSQL’s security features.
Llama, Langchain: Security features may be more specific to their application domain.
Elasticsearch: Offers strong security features, with compliance depending on the specific deployment.

Community and Support

Pinecone, Weaviate, Milvus, Qdrant, Chroma: Growing communities, with varying levels of vendor and community support.
pgVector: Benefits from the large PostgreSQL community.
Llama, Langchain: Smaller, more specialized communities.
Elasticsearch: Large and active community, extensive documentation, and commercial support available.

Deployment Options

Pinecone, Weaviate, Milvus, Qdrant, Chroma: Offer flexible deployment options, including cloud and on-premise.
pgVector: Dependent on PostgreSQL's deployment options.
Llama, Langchain: May have more limited or specialized deployment options.
Elasticsearch: Highly flexible in deployment, including cloud services like Elastic Cloud.

Update and Indexing Capabilities

Pinecone, Milvus, Qdrant, Chroma: Efficient updating and indexing capabilities for dynamic datasets.
Weaviate: Good indexing capabilities with a focus on semantic understanding.
pgVector: Indexing capabilities are tied to PostgreSQL’s performance.
Llama, Langchain: Specific to their application domains.
Elasticsearch: Excellent indexing capabilities, but vector indexing might not be as efficient as specialized databases.

Interoperability

All: Varying degrees of interoperability, with most offering API access. Integration with other systems can vary based on specific use cases and existing technology stacks.

Outstanding Questions:

[ ] What's the relationship between data type and search accuracy for a given DB?
[ ] What technical implementations optimize query performance for each DB? Are they suitable for our data?
[ ] What specific ML integration is used for a given DB, and is it useful for our data?

Bonus:

Very nice summary table from this blog (writer is co-Founder of Vectorview): https://benchmark.vectorview.ai/vectordbs.html

mlissner commented 10 months ago

Looking at all this, I have a few thoughts:

I've heard a lot of people talking about Pinecone, and one person I talked to swore by Qdrant.
We have Elastic, so I'm looking for a reason not to use it, and I'm not seeing that.

One thing I think we can do with Elastic is use our existing indexes to filter. For example, right now, we have indexes on the court field, so we can easily filter to just SCOTUS decisions. I think we can then combine that filter with a semantic search of just SCOTUS stuff, which would be really cool and a differentiator from many in the field. I'd love it if somebody could verify this.
I think we can eliminate pgvector. I'm hearing a lot about how it doesn't scale, and adding things to Postgres freaks me out generally.
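
On the filter-plus-semantic-search idea above: Elasticsearch 8.x has a `knn` search option that accepts a `filter` clause, and that filter is applied during the approximate nearest-neighbor search rather than afterwards. A rough sketch of what such a request could look like follows. The field names `embedding` and `court_id` and the value `scotus` are assumptions for illustration, not taken from our actual mappings, and the query vector is a tiny placeholder.

```python
from typing import Any, Dict, List


def build_filtered_knn(
    query_vector: List[float],
    court: str,
    k: int = 10,
    num_candidates: int = 100,
) -> Dict[str, Any]:
    """Build the `knn` section of an Elasticsearch search request.

    Field names are hypothetical:
      - "embedding": a dense_vector field holding the opinion embedding
      - "court_id":  a keyword field identifying the court
    """
    return {
        "field": "embedding",
        "query_vector": query_vector,
        "k": k,                            # neighbors to return
        "num_candidates": num_candidates,  # candidates examined per shard
        # Restricts the vector search to documents from one court.
        "filter": {"term": {"court_id": court}},
    }


knn = build_filtered_knn([0.12, -0.03, 0.88], court="scotus")
print(knn)
```

The resulting dict would be sent as the top-level `knn` option of the `_search` API. Whether this performs well on an index of our size is still the open question raised above.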

So if we're looking for a forward direction from a benevolent dictator, my vote is to see if we can make Elastic work without pulling our hair out, even though it seems a bit less turnkey. It also feels worth it to choose one or two of the other new entrants (Pinecone, Llama, Qdrant...) and compare them as we go to see if we just plain love one of them more, but that means more work...

legaltextai commented 2 months ago

i tried many vector stores and believe postgres is an amazing self-hosted vector storage option, or supabase, which is built on postgres and can also be self-hosted.

from this discussion here i took it that we don't want to introduce major changes and would rather stick with what's been working for us. if we are happy with elastic and this is where we are going to continue to store opinion texts anyway, we can work with elastic for the vector search too. you saw the working prototype.

mlissner commented 2 months ago

Great. I agree. Thank you @n-shamsi for your analysis here. Our semantic search project is finally moving along at a quick clip. If you want to see what's going on with it, you can check out the project board here: https://github.com/orgs/freelawproject/projects/56/views/1

Thank you again!
