freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Compare vector DB options using performance and usage criteria #3489

Closed n-shamsi closed 1 week ago

n-shamsi commented 8 months ago

Description:

Related to https://github.com/freelawproject/courtlistener/issues/3398 and https://github.com/freelawproject/courtlistener/issues/3490. Compare a set of vector DB options (e.g., Pinecone, Weaviate, Qdrant, Milvus, Chroma, Llama, LangChain, pgvector, Elasticsearch, etc.) against the following performance and usage criteria:

  1. Suitability for Data Type and Size
  2. Query Performance
  3. Accuracy and Relevance for a Given Task
  4. Machine Learning Integration
  5. Resource Efficiency
  6. Flexibility and Customization
  7. Ease of Use and Maintenance
  8. Security and Compliance
  9. Community and Support
  10. Deployment Options
  11. Update and Indexing Capabilities
  12. Interoperability

Initial Overview (thanks ChatGPT!):

  1. Data Type and Size
  1. Query Performance
  1. Accuracy and Relevance
  1. Machine Learning Integration
  1. Resource Efficiency
  1. Flexibility and Customization
  1. Ease of Use and Maintenance
  1. Security and Compliance
  1. Community and Support
  1. Deployment Options
  1. Update and Indexing Capabilities
  1. Interoperability

Outstanding Questions:

Bonus:

Very nice summary table from this blog (writer is co-Founder of Vectorview): https://benchmark.vectorview.ai/vectordbs.html

[Screenshot of the summary table, 2023-12-21]

mlissner commented 8 months ago

Looking at all this, I have a few thoughts:

  1. I've heard a lot of people talking about Pinecone, and one person I talked to swore by Qdrant.

  2. We have Elastic, so I'm looking for a reason not to use it, and I'm not seeing that.

    One thing I think we can do with Elastic is use our existing indexes to filter. For example, right now we have an index on the court field, so we can easily filter to just SCOTUS decisions. I think we can then combine that filter with a semantic search, so the vector search runs only over SCOTUS opinions, which would be really cool and a differentiator from many in the field. I'd love it if somebody could verify this.

  3. I think we can eliminate pgvector. I'm hearing a lot about how it doesn't scale, and adding things to Postgres freaks me out generally.
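
To make the idea in point 2 concrete, here is a minimal sketch of what a filtered semantic query might look like. It only builds the request body for an Elasticsearch 8.x `_search` call that uses the top-level `knn` option with a `filter`. The field names `embedding` and `court_id`, the helper name `build_filtered_knn_query`, and the three-element vector are assumptions for illustration and are not taken from the CourtListener indexes.

```python
from typing import Any


def build_filtered_knn_query(
    query_vector: list[float],
    court_id: str,
    k: int = 10,
    num_candidates: int = 100,
    vector_field: str = "embedding",  # hypothetical dense_vector field
    court_field: str = "court_id",    # hypothetical keyword field
) -> dict[str, Any]:
    """Build a _search body: approximate kNN restricted by a term filter."""
    # Elasticsearch rejects kNN requests where k exceeds num_candidates.
    if k > num_candidates:
        raise ValueError("k must not exceed num_candidates")
    return {
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            # The filter is applied during the vector search, so only
            # documents from the chosen court are considered as neighbors.
            "filter": {"term": {court_field: court_id}},
        },
        "size": k,
    }


# Toy vector; a real one would come from an embedding model.
body = build_filtered_knn_query([0.1, 0.2, 0.3], court_id="scotus")
```

The resulting dict could be sent as the JSON body of a `_search` request against an index whose mapping has a `dense_vector` field. Whether this performs well at our scale would still need to be verified against the real data.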

So if we're looking for a forward direction from a benevolent dictator, my vote is to see if we can make Elastic work without pulling our hair out, even though it seems a bit less turnkey. It also feels worth it to choose one or two of the other new entrants (Pinecone, Llama, Qdrant...) and compare them as we go to see if we just plain love one of them more, but that means more work...

legaltextai commented 1 week ago

I've tried many vector stores and believe Postgres is an amazing self-hosted vector storage option, as is Supabase, which is built on Postgres and can also be self-hosted.

From this discussion, I took it that we don't want to introduce major changes and would rather stick with what's been working for us. If we are happy with Elastic, and it is where we are going to continue to store opinion texts anyway, we can use Elastic for the vector search too. You saw the working prototype.

mlissner commented 1 week ago

Great. I agree. Thank you @n-shamsi for your analysis here. Our semantic search project is finally moving along at a quick clip. If you want to see what's going on with it, you can check out the project board here: https://github.com/orgs/freelawproject/projects/56/views/1

Thank you again!