freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Empirically Determine an Appropriate Vector Size #3490

Open n-shamsi opened 8 months ago

n-shamsi commented 8 months ago

Description:

Related to https://github.com/freelawproject/courtlistener/issues/3398 and https://github.com/freelawproject/courtlistener/issues/3489. Empirically determine an appropriate vector size for a vector DB using the following strategies (this list is not exhaustive):

  1. Benchmarking and Testing
  2. Data Analysis and Dimensionality Reduction
  3. Iterative Experimentation
  4. Resource Constraints Consideration
  5. Use Case Specificity
  6. Model and Algorithm Dependency
  7. Scalability and Growth Planning
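As a rough illustration of the benchmarking strategy, the sketch below measures how much top-k retrieval overlap is kept when vectors are cut down to fewer dimensions. Everything in it is an assumption made for illustration: the vectors are synthetic Gaussian noise rather than real embeddings, plain truncation stands in for a proper dimensionality reduction step, the helper names (`top_k`, `recall_at_k`) are hypothetical, and 4 bytes per component assumes float32 storage. A real run would swap in embeddings of actual opinions and a labelled evaluation set.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, corpus, k, dim):
    """Indices of the k corpus vectors closest to the query,
    comparing only the first `dim` components (brute force)."""
    q = query[:dim]
    scored = [(cosine(q, v[:dim]), i) for i, v in enumerate(corpus)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


def recall_at_k(queries, corpus, k, dim, full_dim):
    """Fraction of the full-dimension top-k results that are still
    returned when only `dim` components are used."""
    hits = 0
    for q in queries:
        truth = set(top_k(q, corpus, k, full_dim))
        found = set(top_k(q, corpus, k, dim))
        hits += len(truth & found)
    return hits / (k * len(queries))


# Synthetic stand-in data (assumption): 200 documents, 10 queries, 64 dims.
rng = random.Random(0)
FULL_DIM = 64
corpus = [[rng.gauss(0, 1) for _ in range(FULL_DIM)] for _ in range(200)]
queries = [[rng.gauss(0, 1) for _ in range(FULL_DIM)] for _ in range(10)]

results = {
    dim: recall_at_k(queries, corpus, 5, dim, FULL_DIM)
    for dim in (8, 16, 32, 64)
}
for dim, recall in results.items():
    # 4 bytes per component assumes float32 storage.
    print(f"dim={dim:3d}  recall@5={recall:.2f}  bytes/vector={dim * 4}")
```

Plotting recall against bytes per vector for each candidate size gives one way to see where extra dimensions stop paying for themselves.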

Initial Overview:

  1. Benchmarking and Testing
  2. Data Analysis and Dimensionality Reduction
  3. Iterative Experimentation
  4. Resource Constraints Consideration
  5. Use Case Specificity
  6. Model and Algorithm Dependency
  7. Scalability and Growth Planning

Outstanding Tasks:

mlissner commented 8 months ago

Per discussion in #3398, it seems the first thing we need to do is establish an evaluation data set. We'll have an issue for that soon, I think!

mlissner commented 2 weeks ago

@legaltextai I'm curious about your thoughts here. Should we break pieces of this issue off, or is this stuff mostly resolved at this point? Feels like the best thing to do is figure out how to close this issue by capturing it in other, more specific ones?

legaltextai commented 1 week ago

If the ultimate goal here is to understand how much additional storage the implementation of semantic search will require, the answer will depend on: 1) whether we are happy with our prototype; 2) since we know how much storage the 50k dataset (vectorized + metadata) takes, we can multiply that by about 200 to get to about 10 million cases. This is a rough estimate, of course.
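The scaling arithmetic can be sketched as below. The 50k and 10 million case counts come from the estimate above; the measured sample size (`SAMPLE_BYTES`) is a made-up placeholder, not a real figure from the prototype, and the function name is hypothetical. The sketch also assumes storage grows linearly with the number of cases, which ignores index overhead and differences in document length.

```python
def estimate_storage_gb(sample_bytes, sample_cases=50_000, target_cases=10_000_000):
    """Linearly scale a measured sample size up to the target case count.

    Returns (estimated size in GB, scale factor)."""
    scale = target_cases / sample_cases
    return sample_bytes * scale / 1e9, scale


# Placeholder (assumption): pretend the 50k-case sample, vectors plus
# metadata, takes 2 GB on disk. Replace with the real measurement.
SAMPLE_BYTES = 2 * 10**9

total_gb, scale = estimate_storage_gb(SAMPLE_BYTES)
print(f"scale factor: {scale:.0f}x, estimated storage: {total_gb:.0f} GB")
```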

@n-shamsi Nina, I am impressed by how thorough you are in your analysis. Thank you. Here is our small prototype, based on SCOTUS cases. Would appreciate your thoughts. Feel free to DM me on Slack.

n-shamsi commented 1 week ago

@legaltextai thanks for sharing, the prototype is really smooth! I'll run it through some basic QA analysis and get back to you

legaltextai commented 1 week ago

Thank you. Let me know if you'd rather access the API endpoints.

mlissner commented 1 week ago

I think it's about:

With performance taking the backseat to relevance most of the time, since you can usually fix performance.

But I'm not sure we need to keep this one open. I think iteratively building our system we'll be confronting these things whether we intend to or not. What do y'all think?

n-shamsi commented 1 week ago

> But I'm not sure we need to keep this one open. I think iteratively building our system we'll be confronting these things whether we intend to or not. What do y'all think?

Agreed, it can be broken into smaller tasks.