Empirically Determine an Appropriate Vector Size

n-shamsi commented 8 months ago

Description:

Related to https://github.com/freelawproject/courtlistener/issues/3398 and https://github.com/freelawproject/courtlistener/issues/3489. Empirically determine an appropriate vector size for a vector DB using the following strategies (this list is not exhaustive):

Benchmarking and Testing
Data Analysis and Dimensionality Reduction
Iterative Experimentation
Resource Constraints Consideration
Use Case Specificity
Model and Algorithm Dependency
Scalability and Growth Planning

Initial Overview:

Benchmarking and Testing

Performance Metrics: Evaluate key performance metrics such as query response time, throughput, and resource usage (CPU, memory) with different vector sizes.
Accuracy Metrics: Measure the accuracy of search results with different vector sizes using metrics like precision, recall, or F1 score.
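The accuracy metrics above could be computed per query along these lines. This is only a sketch: the function name, the document IDs, and the convention of dividing precision by k (even when fewer than k results come back) are assumptions for illustration, not something the project has settled on.

```python
from typing import Sequence, Set, Tuple


def precision_recall_f1_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float, float]:
    """Return (precision@k, recall@k, F1@k) for a single query.

    `retrieved` is the ranked list of result IDs from the vector DB;
    `relevant` is the set of IDs judged relevant in the evaluation set.
    """
    top_k = list(retrieved)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging these values over every query in an evaluation set, once per candidate vector size, gives one comparable number per size.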

Data Analysis and Dimensionality Reduction

Intrinsic Dimensionality: Analyze the intrinsic dimensionality of your data. Some datasets may naturally cluster in lower-dimensional spaces.
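Dimensionality Reduction Techniques: Use techniques like Principal Component Analysis (PCA) or autoencoders to experiment with reduced dimensions and observe the impact on performance and accuracy. t-Distributed Stochastic Neighbor Embedding (t-SNE) is useful for visualizing cluster structure, but standard t-SNE does not learn a mapping that can be applied to new vectors, so it is less suited to producing embeddings for search.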
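One rough way to probe intrinsic dimensionality with PCA is to count how many principal components are needed to explain a target share of the variance. The sketch below assumes NumPy is available and that the embeddings fit in memory as a 2-D array; the function name and the 95% default are illustrative choices, not recommendations.

```python
import numpy as np


def dims_for_variance(embeddings: np.ndarray, target: float = 0.95) -> int:
    """Number of principal components needed to reach `target` explained variance."""
    # Center the data so the singular values reflect variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    # First index where the cumulative share reaches the target, as a count.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))
```

If the count comes out far below the model's native dimension, that is a hint (not proof) that smaller vectors may be worth testing against the relevance metrics.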

Iterative Experimentation

Start with Model Defaults: Begin with the default vector size provided by the model or a common size used in similar applications.
Incremental Adjustments: Gradually increase or decrease the vector size in controlled experiments, monitoring the changes in performance and accuracy.
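Once each vector size has a relevance score from a controlled experiment, picking a size can be reduced to a small rule: take the smallest size whose score is within some tolerance of the best one. The helper below is a hypothetical sketch of that rule; the scores in the example are placeholders, not measurements from this project.

```python
from typing import Dict


def smallest_adequate_dimension(
    scores_by_dim: Dict[int, float], tolerance: float = 0.01
) -> int:
    """Return the smallest vector size scoring within `tolerance` of the best."""
    if not scores_by_dim:
        raise ValueError("no experiments recorded")
    best = max(scores_by_dim.values())
    return min(
        dim for dim, score in scores_by_dim.items() if score >= best - tolerance
    )


# Placeholder numbers for illustration only; they are not measurements.
example_scores = {128: 0.71, 256: 0.80, 512: 0.845, 768: 0.85}
print(smallest_adequate_dimension(example_scores))
```

The tolerance makes the relevance-versus-cost trade-off explicit: a tolerance of 0 always picks the best-scoring size, while a larger tolerance favors smaller vectors.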

Resource Constraints Consideration

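Memory and Storage: Test how vector sizes impact memory consumption and storage requirements. Raw vector storage grows linearly with the number of dimensions (a float32 vector uses 4 bytes per dimension), and index structures add overhead on top of that.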
Computational Complexity: Evaluate the computational cost, especially in terms of CPU or GPU usage, for different vector sizes.
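A back-of-the-envelope figure for raw vector storage is number of vectors × dimensions × bytes per value. The sketch below assumes float32 values (4 bytes) and ignores index overhead and metadata; the candidate sizes in the loop are arbitrary examples, and the 10 million figure is only a round number for the size of the corpus.

```python
def raw_vector_bytes(
    num_vectors: int, dimensions: int, bytes_per_value: int = 4
) -> int:
    """Raw storage for the vectors alone, excluding index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value


# Compare a few candidate sizes for a corpus of 10 million vectors.
for dims in (256, 768, 1536):
    size = raw_vector_bytes(10_000_000, dims)
    print(f"{dims:>5} dims: {size / 2**30:.1f} GiB")
```

If documents are split into chunks before embedding, `num_vectors` is the number of chunks rather than the number of cases, which can raise the total considerably.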

Use Case Specificity

Nature of Data: The type of data (text, images, audio) can influence the optimal vector size. Some modalities might require more dimensions to capture nuances.
Search Requirements: Consider the specificity and complexity of the search queries. More complex queries might benefit from larger vectors.

Model and Algorithm Dependency

Alignment with ML Models: The vector size must match the output dimension of the machine learning model used for generating embeddings, unless a reduction step is applied afterward. Experiment with different model architectures to find the best fit.
Algorithm Compatibility: Some search algorithms may perform better with certain vector sizes or have specific requirements.

Scalability and Growth Planning

Future-Proofing: Consider how the chosen vector size will scale with the growth of the dataset and the evolving needs of the application.

Outstanding Tasks:

[x] Select a vector DB and sample dataset

mlissner commented 8 months ago

Per discussion in #3398, it seems the first thing we need to do is establish an evaluation data set. We'll have an issue for that soon, I think!

mlissner commented 2 weeks ago

@legaltextai I'm curious about your thoughts here. Should we break pieces of this issue off, or is this stuff mostly resolved at this point? Feels like the best thing to do is figure out how to close this issue by capturing it in other, more specific ones?

legaltextai commented 1 week ago

If the ultimate goal here is to understand how much additional storage the implementation of semantic search will require, the answer will depend on: 1) whether we are happy with our prototype; 2) since we know how much storage the 50k dataset (vectorized + metadata) takes, we can multiply by about 200 to get to about 10 million cases. This is a rough estimate, of course.
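The multiply-by-200 estimate is a linear extrapolation from the sample to the full corpus. A minimal sketch, assuming storage scales linearly with the number of cases (index overhead may not) and using a hypothetical function name; the measured size of the 50k dataset has to be supplied, since it is not stated here:

```python
def extrapolate_storage(
    sample_bytes: float, sample_count: int, target_count: int
) -> float:
    """Scale a measured sample size linearly up to the target corpus size."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return sample_bytes * (target_count / sample_count)


# Scaling factor from 50k cases to 10 million cases.
print(extrapolate_storage(1.0, 50_000, 10_000_000))  # 200.0
```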

@n-shamsi Nina, I am impressed by how thorough you are in your analysis. Thank you. Here is our small prototype based on SCOTUS cases. I would appreciate your thoughts. Feel free to DM me on Slack.

n-shamsi commented 1 week ago

@legaltextai thanks for sharing, the prototype is really smooth! I'll run it through some basic QA analysis and get back to you

legaltextai commented 1 week ago

Thank you. Let me know if you'd rather access the API endpoints.

mlissner commented 1 week ago

I think it's about:

Performance: Storage, CPU, memory; and
Relevance: What kind of vector size do we need for good relevance

With performance taking the backseat to relevance most of the time, since you can usually fix performance.

But I'm not sure we need to keep this one open. I think iteratively building our system we'll be confronting these things whether we intend to or not. What do y'all think?

n-shamsi commented 1 week ago

> But I'm not sure we need to keep this one open. I think iteratively building our system we'll be confronting these things whether we intend to or not. What do y'all think?

Agreed, it can be broken into smaller tasks
