Set up a vector/semantic search database as an API, then in the front end

freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.

https://www.courtlistener.com

Set up a vector/semantic search database as an API, then in the front end #3398

Open mlissner opened 7 months ago

mlissner commented 7 months ago

People just keep asking for this, and it seems like something our customers would use if we had it.

One customer I just talked to wants to do it using Pinecone. Maybe that's an idea. Elastic also seems to make this possible and even has a product page for it: https://www.elastic.co/enterprise-search/vector-search

Maybe it's something we should do, but I do worry about how much memory it would use.

mlissner commented 7 months ago

Another option that Harvard folks are playing with: https://www.trychroma.com/

mlissner commented 7 months ago

I sent a mass email to a lot of our customers to see if people want to contribute money to building this. We'll see what kind of a response I get.

mlissner commented 7 months ago

One of our clients has done this with our data and he reports the following:

Llama Index makes things a lot easier...

Beyond just getting the data into the vector db, the following optimization activities may need to be conducted:

Building an evaluation dataset to measure retrieval performance

Tune hyperparameters like chunk size and included metadata

Evaluate more advanced chunking strategies (e.g. SentenceWindow chunking)

Fine tune an open source embedding model to squeeze out additional performance
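To make the chunking item above concrete, here is a minimal sketch of sentence-window chunking in plain Python. It is not LlamaIndex's implementation; the names `SentenceChunk`, `split_sentences`, and `sentence_window_chunks` are hypothetical, and the regex splitter is an assumption that would break on legal citations such as "410 U.S. 113". The idea is that each sentence is embedded on its own, while a wider window of neighboring sentences is kept alongside it to hand back as context at query time.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class SentenceChunk:
    sentence: str  # the text that would be embedded
    window: str    # the sentence plus its neighbors, returned as context


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # Real opinions need a citation-aware splitter.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_window_chunks(text: str, window_size: int = 1) -> list[SentenceChunk]:
    # window_size is the number of neighboring sentences kept on each side;
    # it is one of the hyperparameters that would need tuning.
    sentences = split_sentences(text)
    chunks = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        chunks.append(SentenceChunk(sentence, " ".join(sentences[start:end])))
    return chunks
```

Calling `sentence_window_chunks(opinion_text, window_size=2)` would produce one chunk per sentence, each carrying up to two sentences of context on either side.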

mlissner commented 7 months ago

I talked to a couple of contacts today about this. A couple notes:

Once you chunk the data and embed it, that's about 600GB.
Qdrant has quantization out of the box, which converts large floats into smaller representations so they take up less memory and cost less. I don't know if we can do this with Elastic.
LlamaIndex is apparently very helpful, and one guy recommended it above LangChain.
Getting the vector size correct is really hard and really important. We might need to try different things and see what performs best.
We need an evaluation dataset (with positive and negative hits) to know if our tweaks are working. This feels like a really good academic exercise, or it could also be a great one for FLP to lead and release as leaders in the field.
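As a rough illustration of the quantization note above, this is a minimal sketch of generic scalar quantization: each float is mapped to a one-byte code using the vector's own minimum and maximum. It is not taken from Qdrant or Elastic, and the function names `quantize` and `dequantize` are hypothetical. It only shows why the memory footprint drops, roughly from four bytes per dimension for 32-bit floats to one.

```python
from array import array


def quantize(vector):
    """Map floats to unsigned 8-bit codes using the vector's min and max."""
    lo, hi = min(vector), max(vector)
    # Fall back to 1.0 so a constant vector does not divide by zero.
    scale = (hi - lo) / 255 or 1.0
    codes = array("B", (min(255, round((x - lo) / scale)) for x in vector))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate the original floats; the error is at most about scale / 2."""
    return [lo + c * scale for c in codes]


vector = [-0.5, 0.0, 0.25, 1.0]
codes, lo, scale = quantize(vector)
restored = dequantize(codes, lo, scale)

# One byte per dimension instead of four for a 32-bit float.
float_bytes = array("f", vector).itemsize * len(vector)
code_bytes = codes.itemsize * len(codes)
```

The trade-off is precision: the restored values are close to the originals but not identical, which is why an evaluation dataset matters for checking whether retrieval quality survives the compression.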

n-shamsi commented 7 months ago

mlissner commented 7 months ago

These analyses are great, thanks @n-shamsi. What would you suggest as the next step?

n-shamsi commented 7 months ago

These analyses are great, thanks @n-shamsi. What would you suggest as the next step?

Thank you! I made some tasks on the issues. I think they're good for follow-ups, but I am open to suggestions! Here's a summary:

https://github.com/freelawproject/courtlistener/issues/3489

[ ] What's the relationship between data type and search accuracy for a given DB?
[ ] What technical implementations optimize query performance for each DB? Are they suitable for our data?
[ ] What specific ML integration is used for a given DB, and is it useful for our data?

https://github.com/freelawproject/courtlistener/issues/3490

[ ] Select a vector DB and sample dataset

I think I'll start with the one on https://github.com/freelawproject/courtlistener/issues/3490 because it will help answer the three on https://github.com/freelawproject/courtlistener/issues/3489 more effectively.

vonwooding commented 7 months ago

Re: Technical Implementations to Optimize Query Performance

Thank you, @n-shamsi, for your research and insights.

Exploring Hypothetical Document Embeddings (HyDE) might benefit the project. HyDE transforms a user's query into a hypothetical document, which is then compared for similarity with the existing document set, rather than directly comparing the query itself.

This approach could be particularly useful for Free Law Project users who may not always have the legal expertise to frame precise queries.

While the full applicability and scalability of HyDE in our context remain to be assessed, it could provide a promising direction for our query optimization efforts.

For more information:

Repo: https://github.com/texttron/hyde
Paper: https://arxiv.org/abs/2212.10496
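For readers unfamiliar with the flow, here is a minimal sketch of the HyDE idea described above, under heavy assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `fake_llm` is a hypothetical placeholder for an LLM call, and none of these names come from the texttron/hyde repo. The sketch also generates a single hypothetical document, which is a simplification. The point is only the order of operations: generate a hypothetical document from the query, embed that document, and rank the real documents against it.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (assumption; a real system uses a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(query, documents, generate_hypothetical, top_k=3):
    # Step 1: turn the lay query into a hypothetical document.
    hypothetical = generate_hypothetical(query)
    # Step 2: embed the hypothetical document instead of the raw query.
    query_vec = embed(hypothetical)
    # Step 3: rank the real documents by similarity to it.
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def fake_llm(query):
    # Hypothetical stand-in for an LLM that drafts a plausible passage.
    return (
        "The officers conducted a warrantless search of the vehicle, "
        "violating the Fourth Amendment."
    )
```

With this shape, a query like "can cops look in my car" shares almost no vocabulary with an opinion, but the generated passage does, which is the gap HyDE is meant to close for non-expert users.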

n-shamsi commented 7 months ago

Exploring Hypothetical Document Embeddings (HyDE) might benefit the project.

HyDE sounds awesome. I think we just need a sample dataset to get started. I am also interested in suggestions from other followers of this issue: is there a particular solution that clients might be interested in having tested? I think Llama is on the shortlist because of that.

@mlissner is there a sample dataset we can use currently, or should @vonwooding and I create one?

vonwooding commented 7 months ago

fwiw I might suggest SCOTUS Fourth Amendment search/seizure cases. There are probably 600-700 total opinions. Plus, it's a familiar and important issue for practitioners and the public alike.

mlissner commented 7 months ago

No, there's no evaluation data set, but this came up yesterday when I talked to somebody else. I think there's an opportunity to create a really nice evaluation data set. Seems like we should spin that off into its own issue and discuss it there? I'll invite the person I talked to yesterday to chime in?
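As a sketch of what such an evaluation data set could feed into, the snippet below scores a search function against labeled queries. Everything here is an assumption for illustration: the record layout (`query`, `positive`, `negative`), the function names, and the choice of recall@k and mean reciprocal rank as metrics are not something the thread has settled on.

```python
def recall_at_k(ranked_ids, positive_ids, k):
    """Fraction of the known-relevant documents that appear in the top k."""
    if not positive_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in positive_ids)
    return hits / len(positive_ids)


def reciprocal_rank(ranked_ids, positive_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in positive_ids:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, search, k=5):
    """Average the metrics over every labeled query in the evaluation set."""
    recalls, rrs, negatives = [], [], []
    for example in eval_set:
        ranked = search(example["query"])
        recalls.append(recall_at_k(ranked, example["positive"], k))
        rrs.append(reciprocal_rank(ranked, example["positive"]))
        # Known-bad documents that still made it into the top k.
        negatives.append(sum(1 for d in ranked[:k] if d in example["negative"]))
    n = len(eval_set) or 1
    return {
        "recall_at_k": sum(recalls) / n,
        "mrr": sum(rrs) / n,
        "negatives_in_top_k": sum(negatives) / n,
    }
```

Here `search` is any callable that takes a query string and returns a ranked list of document IDs, so the same evaluation set could be reused to compare different vector databases, chunk sizes, or embedding models.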

mlissner commented 6 months ago

I'm reliably (?) informed this leaderboard shows the best models to use for vectorizing:

https://huggingface.co/spaces/mteb/leaderboard

I don't know how true that is, but it's what I hear on the street!

n-shamsi commented 6 months ago

Picking up this issue again, we could start with a sample from here: https://github.com/freelawproject/reporters-db/blob/main/reporters_db/data/laws.json

Are there any particular data features that should be included for evaluation?

mlissner commented 6 months ago

Sorry, I'm not sure what you mean, Nina. Are you suggesting we use the many laws there as the evaluation data set? Should we create a fresh issue for discussing that?