Open mlissner opened 11 months ago
Another option that Harvard folks are playing with: https://www.trychroma.com/
I sent a mass email to a lot of our customers to see if people want to contribute money to building this. We'll see what kind of a response I get.
One of our clients has done this with our data and he reports the following:
Llama Index makes things a lot easier...
Beyond just getting the data into the vector db, the following optimization activities may need to be conducted:
- Build an evaluation dataset to measure retrieval performance
- Tune hyperparameters like chunk size and included metadata
- Evaluate more advanced chunking strategies (e.g. SentenceWindow chunking)
- Fine-tune an open-source embedding model to squeeze out additional performance
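To make the first item in the list above concrete, here is a minimal sketch of how retrieval performance could be scored once an evaluation dataset exists. It assumes the dataset is a mapping from query to the set of relevant document IDs, and that each retrieval run returns a ranked list of IDs. The names `recall_at_k`, `reciprocal_rank`, and `evaluate` are illustrative, not part of CourtListener or any library.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    runs: Dict[str, List[str]],    # query -> ranked doc IDs from the retriever
    labels: Dict[str, Set[str]],   # query -> relevant doc IDs (hypothetical eval set)
    k: int = 5,
) -> Dict[str, float]:
    """Average recall@k and mean reciprocal rank over every labeled query."""
    if not labels:
        return {"recall@k": 0.0, "mrr": 0.0}
    recalls = [recall_at_k(runs.get(q, []), rel, k) for q, rel in labels.items()]
    rrs = [reciprocal_rank(runs.get(q, []), rel) for q, rel in labels.items()]
    return {"recall@k": sum(recalls) / len(labels), "mrr": sum(rrs) / len(labels)}
```

Running the same `evaluate` call after each change to chunk size, metadata, or chunking strategy would give comparable numbers for the tuning steps in the rest of the list.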
I talked to a couple of contacts today about this. A couple notes:
These analyses are great, thanks @n-shamsi. What would you suggest as the next step?
Thank you! I created some tasks in the issues below. I think they're good follow-ups, but I'm open to suggestions! Here's a summary:
https://github.com/freelawproject/courtlistener/issues/3489
https://github.com/freelawproject/courtlistener/issues/3490
I think I'll start with the one on https://github.com/freelawproject/courtlistener/issues/3490 because it will help answer the three on https://github.com/freelawproject/courtlistener/issues/3489 more effectively.
Re: Technical Implementations to Optimize Query Performance
Thank you, @n-shamsi, for your research and insights.
Exploring Hypothetical Document Embeddings (HyDE) might benefit the project. HyDE transforms a user's query into a hypothetical document, which is then compared for similarity with the existing document set, rather than directly comparing the query itself.
This approach could be particularly useful for Free Law Project users who may not always have the legal expertise to frame precise queries.
While the full applicability and scalability of HyDE in our context remain to be assessed, it could provide a promising direction for our query optimization efforts.
For more information:
Repo: https://github.com/texttron/hyde
Paper: https://arxiv.org/abs/2212.10496
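The HyDE flow described above can be sketched roughly as follows. This is a simplified illustration, not the code from the linked repo: `generate` stands in for whatever LLM call writes the hypothetical document, `embed` stands in for the embedding model, and `doc_vectors` is assumed to hold precomputed embeddings keyed by document ID. All three are hypothetical placeholders. The sketch generates a single hypothetical document for brevity.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    generate: Callable[[str], str],      # assumption: LLM that drafts a hypothetical answer
    embed: Callable[[str], Vector],      # assumption: embedding model
    doc_vectors: Dict[str, Vector],      # assumption: precomputed document embeddings
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank documents by similarity to a generated hypothetical document."""
    # Step 1: turn the user's query into a hypothetical document.
    hypothetical = generate(query)
    # Step 2: embed the hypothetical document instead of the raw query.
    query_vec = embed(hypothetical)
    # Step 3: compare that embedding against the existing document set.
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because `generate` and `embed` are passed in, the same function could be evaluated against a plain query-embedding baseline by passing `generate=lambda q: q`.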
HyDE sounds awesome. I think we just need a sample dataset to get started. I am also interested in suggestions from other followers of this issue: is there a particular solution that clients would like to see tested? I think Llama is on the shortlist for that reason.
@mlissner is there a sample dataset we can use currently, or should @vonwooding and I create one?
fwiw I might suggest SCOTUS Fourth Amendment search/seizure cases. There are probably 600-700 total opinions. Plus, it's a familiar and important issue for practitioners and the public alike.
No, there's no evaluation data set, but this came up yesterday when I talked to somebody else. I think there's an opportunity to create a really nice evaluation data set. Seems like we should spin that off into its own issue and discuss it there? I'll invite the person I talked to yesterday to chime in?
I'm reliably (?) informed this leaderboard shows the best models to use for vectorizing:
https://huggingface.co/spaces/mteb/leaderboard
I don't know how true that is, but it's what I hear on the street!
Picking up this issue again, we could start with a sample from here: https://github.com/freelawproject/reporters-db/blob/main/reporters_db/data/laws.json
Are there any particular data features that should be included for evaluation?
Sorry, I'm not sure what you mean, Nina. Are you suggesting we use the many laws there as the evaluation data set? Should we create a fresh issue for discussing that?
People just keep asking for this, and it seems like something our customers would use if we had it.
One customer I just talked to wants to do it using Pinecone. Maybe that's an idea. Elastic also seems to make this possible and even has a product page for it: https://www.elastic.co/enterprise-search/vector-search
Maybe it's something we should do, but I do worry about how much memory it would use.
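For the memory question, a rough back-of-the-envelope estimate of raw vector storage is chunks × dimensions × bytes per value. The sketch below assumes uncompressed float32 vectors and ignores index overhead, replicas, and stored metadata, so real usage in any particular vector store would likely be higher. The function names and the example figures are hypothetical and are not measurements of our data.

```python
def estimate_vector_bytes(num_chunks: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the embeddings alone (float32 = 4 bytes per value)."""
    return num_chunks * dimensions * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to gibibytes."""
    return num_bytes / 2**30


# Illustrative numbers only: 1,000,000 chunks embedded at 768 dimensions.
example_bytes = estimate_vector_bytes(num_chunks=1_000_000, dimensions=768)
print(f"{to_gib(example_bytes):.2f} GiB of raw float32 vectors")
```

Plugging in the actual chunk count and the dimension of whichever embedding model is chosen would give a lower bound to compare against available memory.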