freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Evaluation of Semantic Search Variables for Case Data Retrieval #4277

Open legaltextai opened 2 months ago

legaltextai commented 2 months ago

Proposal: Assess the critical variables affecting semantic search quality on a legal corpus before implementing semantic search on CL.

Key Variables:

Initial Focus: Start with the embedding step. Identify a suitable open-source embedding model.

Project Scope:

Dataset: ~50K Supreme Court opinions (>200 words each)
Text chunking: adapt to the embedding model's capacity (512-32K+ tokens)
Models: evaluate 6-8 open-source and 1-2 commercial embedding models
Indexing: HNSW, cosine similarity (m=16, ef_construction=64); optional hybrid search
Reranking: baseline (none); optional Cohere/BGE-reranker comparison
Evaluation: subjective expert assessment, RAGAS, LlamaIndex/LangChain frameworks
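
For illustration, a minimal sketch of the indexing setup above (HNSW, cosine, m=16, ef_construction=64). It uses hnswlib and the multi-qa-mpnet model as stand-ins; in CL this would ultimately be handled by Elasticsearch, so treat it as a standalone example, not the planned implementation.

# Hypothetical standalone example; the model choice and chunk texts are placeholders.
import hnswlib
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-dot-v1")
chunks = ["First opinion chunk ...", "Second opinion chunk ..."]  # placeholder chunks
embeddings = model.encode(chunks, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])  # 768 dims for this model
index.init_index(max_elements=len(chunks), M=16, ef_construction=64)
index.add_items(embeddings, ids=np.arange(len(chunks)))

query = model.encode(["cases dealing with the second amendment"], normalize_embeddings=True)
labels, distances = index.knn_query(query, k=2)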

Resources: Initial models are available here; expansion forthcoming.

Questions and comments are very welcome.

mlissner commented 2 months ago

Very cool. Thanks for getting this conversation started. One question I have, before we get too far, is to what extent using Elastic will limit these options. I'm just thinking that it's possible Elastic will only support some of this, so we might want to identify that before we explore things that won't work?

legaltextai commented 2 months ago

Seems like they are flexible. At least they directly mention the SentenceTransformers framework, which includes lots of models.

legaltextai commented 2 months ago

What makes you so loyal to Elastic? What is their advantage vs. e.g. Postgres for vector storage or keyword search?

mlissner commented 2 months ago

What I like about Elastic:

Things like that. Basically, it's what we have and use for search, so it seems like the right place to add more functionality?

legaltextai commented 2 months ago

I take it you host ES on your own servers and the cost is not an issue. I think it's OK to use it for vector / hybrid search. Plus, LlamaIndex and LangChain have libraries to integrate with ES as a vector store. https://docs.llamaindex.ai/en/stable/examples/vector_stores/Elasticsearch_demo/

mlissner commented 2 months ago

We do, yes. We have a k8s cluster for Elastic with too many servers in it. :)

Great news!

legaltextai commented 2 months ago

Update on the project:

For the subjective assessment, i.e. to get your own feel for which model performs better, feel free to head here to find Supreme Court cases you know very well.

For the objective assessment, using the hit rate and MRR metrics, these are the numbers I got from my evaluation.

[Screenshot: hit rate and MRR results for the evaluated embedding models]

I used these llamaindex libraries. The train and eval datasets were prepared from 10 random Supreme Court cases.

My personal preference is a small but mighty sentence-transformers/multi-qa-mpnet-base-dot-v1 model.

What I found interesting: if you fine-tune a model, even on a small dataset like 10 cases, there is a noticeable improvement in performance (local:test_model_3 is my fine-tuned model based on base-dot-v1).

Questions and comments are welcome.

Cheers,

PS. Hit Rate: Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

Mean Reciprocal Rank (MRR): For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on.
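
For concreteness, a small sketch of how these two metrics are computed from ranked retrieval results (toy data; the LlamaIndex evaluators report the same numbers):

def hit_rate(results, k=10):
    # fraction of queries whose relevant doc appears in the top-k results
    return sum(1 for r in results if r["relevant_id"] in r["retrieved_ids"][:k]) / len(results)

def mrr(results):
    # average of 1/rank of the first relevant doc (0 if it was not retrieved at all)
    total = 0.0
    for r in results:
        if r["relevant_id"] in r["retrieved_ids"]:
            total += 1.0 / (r["retrieved_ids"].index(r["relevant_id"]) + 1)
    return total / len(results)

results = [
    {"relevant_id": "chunk_1", "retrieved_ids": ["chunk_1", "chunk_7"]},  # found at rank 1
    {"relevant_id": "chunk_4", "retrieved_ids": ["chunk_9", "chunk_4"]},  # found at rank 2
]
print(hit_rate(results, k=10), mrr(results))  # 1.0 0.75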

mlissner commented 2 months ago

I'm not sure I know SCOTUS law well enough, but I just shared this across X, BlueSky and Threads. I'll report back if we get any responses. Hopefully those will be useful, but I suspect the better way to do it is going to be objective measures like the ones you've begun using.

Thanks for this. It's really interesting.

legaltextai commented 2 months ago

Great. Thanks.

I just found out that Unstructured, in my view one of the best parsing libraries, uses sentence-transformers/multi-qa-mpnet-base-dot-v1 for their 'similarity' chunking. Another +1 for this mighty embedding model. As I mentioned, it can also be further improved by fine-tuning it prior to embedding the whole corpus.

I also tested voyage-legal-2, a highly specialized embedding model for legal texts. Its hit_rate and MRR are higher than any other model I tested. But it is closed source and not free. I would still go with open source.

halfprice06 commented 2 months ago

Pinging here but I also responded to the X thread on this.

I'm a practicing attorney and have a different perspective on how to evaluate the embeddings models.

I think a good way to subjectively (maybe objectively?) test these models is to use the "Questions Presented" from pending (or maybe past) Supreme Court cases as query input and then measure whether they find the cases cited by the parties and/or by SCOTUS itself in the briefs.

I have been using Sonnet 3.5 as a judge to determine what are the "most important" cases cited in Petitioner and Respondent briefs and then comparing that list of cases against those retrieved by the embeddings models.

To me this is a very real world benchmark for evaluating these models.

Just a thought! Happy to contribute to this btw; a rising tide lifts all boats. I'm working on a RAG app that uses the CourtListener DB as the starting point, so happy to give back!

honeykjoule commented 2 months ago

cool project!

lawyers tend to use words very carefully, so i wonder what benefit semantic search has over simple keyword matching.

a while back i made embeddings of the parenthetical summary dataset and retrieved them by similarity to the user's input.

i showed my app to some lawyer colleagues and found that they ended up searching for very specific legal terms.

the top results usually included the exact words they searched for.

similarity is cool because you can search "labrador" and find case law related to "canine," but i doubt the usefulness of this for lawyers who are searching for specific terms of art.

the next step for my app was to add filters for jurisdiction and date, but i stopped working on it because i thought that semantic search wouldn't be that useful for lawyers who are looking for pinpoints in semantic space but not necessarily near neighbors.

the project has still been simmering in the back of my mind, so i am excited to see someone else working on nlp + court listener

halfprice06 commented 2 months ago

(Quoting @honeykjoule's comment above.)

So, I can tell you that semantic similarity can be quite useful for attorneys, and it's one of the most requested features I've gotten with regard to the web app I'm building. But the attorney user has to learn how to use it for it to be useful, which is why I don't think semantic search can ever be the default search mode.

Casetext built this first for attorneys. They call it parallel search. Check it out:

https://parallelsearch.casetext.com/

From what I can tell, they chunked by sentences and made embeddings. The app instructs the user to search for a sentence that you don't know for sure exists in a legal opinion somewhere, but that you wish it did.

So the user types in a conclusion they wish existed in a legal opinion, and the app does semantic similarity to find similar sentences. This works well if there is a case that actually says that somewhere, which is why when it works it feels like magic, and attorneys are constantly chasing that high!

A couple of attorneys in my firm use this all the time because they have taken the time to understand how it works. Of course the results are quite poor if you just type in a few keywords the way attorneys normally search for caselaw.

I think that, outside of LLM-based agentic search (where you have multiple loops back to the well to run more searches while an LLM does some kind of analysis in between), some kind of fine-tuned embeddings model with a hybrid search algo that combines cosine similarity and BM25, plus a re-ranking step, is going to be SOTA for single-"shot" search.

BM25 is just too good in my experience.

For my RAG app we do embeddings retrieval and BM25, evaluate the results with separate LLM calls, and then combine all of the results into one final answer for the user. Expensive, but I've found this maximally increases the footprint of the search and unearths relevant results.
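
For reference, one common way to fuse BM25 and embedding rankings is reciprocal rank fusion. This is only an illustration, not the exact pipeline described above, and the document ids are made up:

from collections import defaultdict

def rrf_fuse(rankings, k=60):
    # rankings: list of ranked lists of document ids, e.g. [bm25_ids, vector_ids]
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ids = ["op_123", "op_456", "op_789"]    # hypothetical keyword results
vector_ids = ["op_456", "op_999", "op_123"]  # hypothetical embedding results
fused = rrf_fuse([bm25_ids, vector_ids])
# the fused candidates would then go to a reranker (e.g. Cohere or a BGE reranker)
print(fused)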

halfprice06 commented 2 months ago

For what it's worth, I also think Casetext nailed the UI for choosing between semantic and keyword searching, and CL may want to take inspiration.

I'm certainly copying parts of it.

https://casetext.com/cases

This is their default search interface that lets you select between "semantic" and keyword. Clicking inside the search box opens a drop-down that lets the user select and gives some instructions on how the search works.

legaltextai commented 2 months ago

Thank you for the comments @halfprice06 and @honeykjoule. It's great to have a group of like-minded individuals working on similar projects.

@honeykjoule I could prepare another train and eval set based on your approach. If I understand correctly: go through, say, 100 random SCOTUS cases, extract 'questions' (from the questions presented or equivalent) and then 'cited cases' or 'truth' (the cases where SCOTUS found answers to those questions). Correct? That's a good approach, as we have the ultimate judge (SCOTUS) for retrieval.

Semantic search alone can be useful exactly for the reason you described. "A dog bit a child" may retrieve "a cat attacked a 3-year-old boy". Yes, there are cases about vicious cats :-) Depending on the facts, of course, the latter could be useful for analyzing the scope of a dog owner's liability too.

@halfprice06 I normally chunk based on the limits of the embedding model. So, if UAE or Cohere have a 512-token limit, I chunk by 300-350 words and then embed each chunk. You can see how it works in practice here (UAE/Large embedding model, HNSW index, cosine similarity, with Cohere reranking). I do agree that adding a keyword search should improve the retrieval: 1) you can implement a hybrid search and play with the weights between semantic and keyword to see what works best for your needs, and 2) you can give users the option to choose semantic or keyword only.
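
A rough sketch of that word-based chunking (the values are the ones mentioned above; the splitter itself is just an assumption about the approach, not the exact code used):

def chunk_by_words(text, chunk_words=325):
    # split an opinion into ~300-350-word pieces so each piece stays under a 512-token limit
    words = text.split()
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]

opinion_text = "..."  # placeholder: full opinion text pulled from CL
chunks = chunk_by_words(opinion_text)  # each chunk is then embedded separately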

@honeykjoule Take a look at how I implemented the jurisdiction filter. Let me know if you are interested and I can share the code.

Just to remind us, this is the first step towards the best retrieval for legal texts: finding a 'good enough' embedding model first, to use as-is or with some fine-tuning. Then we can work on other variables. See my first message above.

If there is interest, we can discuss and agree on the best embedding model and fine-tuning, and then release all embeddings as open source. As @halfprice06 said, 'a rising tide lifts all boats.'

halfprice06 commented 2 months ago

@legaltextai that was me who suggested the idea of looking at SCOTUS questions presented for queries, and yes, that's the general idea.

Since I made the comment I have been thinking about it a little further - I think the dataset needs a good bit of "curation" to be really good, because not every case cited in any particular SCOTUS opinion is actually directly related to the question presented. SCOTUS especially will talk about all kinds of things not directly relevant to the question presented before they finally get to the meat and potatoes.

But some of the cases (a lot?) that they discuss will be directly relevant. So choosing which cases from the SCOTUS opinions to use as 'cited truth' to grade the retrieval of the embeddings models would be important. In other words, I don't think you can blindly choose whichever cases are cited in a SCOTUS opinion as your "correct" results, because some of the chunked references to those cited opinions will absolutely not be semantically similar. Somehow you'll need to determine whether the cited case is directly relevant, semantically, to the question presented.

Does that make sense?

Last thing, something to think about: sometimes SCOTUS cases will for the most part reference the Circuit Courts and not previous SCOTUS opinions, for example when they are trying to resolve a Circuit split. So if you don't have the Circuit Courts embedded yet as part of the retrieval system, it won't be possible to measure the accuracy of the model if you are using Circuit Court cases as "ground truth" cases.

If it makes sense, I'd be open to helping with the curation of such a dataset, as I've been planning to make something like this for a while anyway, but for the Louisiana courts.

legaltextai commented 2 months ago

I think the "SCOTUS approach" might be useful to set the "golden standard" for retrieval.

For embedding model evaluation, just a step back: what the llamaindex train / eval library does is generate questions about each chunk or the whole document (a court opinion in our case), pairing each question with its relevant chunk in the train and eval sets. Then, when you run the evaluation of an embedding model, it checks whether, post-embedding, each question retrieves its paired chunk. That's how you get objective metrics like hit_rate and MRR.
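
Roughly what that flow looks like in code (module paths and signatures vary by LlamaIndex version, so treat this as a sketch rather than a drop-in script; scotus_opinions is a placeholder list of opinion texts):

import asyncio
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.evaluation import generate_question_context_pairs, RetrieverEvaluator

scotus_opinions = ["... opinion text 1 ...", "... opinion text 2 ..."]  # placeholder corpus
docs = [Document(text=t) for t in scotus_opinions]
nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(docs)

# 1) An LLM writes synthetic questions for each chunk -> (question, source chunk) pairs.
#    (An LLM can be passed explicitly via the llm= argument.)
qa_dataset = generate_question_context_pairs(nodes, num_questions_per_chunk=2)

# 2) Build a retriever over the same chunks with the embedding model under test
#    (configured via Settings.embed_model; OpenAI is the default if nothing is set).
retriever = VectorStoreIndex(nodes).as_retriever(similarity_top_k=10)

# 3) Score whether each question retrieves its own source chunk -> hit_rate / MRR.
evaluator = RetrieverEvaluator.from_metric_names(["hit_rate", "mrr"], retriever=retriever)
results = asyncio.run(evaluator.aevaluate_dataset(qa_dataset))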

IMHO, as far as the embedding model is concerned, I would choose a 'good enough' model with possible fine-tuning, and then focus on the best retrieval by playing with hybrid search, query decomposition, reranking, etc. to achieve whatever we decide the golden standard should be.

We can also embed the CFR and USC and release them as open source.

On a slightly different subject, if anyone is interested in building a framework for shepardizing cases, I'd be happy to share ideas and collaborate.

halfprice06 commented 2 months ago

For embedding model evaluation, just a step back: what the llamaindex train / eval library does is generate questions about each chunk or the whole document (a court opinion in our case), pairing each question with its relevant chunk in the train and eval sets. Then, when you run the evaluation of an embedding model, it checks whether, post-embedding, each question retrieves its paired chunk. That's how you get objective metrics like hit_rate and MRR.

OK, that's pretty neat. I was wondering where your dataset for those metrics came from.

As good as some of the LLM models are now, I do wonder whether they are really capable of coming up with good questions that are similar to real user queries.

And I guess that's also a design question too, because evaluating this way is tuning the system specifically for question answering, which is not the only way to do semantic retrieval.

Questions are often semantically dissimilar from their "correct" answer, so it's an interesting way to fine-tune.

With casetext's parallel search, I assume they did not fine tune it for question answering, because the user is not encouraged to input a question.

My app is designed for users to ask questions and I'm constantly surprised that sometimes users are typing in short phrases or keywords instead of forming questions or even full sentences.

But all that being said I think that's a good approach to get to the "good enough" stage.

Btw, I don't know what would be involved in fine-tuning UAE-Large-V1, but in my non-scientific testing it has seemed to do the best for question-answering-type retrieval in my app, of the handful of open models I tried.

Regarding voyage-legal-2, I actually helped contribute to their dataset curation; I recommended a bunch of data and sites for them to scrape in emails with one of the founders. Not surprised to hear it did so well in your tests. They gave me some platform credits for my help, but I never used them because I didn't want to get locked into a closed model.

Also, what about ColBERT?

https://blog.vespa.ai/announcing-long-context-colbert-in-vespa/

https://huggingface.co/colbert-ir/colbertv2.0

Finally, regarding building a case citator/shepardizer, I'm down! Something I was planning to do at some point out of necessity anyway.

legaltextai commented 2 months ago

With casetext's parallel search, I assume they did not fine tune it for question answering, because the user is not encouraged to input a question.

My app is designed for users to ask questions and I'm constantly surprised that sometimes users are typing in short phrases or keywords instead of forming questions or even full sentences.

We can always add a prompt decomposition step to turn a short phrase, keywords, etc into 2-3 relevant questions -> get semantically similar excerpts for each of those -> rerank.
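
A sketch of that decomposition step (llm_complete, semantic_search, and rerank are hypothetical helpers standing in for whatever LLM, retriever, and reranker we end up using):

def decompose_query(user_input: str) -> list[str]:
    # ask an LLM to turn a short phrase or keywords into 2-3 concrete questions
    prompt = (
        "Rewrite the following legal search input as 2-3 specific questions, one per line:\n\n"
        + user_input
    )
    return [q.strip() for q in llm_complete(prompt).splitlines() if q.strip()]

def search_with_decomposition(user_input: str, top_k: int = 10):
    candidates = []
    for question in decompose_query(user_input):               # e.g. "dog bite liability"
        candidates.extend(semantic_search(question, k=top_k))  # excerpts for each question
    return rerank(user_input, candidates)[:top_k]              # one reranked result list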

UAE/Large is good, and I've been using it in my app too, but through this exercise I came across sentence-transformers/multi-qa-mpnet-base-dot-v1 and like it, especially if you fine-tune it on 100-200 court cases.

I have not tried ColBERT yet. Looks interesting.

I've spent $40 on Voyage :-(

On the case citator / shepardizer, I see two options:

1) When the user clicks to shepardize, we crawl through the relevant links in the CourtListener citation map file -> one or two more steps to extract the opinions from CL -> a fast LLM reads and summarizes how the case in question was treated. It won't be fast and may be costly depending on the model, but it's still an option.

2) Extract from CL, say, the 010 and 020 opinions with > 200 words (those are most likely to discuss other cases) -> get the model to go through each opinion and extract whatever information we need, including cases cited and how they were treated. Get the output as a table or JSON. The good thing about this option is that we can use it as an opportunity to turn all those opinions into a structured format, with fields like facts, rule, reasoning, conclusion, cited statutes and cases, etc. I am sure there will be great opportunities down the road to use that data for model training, some form of regression, and academic research. 7-8 mln cases. May take 2-3 weeks. I did it last year to extract facts only.
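
A sketch of what the structured-extraction step in option 2 could look like (the schema fields are the ones listed above; llm_extract is a hypothetical call to whatever fast model we pick, and the treatment labels are only examples):

import json

OPINION_SCHEMA = {
    "facts": "string",
    "rule": "string",
    "reasoning": "string",
    "conclusion": "string",
    "cited_statutes": ["string"],
    "cited_cases": [{"citation": "string", "treatment": "followed | distinguished | overruled | discussed"}],
}

def extract_structured(opinion_text: str) -> dict:
    # ask the model to return JSON matching the schema, then parse it
    prompt = (
        "Extract the following fields from this court opinion and answer as JSON "
        f"matching this schema:\n{json.dumps(OPINION_SCHEMA, indent=2)}\n\n"
        f"Opinion:\n{opinion_text}"
    )
    return json.loads(llm_extract(prompt))  # hypothetical fast-LLM call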

s-taube commented 2 months ago

@legaltextai do we need to keep this open, or has this been captured in other issues?

legaltextai commented 2 months ago

can we keep it open still, under the semantic search project?

mlissner commented 2 months ago

We can, yes, but what are we using this ticket for at this point that's not broken into other issues?

legaltextai commented 2 months ago

i was thinking of it more as reference knowledge, a place to add more test results from embedding models

legaltextai commented 1 month ago

We did some additional evaluation of embedding models. Previous evals of models with up to 512-token context length are presented above.

Here are the findings and recommendations from this round of testing:

Questions and criteria to consider when choosing an embedding model:

Does larger context improve retrieval results?

  1. Some research has been done already: OpenAI embeddings were evaluated with chunk sizes ranging from 128 to 2048 tokens, and 1024 was found to be a sweet spot.

  2. What about larger chunks? Does retrieval improve as the chunk size increases?

The results of my tests are below. The dataset is 30 random SCOTUS cases. Here is the description of the eval metrics.

OpenAI embed model small, 2000 token chunk size.

[Screenshot: eval metrics at a 2000-token chunk size]

Ditto, 4000 token chunk size

[Screenshot: eval metrics at a 4000-token chunk size]

Ditto, 8000 token chunk size (the maximum size for this model)

[Screenshot: eval metrics at an 8000-token chunk size]

Is 8K token context size enough?

8000 tokens ~ 6000 words ~ 12 pages

The average number of words in decisions from the Harvard COLD dataset is 5,588. Thus, 8K tokens / ~6K words should cover most decisions. If we don't want to split the texts, and want to keep one embedding per opinion_id, one approach may be to embed the first ~6K words of each decision, which should be sufficient to capture the semantic meaning of the facts and issues of the case.
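
A sketch of that truncation step (assuming tiktoken's cl100k_base tokenizer, which is the one used by the OpenAI embedding models; the opinion text is a placeholder):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def truncate_to_tokens(text: str, max_tokens: int = 8000) -> str:
    # keep only the first max_tokens tokens so the opinion fits in one embedding call
    return enc.decode(enc.encode(text)[:max_tokens])

opinion_text = "... full opinion text from CL ..."  # placeholder
first_part = truncate_to_tokens(opinion_text)       # embed this once per opinion_id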

OpenAI is relatively cheap, widely available, and offers relatively long context.

How much better is OpenAI's text-embedding-large vs. text-embedding-small on legal data?

The large model has more dimensions (3072 vs. 1536), will take longer to embed, takes more space, and is a little bit worse in performance.

[Screenshot: text-embedding-large vs. text-embedding-small metrics]

I tried two other models with a much larger context, togethercomputer/m2-bert-80M-32k-retrieval and m2-bert-80M-8k-retrieval; they performed OK on general texts (Paul Graham essays) but very poorly on SCOTUS data.

[Screenshot: m2-bert eval metrics]

How much will it cost to embed with the OpenAI small embedding model? 24 bln tokens × $0.020 per 1M tokens = $480, or half of that if we run the embeddings in batches.

My recommendation:

  1. If we want to increase the context length to 8K -> use the OpenAI small embedding model. Possibly do a test first, like we did for the 50K SCOTUS cases.
  2. Otherwise, use the multi-qa-mpnet-base-dot-v1 embedding model that we used for the prototype (512-token context length). We'll need to discuss how best to store the chunks in Elastic and merge them under the same opinion_id.

I am pretty sure that as new, more capable models come to market, we may want to re-embed everything in 6-12 months. For the first version, our search should be 'good enough', and we will continue looking for ways to make it better.

mlissner commented 1 month ago

This all sounds good, thank you, @legaltextai! One last thing I'm trying to understand is how Elastic handles chunked contexts. In the document that you shared, it says:

Another option is to use chunking to divide long texts into smaller fragments. These smaller chunks are added to each document to provide a better representation of the complete text. You can then use a nested query to search over all the individual fragments and retrieve the documents that contain the best-scoring chunks.

Have we evaluated that as a better option than only embedding the first X tokens?

Average number of words in decisions from harvard cold dataset is 5588. Thus, 8K tokens/6K words should cover most of the decisions.

Not to be too contrarian, but I'm not sure that it's safe to go from the average to the median like this. Many documents are extremely long (hundreds of pages), and many others are very small (dozens of words), so I really don't know if the average is a good representation of what "most" docs are like. I'm not sure how this affects our strategy, but it jumped out at me.

legaltextai commented 1 month ago

I was under the impression that we decided not to go with splitting into chunks. I might have misunderstood. Are you OK with using this ES-native embedding task approach?

Forgot to mention: in my version of the Harvard COLD dataset I only include opinions that have > 200 words, so the MIN is 201 and the AVG is 5,588. But I agree, there will be decisions much bigger than that.

mlissner commented 1 month ago

We did decide that, but it was before I saw the way Elastic suggests using nested queries, so now I'm wondering if we can decide on the optimal chunk size based on relevance and embed the entire documents.

legaltextai commented 1 month ago

OK, I'll build an ES index based on our existing search_opinion mapping, just for SCOTUS cases, and experiment with this ES library. OpenAI-small with an 8K context may still be our best option based on the metrics above.

mlissner commented 1 month ago

Sounds great. Alberto might have some ideas about nested queries too. I think he's played with them!

legaltextai commented 3 weeks ago

so, i've played with semantic_text

these are the resources i used https://www.elastic.co/guide/en/elasticsearch/reference/master/infer-service-openai.html https://www.elastic.co/guide/en/elasticsearch/reference/current/semantic-text.html

first, you create an inference endpoint (in our case with the openai embedding model), which creates a task that sends the text to the openai embedding api and inserts the vectors back into the 'embeddings' field

we then update the mapping of our existing index with a new field (e.g. "semantic_text_field"):

"semantic_text_field": {
          "type": "semantic_text",
          "inference_id": "openai-embeddings",
          "model_settings": {
            "task_type": "text_embedding",
            "dimensions": 1536,
            "similarity": "cosine",
            "element_type": "float"
          }
        },

we then copy the content of 'text' (which is the court opinion) from our index into this field, which triggers the embedding process

if the text is longer than 250 words, it breaks it into chunks, with embeddings created for each chunk.

the search is then run with

GET opinion_index/_search
{
  "query": {
    "semantic": {
      "field": "semantic_text_field",
      "query": "cases dealing with second amendment"
    }
  }
}

note, you don't need to embed the query, it's done automatically by sending it to 'semantic_text_field'
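
for reference, roughly what that copy-and-search flow looks like with the official Python client (a sketch: it assumes the 'openai-embeddings' inference endpoint and the semantic_text mapping already exist, and that an update_by_query script is an acceptable way to do the copy; a reindex with a script would work too):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# copy the opinion text into the semantic_text field; each update triggers the
# inference endpoint, which chunks the text and stores embeddings per chunk
es.update_by_query(
    index="opinion_index",
    script={"source": "ctx._source.semantic_text_field = ctx._source.text"},
    query={"exists": {"field": "text"}},
)

# semantic query: the query string is embedded automatically by the same endpoint
resp = es.search(
    index="opinion_index",
    query={"semantic": {"field": "semantic_text_field", "query": "cases dealing with second amendment"}},
)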

There are these limitations on the use of semantic_text

semantic_text field types have the following limitations: semantic_text fields are not currently supported as elements of nested fields. semantic_text fields can’t be defined as multi-fields of another field, nor can they contain other fields as multi-fields.

this semantic_text protocol takes a bit of time to fine-tune to work with our existing index. it also means we insert the embeddings into elastic first, and then copy them into s3, not the other way around.

another option is just to embed everything outside elastic, put it into s3, and then copy it into our existing index.

@albertisfu what's your view , given the above limitations, if we can add this new field and run the embedding task in opinion_index? or do you think we should create a separate index for semantic search only?

albertisfu commented 3 weeks ago

@albertisfu what's your view , given the above limitations, if we can add this new field and run the embedding task in opinion_index? or do you think we should create a separate index for semantic search only?

Regarding the limitations of semantic_text with nested_fields or multi-fields, I don't see an issue since I understand we're only embedding the text field in the opinion_index, correct? This is just a simple TextField.

This brings up another question: Should we include other fields in the infer_field embedding, like case_name, or is only the opinion text relevant for this semantic search use case?

As for whether we should use the same opinion_index or a different index, my only concern with using opinion_index is how embedding generation during the indexing process might affect the overall indexing time. Do you know if the embedding process takes place within the same ingestion pipeline, or can it be done in a separate step? I'm worried the embedding process could slow down indexing, especially when ingesting large batches of documents, such as during scrapes. This issue would become more significant when we start running Opinion Search Alerts using the percolator, where fast indexing will be important for sending real-time alerts.

If the impact on indexing performance isn't too severe, I believe having just one index would be ideal. However, if generating embeddings at indexing time is costly, it might be better to maintain a separate index.

In this tutorial, I saw that they use the reindex API to generate embeddings from the source index into the target index, which also triggers the embedding process. If generating embeddings during indexing is too resource-intensive, this could be an option, allowing us to generate embeddings once a day based on documents that were indexed or updated during the day. However, this would require a special index only for semantic search.

Regarding Mike's question:

Sounds great. Alberto might have some ideas about nested queries too. I think he's played with them!

My understanding of semantic_text usage is that Elasticsearch handles the generation of chunks for both indexing and searching under the hood. There's no need to use custom nested queries to search within the semantic field, which may have been a solution in the past but is now addressed by semantic_text. Right @legaltextai?

legaltextai commented 3 weeks ago

@albertisfu the best would be to give you access to my replica of opinion_index with embeddings and let you play with it to make sure the presence of embeddings in the same index won't affect the main search. i will follow up via slack

mlissner commented 3 weeks ago

The more I think about it, the more it seems like doing this in a separate process is best:

albertisfu commented 3 weeks ago

@legaltextai I did some testing on your replica.

I checked the following indices:

opinion_index
opinion_index_embed
opinion_index_semantic

It seems that in opinion_index, you don't have embeddings generated, correct? In opinion_index_embed, it looks like you only have embeddings, with no other fields indexed. In opinion_index_semantic, you have both the regular fields and the generated embeddings. I tested queries in this index, but they're not working properly due to the following issue:

It appears that the mapping for this index was autogenerated by Elasticsearch when the documents were copied. As a result, the mapping didn't retain the settings required for regular search, as in the production environment.

To fix this, the index mapping should be explicitly defined at the time of index creation, before copying any documents.

opinion_index_mapping.txt

Once this issue is resolved, the semantic_text field can be added.

Aside from this issue with the regular search, I agree with Mike that it might be better to use a different approach than semantic_text, especially since embeddings will be generated automatically, as mentioned in docs:

This means there's no need to create an inference pipeline to generate the embeddings. Using bulk, index, or update APIs will do that for you automatically.

We perform a lot of updates on documents, and if embedding re-computation is triggered on every update for both parent and child documents, it could be quite costly. It would be better to control embedding generation outside of Elasticsearch, on a more reasonable schedule that balances data freshness and cost.

If we don't use semantic_text, could we still use sparse_vector or dense_vector fields to generate vectors outside of elasticsearch, and then update the documents with their embeddings using one of those fields?
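
A sketch of what that could look like (assuming the elasticsearch and openai Python clients; the text_embedding field name and document id are made up, and parent/child documents would also need the routing parameter):

from elasticsearch import Elasticsearch
from openai import OpenAI

es = Elasticsearch("http://localhost:9200")  # placeholder connection
oai = OpenAI()

# 1) add a dense_vector field to the existing mapping (no index-time inference needed)
es.indices.put_mapping(
    index="opinion_index",
    properties={"text_embedding": {"type": "dense_vector", "dims": 1536, "similarity": "cosine"}},
)

# 2) embed outside of Elasticsearch on our own schedule, then write the vector back
opinion_text = "... opinion text ..."  # placeholder
vector = oai.embeddings.create(model="text-embedding-3-small", input=opinion_text).data[0].embedding
es.update(index="opinion_index", id="o_123", doc={"text_embedding": vector})

# 3) query with a kNN search against the stored vectors
query_vec = oai.embeddings.create(model="text-embedding-3-small", input="second amendment cases").data[0].embedding
resp = es.search(
    index="opinion_index",
    knn={"field": "text_embedding", "query_vector": query_vec, "k": 10, "num_candidates": 100},
)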

legaltextai commented 3 weeks ago

Thanks Alberto. What do you get when you run:

GET opinion_index/_search
{
  "query": {
    "semantic": {
      "field": "semantic_text_field",
      "query": "cases dealing with second amendment"
    }
  }
}

I have embeddings there and I can search semantically

  "took": 333,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 15,
      "relation": "eq"
    },
    "max_score": 0.7372532,
    "hits": [
      {
        "_index": "opinion_index",
        "_id": "o_8986128",
        "_score": 0.7372532,
        "_routing": "8993865",
        "_source": {
          "court_citation_string": "SCOTUS",
          "syllabus": "",
          "lexisCite": "",
          "court_id": "scotus",
          "local_path": null,
          "absolute_url": "/opinion/8993865/miller-v-united-states/",
          "type": "lead-opinion",
          "suitNature": "",
          "dateReargumentDenied": null,
          "cluster_id": 8993865,
          "dateArgued": null,
          "panel_names": [],
          "panel_ids": [],
          "neutralCite": "",
          "cluster_child": {
            "parent": 8993865,
            "name": "opinion"
          },
          "download_url": null,
          "cites": [],
          "procedural_history": "",
          "caseName": "Miller v. United States",
          "semantic_text_field": {
            "text": "\nC. A. 9th Cir. Reported below: 431 F. 2d 655;\nCt. App. D. C. Reported below: 277 A. 2d 477;\nC. A. 10th Cir. Reported below: 445 F. 2d 945;\nC. A. 9th Cir. Reported below: 455 F. 2d 899; and\nC. A. 2d Cir. Reported below: 475 F. 2d 1393. Certiorari granted, judgments vacated, and cases remanded to the respective United States Courts of Appeals for further consideration in light of Miller v. California, ante, p. 15; Paris Adult Theatre I v. Slaton, ante, p. 49; *914Kaplan v. California, ante, p. 115; United States v. 12 200-ft. Reels Film, ante, p. 123; United States v. Orito, ante, p. 139; Heller v. New York, ante, p. 483; Roaden v. Kentucky, ante, p. 496; and Alexander v. Virginia, ante, p. 836.\nMe. Justice Douglas would grant cer-tiorari and reverse the judgments. See Miller v. California, ante, p. 37.\n",
            "inference": {
              "inference_id": "openai-embeddings",
              "model_settings": {
                "task_type": "text_embedding",
                "dimensions": 1536,
                "similarity": "cosine",
                "element_type": "float"
              },
              "chunks": [
                {
                  "text": "\nC. A. 9th Cir. Reported below: 431 F. 2d 655;\nCt. App. D. C. Reported below: 277 A. 2d 477;\nC. A. 10th Cir. Reported below: 445 F. 2d 945;\nC. A. 9th Cir. Reported below: 455 F. 2d 899; and\nC. A. 2d Cir. Reported below: 475 F. 2d 1393. Certiorari granted, judgments vacated, and cases remanded to the respective United States Courts of Appeals for further consideration in light of Miller v. California, ante, p. 15; Paris Adult Theatre I v. Slaton, ante, p. 49; *914Kaplan v. California, ante, p. 115; United States v. 12 200-ft. Reels Film, ante, p. 123; United States v. Orito, ante, p. 139; Heller v. New York, ante, p. 483; Roaden v. Kentucky, ante, p. 496; and Alexander v. Virginia, ante, p. 836.\nMe. Justice Douglas would grant cer-tiorari and reverse the judgments. See Miller v. California, ante, p. 37.\n",
                  "embeddings": [
                    0.019536218,
                    0.021453455,
                    0.0067173243,
...
albertisfu commented 3 weeks ago

Yes, I got results from that semantic query. So the only issue now is related to the mapping required to support the regular queries we have in production.

legaltextai commented 3 weeks ago

so, just making sure that we update the mapping correctly if we were to add semantic_text to our existing opinion_index? which it sounds like we are not going to do anyway