freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Feature request: return search scores in `search` API #4312

Open beidelson opened 4 weeks ago

beidelson commented 4 weeks ago

With the impending deprecation of CAP's APIs, I'm looking at integrating CourtListener as a data source for Case Viewer. Case Viewer is built around an "integrated" search interface that pulls results from different sources and tries to prioritize the best results from the combined set. For that purpose, it's very helpful if the search engine provides some cardinal measure of relevance for each result, rather than just ranking each result relative to the others from the same source. CAP does this, for instance, with the `pagerank` field in its API response:

"pagerank": {
  "raw": 4.03580807328026e-08,
  "percentile": 0.010272027442526572
}

That's one of the key inputs I've been using to integrate CAP results with results from Google Scholar and elsewhere. By contrast, CL's search endpoint does rank results, but it doesn't expose whatever score drives that ranking, so it's hard to gauge how good a result is compared to results from outside CL.
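
To make that concrete, here's a rough sketch of the kind of merging a cardinal score enables. This isn't Case Viewer's actual code; the min-max normalization and the sample scores are just illustrations:

```python
# Illustration only: normalize each source's raw scores to [0, 1] so
# results from different engines become roughly comparable, then merge.
def merge_results(results_by_source):
    """results_by_source: {source: [(result_id, raw_score), ...]}"""
    merged = []
    for source, results in results_by_source.items():
        raw_scores = [score for _, score in results]
        lo, hi = min(raw_scores), max(raw_scores)
        span = (hi - lo) or 1.0  # guard against identical scores
        for result_id, score in results:
            merged.append(((score - lo) / span, source, result_id))
    merged.sort(reverse=True)  # best normalized score first
    return merged

print(merge_results({
    "cap": [("smith-v-daca-taxi", 4.0e-08), ("another-case", 1.5e-08)],
    "scholar": [("case-a", 12.3), ("case-b", 3.1)],
}))
```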

It seems like it would be easy to add a score field to the API response and just supply whatever you're already sorting by. (In fact, the v3 search endpoint does have a `pagerank` field, but as far as I can tell it's always null; the v4 search endpoint doesn't have that field at all, I think.)
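
For instance, each result could carry something like this (hypothetical field name and shape; whatever fits the existing serializers):

```python
# Hypothetical sketch of a v4 result with the score exposed.
result = {
    "id": 123456,
    "caseName": "Smith v. Daca Taxi",
    # ...existing fields...
    "score": 7.42,  # whatever value the engine already sorts by
}
```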

Thanks for considering!

P.S. Below is an example of the Case Viewer search UI, with debug annotations on the right listing the sources for each result, to illustrate the kind of use I'm describing. You can see that Smith v. Daca Taxi, a CAP result, was judged better than some of the results from Google Scholar but not others.

[Screenshot: Case Viewer search results with per-result source annotations, 2024-08-15]
mlissner commented 4 weeks ago

We did pagerank many years ago, but discovered that it favored older cases too much and was generally not great for relevancy. When we moved to a more distributed architecture, it was difficult to keep in place, so we let it die. That's why the field is always null in v3 and gone in v4.

There are other network ranking algorithms that should be better than pagerank, but we haven't looked at them in a while. We should.

Until then, we can see if we can return the rank of each result. @albertisfu, do you know how easy that would be?

albertisfu commented 4 weeks ago

> Until then, we can see if we can return the rank of each result. @albertisfu, do you know how easy that would be?

It'd be easy to return the ES score for each result in the API. We just need to consider that scores are only available when the search includes a text query. If only filters are used, the score is always 0 for all the results.
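
For example, here's a sketch of the behavior, assuming the official elasticsearch-py client, a local cluster, and made-up index/field names:

```python
# Sketch only: "opinions", "text", and "court" are hypothetical names.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Text query: each hit carries a meaningful BM25 _score.
with_text = es.search(
    index="opinions",
    body={"query": {"match": {"text": "qualified immunity"}}},
)
print(with_text["hits"]["hits"][0]["_score"])  # e.g. 7.42

# Filter-only query: filter context skips scoring, so _score is 0.
filters_only = es.search(
    index="opinions",
    body={"query": {"bool": {"filter": [{"term": {"court": "scotus"}}]}}},
)
print(filters_only["hits"]["hits"][0]["_score"])  # 0.0 for every hit
```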

mlissner commented 4 weeks ago

Sounds great. We can document the above and then return the score accordingly.

Two other things in the pipeline for our relevancy engine that we might want to think about:

  1. I've wanted to implement relevance decay for eight years (see the sketch after this list): https://github.com/freelawproject/courtlistener/issues/558

  2. We should eventually use a network ranking algorithm like pagerank.
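
For item 1, the sketch mentioned above: one way decay could work in ES is a function_score query that multiplies BM25 by a gaussian decay on the filing date. The field name and scale here are made up:

```python
# Illustration only: "dateFiled" and the 10-year scale are assumptions.
decay_query = {
    "function_score": {
        "query": {"match": {"text": "qualified immunity"}},
        "functions": [
            {
                "gauss": {
                    "dateFiled": {
                        "origin": "now",    # newest cases decay least
                        "scale": "3650d",   # score halves ~10 years out
                        "decay": 0.5,
                    }
                }
            }
        ],
        "boost_mode": "multiply",  # final score = BM25 * decay factor
    }
}
```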

My thought as it relates to this issue is how we want those to factor in. I think we'd want to wind up with a key sort of like CAP had that provides something like:

```
scores: {
  bm25: def,
  network_rank: xyz,
  decay: abc,
  composite: ghi
}
```

That's very much off the top of my head, but the idea is that we'd expose each component score separately, plus a composite that combines them, so API consumers could use whichever signals suit their application.
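
Purely as an illustration, with made-up weights and a made-up linear combination (not a design decision):

```python
# Hypothetical composite: the weights here are arbitrary.
def composite(bm25, network_rank, decay, weights=(0.6, 0.3, 0.1)):
    w_text, w_net, w_decay = weights
    return w_text * bm25 + w_net * network_rank + w_decay * decay

scores = {"bm25": 7.42, "network_rank": 0.013, "decay": 0.85}
scores["composite"] = composite(**scores)
print(scores)
```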

One last thought as I'm riffing: we usually call our ranking algo citegeist (like zeitgeist), because it was once based on the citation graph. Maybe we use that name instead of composite, to be clever?

Anyway, lots to do and think about, but I wanted to put the long term vision together so we can factor it into whatever small fix we do here.

mattdahl commented 3 days ago

Just seeing this issue!

I have a paper coming out soon where I develop a new method for ranking authoritative cases. (I call it HolmesRank -- Holistic Markov Estimation, inspired by Oliver Wendell Holmes's prediction theory of law.) I show that it outperforms existing centrality measures, which have structural properties ill-suited to case law. The model contains a decay parameter that can be learned from the data or set by the user.

Happy to contribute a PR implementing it for CL. However, I don't have a good sense of how computationally expensive it is. I only developed it for SCOTUS cases (n = 28k), but I don't know how it will scale in compute time/resources (since CL has millions of opinions now).

mlissner commented 3 days ago

This issue is about providing the scores in the API. Did you mean this one, about relevancy decay? https://github.com/freelawproject/courtlistener/issues/558

mattdahl commented 2 days ago

I understand that issue to be about using some decay factor to re-rank the results of a given ES query?

My thing is more about constructing a global ranking of cases, just based on historical citation data. So it would be like a plug-in for the `network_rank` field you suggested in your previous comment in this issue.
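
As a stand-in for the shape of that computation (this is plain pagerank via networkx, not HolmesRank, and the citation edges are made up):

```python
# Stand-in only: classic pagerank over a toy citation graph.
import networkx as nx

# Edges run citing -> cited, so heavily cited opinions rank highest.
citations = [(1, 2), (1, 3), (3, 2), (4, 2), (4, 3)]
graph = nx.DiGraph(citations)

# Query-independent global scores, computed once over the whole corpus
# and stored per opinion (a candidate for the network_rank field).
ranks = nx.pagerank(graph, alpha=0.85)
for opinion_id, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(opinion_id, round(score, 4))
```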

mlissner commented 2 days ago

Got it! We need to spin this into a separate issue. This one is just about returning the value, not computing or using it. I'll get that going and copy you. Exciting.