freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Add network-based ranking to search results for enhanced relevancy #4436

Open mlissner opened 2 months ago

mlissner commented 2 months ago

Over in https://github.com/freelawproject/courtlistener/issues/4312#issuecomment-2339308370, we have a little discussion about a new network ranking algorithm that @mattdahl is working on.

He says:

I have a paper coming out soon where I develop a new method for ranking authoritative cases. (I call it HolmesRank -- Holistic Markov Estimation, inspired by Oliver Wendell Holmes's prediction theory of law.) I show that it outperforms existing centrality measures, which have structural properties ill-suited to case law. The model contains a decay parameter that can be learned from the data or set by the user.

Happy to contribute a PR implementing it for CL. However, I don't have a good sense of how computationally expensive it is. I only developed it for SCOTUS cases (n = 28k), but I don't know how it will scale in compute time/resources (since CL has millions of opinions now).

I think this would be great. In the past, with Solr, we computed pagerank once/month, and it cranked out a CSV file that we used to boost results. It worked fine until we moved Solr to its own server and it became too difficult to move the CSV over to that server from the one that computed it.

I suspect Elastic will have a better solution. @albertisfu, do you know if Elastic has a way of pulling boosting scores from a file? Any ideas if it can be done from S3 or something like that?

Matt, if this is doable technically, I'd love to consider doing it. Can you share your paper? I've been wanting a better network rank for a while.

I think if we put this in place, we'd wind up with:

Any others I'm forgetting?

That's feeling like a lot, but maybe it's fine. Maybe we wouldn't need the time decay with this algo?

albertisfu commented 2 months ago

I suspect Elastic will have a better solution. @albertisfu, do you know if Elastic has a way of pulling boosting scores from a file? Any ideas if it can be done from S3 or something like that?

I did some investigation on this:

The ability to store a script in a file to tweak relevance was removed in Elasticsearch 6.0.

This feature was replaced by stored scripts, as explained in the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting-using.html#script-stored-scripts

The documentation says:

You can store and retrieve scripts from the cluster state using the stored script APIs. Stored scripts reduce compilation time and make searches faster.

Using this feature, we would need to create a stored script via the API using a request like:

PUT _scripts/my-stored-script
{
   "script":{
      "lang":"painless",
      "source":"params.scores.getOrDefault(String.valueOf(doc['cluster_id'].value), 1.0)",
      "params":{
         "scores":{
            "1001":1.5,
            "1002":0.8,
            "1003":2.0
         }
      }
   }
}

Then, in the query, we could use a script_score along with this stored script to enhance relevance:

{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "id": "my-stored-script", 
      }
    }
  }
}

However, we should assess the performance of this approach, since the stored script would presumably need to handle millions of parameters (one score per document). Another downside is that new documents won't be included in the stored script's parameters, so they won't be boosted by the relevance script until the parameters are updated.

An alternative approach for implementing this feature in ES could be to use a rank_feature field and query.

This is the method ES recommends for applying static signals (similar to PageRank) to enhance relevance: https://www.elastic.co/guide/en/elasticsearch/reference/current/static-scoring-signals.html https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html

This approach seems to be the best in terms of performance. According to the documentation:

Unlike the function_score query or other ways to change relevance scores, the rank_feature query efficiently skips non-competitive hits when the track_total_hits parameter is not true. This can dramatically improve query speed.

Since we currently have track_total_hits disabled in our queries, we would benefit from this performance improvement.

However, this approach would require adding a new rank_feature field to the documents, which comes with a few constraints described in the documentation linked above.
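Adding the field itself would just be a small mapping change. A minimal sketch, assuming the field is named pagerank (both the index and field names here are placeholders):

PUT opinion-index/_mapping
{
  "properties": {
    "pagerank": {
      "type": "rank_feature"
    }
  }
}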

The rank_feature field can then be used in queries like this:

{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "should": {
        "rank_feature": {
          "field": "pagerank", 
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}

The rank_feature score can only be combined with one of the following mathematical functions: saturation (the default), log, sigmoid, or linear.

One downside to this approach is that it requires adding a new field to the documents and populating it with values. If we only add this field to the parent documents, we could potentially update the new values relatively quickly using an asynchronous update_by_query request.
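A rough sketch of what that asynchronous update could look like, assuming a pagerank rank_feature field and a cluster_id field in _source (the index name, ids, and scores below are placeholders, and in practice we would issue this in batches):

POST opinion-index/_update_by_query?wait_for_completion=false&conflicts=proceed
{
  "query": {
    "terms": {
      "cluster_id": [1001, 1002, 1003]
    }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.pagerank = params.scores[String.valueOf(ctx._source.cluster_id)]",
    "params": {
      "scores": {
        "1001": 1.5,
        "1002": 0.8,
        "1003": 2.0
      }
    }
  }
}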

However, I’m unsure whether the ranking algorithm we'll use can compute scores for new documents at indexing time, or if it would require re-evaluating the entire corpus to compute fresh scores for all documents when new ones are added or removed. If the latter is the case, we’d need to perform a whole reindex for this field each time the scores are recomputed.

Let me know if you need more details about any of these approaches.

mlissner commented 2 months ago

Interesting. Sounds like the second approach is probably the way to go.

@mattdahl can correct me, but I think we'd want to update this value once a month or so, which would be a bummer if it meant updating every value in all 10 million decisions, particularly since the vast majority of the cases don't get cited at all, let alone in a given month.

I haven't seen how his algorithm works, but it'd be really neat if we had a way of re-running it every month and then only updating some of the cases, like those that got new inbound citations, say, or that changed by more than a certain threshold. If it's like Pagerank though, I worry that one new citation will affect the entire network of ranks.

Do you have any thoughts, Matt?

mlissner commented 2 months ago

Looks like we've been talking about network ranking algos for a while and we've got some other research on such things over here: https://github.com/freelawproject/courtlistener/issues/643.

Before we solve this one, we should be sure to check out that work and make sure we're making the right move. Perhaps @mattdahl or @idc9 will know off the top of their head though!

mattdahl commented 1 month ago

Sorry for my slow response! Here is my paper: Dahl_ChainNovel_APSA.pdf

Can you clarify what you think the bottleneck will be? Is it

  1. The calculation of the scores, or
  2. The propagation of the scores to the database (which also triggers reindexing?)?

If the former, then Mike is correct that under my algorithm (which is based on PageRank), any changes to the network will require re-computation of all the scores. So there is no way to only compute scores for the new cases that have just been added in e.g. the last month. I also don't have a good sense of how long the computation would take on the CL network, which is huge. (My paper only looks at the SCOTUS network; cases n = ~27k, citations n = ~200k.) However, my code uses graph-tool, which is written in C++ and is faster than the other main network packages.

If monthly computation is too expensive, it might be possible to simply initialize the scores for new cases (as they get added) at some fixed score, and leave the existing cases' scores intact for longer periods of time. In the paper there is some analysis of what happens to the scores as the network is perturbed, which suggests that the scores are actually pretty stable as new citations are added (Table 2). So we may not lose much by not re-calculating everything very frequently. (Though, again, my paper only looks at the SCOTUS network and the full CL network might exhibit other properties. The Table 2 analysis also doesn't capture the scenario of new cases being added, only new citations.)

If the bottleneck is number 2, I have no idea how ES works and can't really speak to that problem. If it's cheaper to only update in the database cases whose scores have changed by more than some delta, I see no reason not to do that.


On the topic of choosing a network algorithm in the first place, the primary target of my paper is the Kleinberg HITS (or hubs and authorities) algorithm (which was popularized in the legal literature by Fowler et al. and has been widely used). I show that HolmesRank outperforms that, as well as simpler metrics like out-degree and in-degree (Table 1). It's not in the paper, but HolmesRank is also better than PageRank (I think everyone agrees that PageRank is ill-suited to case law because of its temporal bias).

I also just re-read Iain's paper. He looks at some additional algorithms that aren't in my paper, the best of which is CiteRank, which is a time-aware version of PageRank. He also notes the surprising performance of out-degree. HolmesRank builds in both of these things (though it justifies them on different theoretical grounds: for me, the idea is to capture the notion of Shepardizing in the legal research process). But we don't have direct comparisons here. And if all of this is too expensive to compute anyway, it might make sense to just do something simple (like in-degree or out-degree).

mlissner commented 1 month ago

Thanks for the notes, Matt. This is such an interesting problem space.

I haven't read your paper, but if it's anything like PageRank, I think both computing the scores and propagating them into Elastic will be expensive. Doing that once is fine; doing it over and over would make us sad. The ideal solution is one that we could run in realtime, but that's probably impossible.

For calculating the scores, if I recall correctly from when we last did PageRank scores, there are two CPU- and memory-expensive steps:

  • Pulling all the citations from the DB and building them into a network.
  • Calculating PageRank

We started out using networkx for this, but it couldn't perform well enough and we switched to a C-based library (though I forget its name atm). Even with the C-based library though, it took a lot of memory and time to do the calculation. That was ages ago and we have a lot more data now.

For propagating into the search engine, with Solr, the output was a CSV file, so that part was easy. With Elastic, it seems like it'll be a lot harder since we'll need to make 10M update requests to the index each time. Ouch.


So I guess I'm left trying to think of hacks:

  1. Break the network into smaller pieces. Maybe we derive scores for everything from 1600-2000 in one go, and then we have a second network for the 21st century. That's pretty messy, but it's a thought. Or maybe we give everything a score of 0.5 if it's older than 50 years and we only give calculated scores to modern cases.

  2. Is there a way to update a network object instead of loading it from scratch each time? E.g., maybe we pickle the graph object to disk in between runs?

  3. Is there a way to update existing scores instead of calculating them from scratch each time?

  4. If we keep scores from last time, can we only update the ones that have changed significantly from what's currently in the Elasticsearch index? This seems doable just by looking at the old scores and comparing, but we'd need to know what threshold is enough to trigger the updates (rough sketch below).
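Something like this back-of-the-envelope comparison is what I have in mind (pure Python; the 10% threshold is arbitrary, just for illustration):

# old_scores / new_scores: hypothetical dicts mapping cluster_id -> score
def scores_worth_updating(old_scores, new_scores, rel_threshold=0.10):
    """Return only the (cluster_id, score) pairs that changed enough to push to Elastic."""
    changed = {}
    for cluster_id, new in new_scores.items():
        old = old_scores.get(cluster_id)
        # Always push brand-new cases; otherwise only push meaningful relative changes.
        if old is None or abs(new - old) / max(abs(old), 1e-9) > rel_threshold:
            changed[cluster_id] = new
    return changed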


Or maybe we just use out-degree? This is just the number of items a case cites and that's it, right? If we use this, then I suppose we can easily do it in realtime, which is pretty great.

mattdahl commented 3 days ago

For calculating the scores, if I recall correctly from when we last did PageRank scores, there are two CPU- and memory-expensive steps:

  • Pulling all the citations from the DB and building them into a network.
  • Calculating PageRank

This is a helpful way to think about it! I just ran a quick test using graph-tool on the SCOTUS network I have (cases n = ~27k, citations n = ~200k). Creating the network from the list of citations takes 0.9 seconds on my machine; calculating my HolmesRank scores takes 0.03 seconds; and calculating in-degree and out-degree each take a negligible amount of time (they round to 0.00 seconds). These are just some hand-wavy benchmarks, but they suggest that step 1 (network creation) is more expensive than step 2 (the actual score calculation).

This is good for us, because this "hack" you suggested is available to us:

  1. Is there a way to update a network object instead of loading it from scratch each time? E.g., maybe we pickle the graph object to disk in between runs?

This is possible in graph-tool. Once a graph is created, we can save it to disk using the special .gt format: https://graph-tool.skewed.de/static/doc/gt_format.html. Then, when we need to update the graph, we can simply load it back into memory and call add_edge_list() on it with a list of the new citations.
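For concreteness, a minimal sketch of that workflow (the toy citation pairs and file name are just placeholders; in reality the edge list would come from the citations table, keyed by cluster id):

import graph_tool.all as gt

# First run: build the directed citation graph once and persist it.
# With the default hashed=False, the integer pairs are treated as vertex
# indices, so a sparse id space will create some empty placeholder vertices.
g = gt.Graph(directed=True)
g.add_edge_list([(1, 2), (1, 3), (2, 3)])
g.save("citations.gt")

# Later runs: reload the saved graph and append only the new citations,
# instead of rebuilding the whole network from the database.
g = gt.load_graph("citations.gt")
g.add_edge_list([(4, 1), (4, 2)])
g.save("citations.gt")

# The cheap centrality options come straight off the updated graph:
out_degree = g.get_out_degrees(g.get_vertices())
in_degree = g.get_in_degrees(g.get_vertices())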

However, this still leaves the problem of step 2, calculating the scores. I do not know of a way to easily update pre-existing scores, so I think we have three options:

  1. Use out-degree. This solves the update problem because the scores never change!
  2. Use in-degree. These scores do change, but this is probably the cheapest centrality metric to calculate.
  3. Use something complex like HolmesRank/PageRank. These scores do change, and they will be more expensive to calculate. (How much more expensive? I'm not sure how the cost will scale to a network this huge. There's a one-line illustration with graph-tool below.)
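For option 3, once the graph is in memory (or loaded from the .gt file in the sketch above), the score calculation itself is a single call. Here I'm using plain PageRank as a stand-in, since that's what graph-tool ships with, and the damping value is just illustrative:

import graph_tool.all as gt

g = gt.load_graph("citations.gt")      # the graph persisted in the earlier sketch
scores = gt.pagerank(g, damping=0.85)  # vertex property map with one score per case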

At least as a first pass (and possibly as a permanent one), I lean toward option 1. The main objection would be that it is atheoretical, because there is little reason to believe that the number of precedents cited would be a good measure of importance. However, both Iain's work and my own show that out-degree is actually surprisingly okay. In terms of accuracy, my paper shows: HolmesRank > In-Degree > HITS > Out-Degree > Binary Expert Judgments. (My paper is also published now for anyone interested: https://doi.org/10.1111/jels.12401.)

Finally, there is step 3: propagating the scores to ES. I don't have any insight about how to do this efficiently.