freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Implement Relevance Decay based on filing date #558

Open mlissner opened 8 years ago

mlissner commented 8 years ago

Basic idea: Old cases are less important than new cases. This is a common situation, and we should implement something to demote old cases slightly.

There's a good talk about this here: http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr

mlissner commented 5 years ago

Our resident expert suggests the following in our Slack room:

> [a simple approach is] weighing edges by e^(-t / H), where t = age of the edge and H is some half-life parameter. So do a weighted sum of edges as opposed to counting edges with the same weight.

Seems easier than incorporating it into the Solr query because this just incorporates it into our ranking score for each case, which we're already calculating.
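
A minimal Python sketch of the weighted-sum idea quoted above, assuming "edges" means citations into a case and that age is measured in years. The function names and sample dates are hypothetical and not taken from the CourtListener codebase. Note that with e^(-t / H) exactly as written, an edge of age H gets weight 1/e (about 0.37), so H acts as a decay time constant; a strict half-life would be H times ln 2.

```python
import math
from datetime import date


def edge_weight(age_years: float, h_years: float) -> float:
    """Weight of a single edge: e^(-t / H)."""
    return math.exp(-age_years / h_years)


def decayed_edge_score(edge_dates: list[date], today: date, h_years: float) -> float:
    """Weighted sum of edges, replacing a plain count of edges."""
    total = 0.0
    for edge_date in edge_dates:
        # Clamp at zero so a future-dated edge never weighs more than 1.
        age_years = max((today - edge_date).days, 0) / 365.25
        total += edge_weight(age_years, h_years)
    return total


# Hypothetical dates of three citing opinions.
citing_dates = [date(2023, 6, 1), date(2010, 6, 1), date(1975, 6, 1)]
score = decayed_edge_score(citing_dates, today=date(2024, 6, 1), h_years=50)
print(f"plain count: {len(citing_dates)}, decayed score: {score:.3f}")
```

A plain count would give all three citations the same weight of 1; here the 1975 citation contributes noticeably less than the 2023 one.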

mlissner commented 2 months ago

@albertisfu, I'd be curious to get your estimate here of how hard you think this would be, and what your approach would be.

For case law, I think we'd want a slower decay. Things more than about 50 years old start to lose relevancy. For RECAP, I think we'd want more of a five-year decay?
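
To make the two time scales concrete, here is a small sketch that treats them as true half-lives: a document's multiplier halves every 50 years for case law and every 5 years for RECAP. Reading "start to lose relevancy" as a half-life is an assumption, as are the dictionary keys and the function name.

```python
# Hypothetical half-lives in years, taken from the rough figures in this thread.
HALF_LIFE_YEARS = {"case_law": 50.0, "recap": 5.0}


def decay_multiplier(search_type: str, age_years: float) -> float:
    """Return 0.5 ** (age / half_life): 1.0 when new, 0.5 at one half-life."""
    return 0.5 ** (age_years / HALF_LIFE_YEARS[search_type])


# Compare how quickly each search type loses weight as documents age.
for search_type in HALF_LIFE_YEARS:
    for age in (0, 5, 50):
        print(search_type, age, round(decay_multiplier(search_type, age), 4))
```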

albertisfu commented 2 months ago

I don't think implementing this relevancy function will be difficult. My approach would be similar to what we did when implementing a custom function score for sorting child documents in RECAP, as seen in build_custom_function_score_for_date.

In this case, we would just need to implement the formula proposed above.

There are actually some built-in decay functions for scoring in Elasticsearch, as shown here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html

So, I don't think implementing our custom decay formula will be an issue.
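
For reference, a sketch of what a built-in decay function looks like inside a `function_score` query, written as a plain Python dict. The `exp` function with `origin`, `scale`, and `decay` is part of the Elasticsearch query DSL; the helper name, the `date_filed` and `text` field names, the sample `match` query, and the five-year scale are assumptions for illustration. This is not how `build_custom_function_score_for_date` is written.

```python
import json


def build_decay_function_score(
    base_query: dict, date_field: str, scale_days: int, decay: float = 0.5
) -> dict:
    """Wrap base_query so its score is adjusted by an exponential date decay."""
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": [
                    {
                        "exp": {
                            date_field: {
                                # Decay is measured from the current time.
                                "origin": "now",
                                # Distance from origin at which the function equals `decay`.
                                "scale": f"{scale_days}d",
                                "decay": decay,
                            }
                        }
                    }
                ],
            }
        }
    }


# Roughly a five-year scale for RECAP, on a hypothetical `date_filed` field.
body = build_decay_function_score(
    {"match": {"text": "habeas corpus"}}, "date_filed", scale_days=5 * 365
)
print(json.dumps(body, indent=2))
```

With `decay` set to 0.5, `scale` behaves as a half-life, so a 50-year decay for case law would be about `18250d`.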

I imagine this would be a new "Sorting" option in the frontend and API that can be selected as an alternative to other sorting methods, correct?

Should the decay method take into account the docket's date_filed, or will it be based on the RECAPDocument entry_date_filed?

mlissner commented 2 months ago

Well, having it built in sure makes it easier! I wouldn't want this to be its own sorting field. I'd want it to be part of the overall score. Is that doable?

> Should the decay method take into account the docket's date_filed, or will it be based on the RECAPDocument entry_date_filed?

In general, I think this won't matter too much, because they're usually within a few years of each other, so if this affects performance, either is fine. But I think entry_date_filed is going to be a more accurate value for relevancy, because it reflects something that happened recently, while date_filed could be years in the past for a case that lasts a long time.

albertisfu commented 2 months ago

> I wouldn't want this to be its own sorting field. I'd want it to be part of the overall score. Is that doable?

Yes, in this case, I think we always want to combine the original ES score with the decay score, correct? If that's the case, it's doable. We just need to experiment with the boost_mode parameter to determine which way of boosting the original score with the decay score works best.
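
The `boost_mode` options documented for `function_score` are `multiply` (the default), `replace`, `sum`, `avg`, `max`, and `min`. The sketch below mimics in plain Python how each one would combine a query score with a decay score, as a way to reason about the experiment. The function name and sample scores are made up, and real Elasticsearch scoring involves more than this arithmetic.

```python
def combine_scores(query_score: float, decay_score: float, boost_mode: str = "multiply") -> float:
    """Combine the original query score with the decay function's score."""
    combiners = {
        "multiply": lambda q, f: q * f,  # scale the query score by the decay
        "replace": lambda q, f: f,  # ignore the query score entirely
        "sum": lambda q, f: q + f,
        "avg": lambda q, f: (q + f) / 2,
        "max": max,
        "min": min,
    }
    return combiners[boost_mode](query_score, decay_score)


# Hypothetical text-match score of 8.0 for a document whose decay score is 0.5.
for mode in ("multiply", "replace", "sum", "avg", "max", "min"):
    print(mode, combine_scores(8.0, 0.5, mode))
```

Because a decay function returns a value between 0 and 1, `multiply` scales the original score down proportionally, while `sum` shifts it by at most 1, which may barely matter when text-match scores are large.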

> In general, I think this won't matter too much, because they're usually within a few years of each other, so if this affects performance, either is fine. But I think entry_date_filed is going to be a more accurate value for relevancy, because it reflects something that happened recently, while date_filed could be years in the past for a case that lasts a long time.

Got it. Yes, in this case, I’d recommend using the docket date_filed because its performance would be much better than using entry_date_filed, which requires the score function to iterate through all the matched child documents to compute the score, rather than just checking the parent date_filed.

mlissner commented 2 months ago

Lovely. Let's get this on your backlog, since it should really help with relevance. Maybe we do this and then do https://github.com/freelawproject/courtlistener/issues/4312 as a second step.