freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Opinions removed from the database were not removed from search engines #3967

Open albertisfu opened 2 months ago

albertisfu commented 2 months ago

After completing #3897 and checking the ES Opinions Search, I discovered something unusual: some OpinionClusters appear in the results, but clicking on one of them triggers a 404 error. I confirmed that these clusters have been removed from the database.

Steps to reproduce:

Here are some examples:

McGuire v. Third Avenue Railroad (N.Y. App. Div. 1896)

https://www.courtlistener.com/?q=cluster_id%3A5348161&type=o&order_by=score desc&stat_Precedential=on

People v. Jordan (N.Y. App. Div. 2010)

https://www.courtlistener.com/?q=cluster_id%3A5947072&type=o&order_by=score desc&stat_Precedential=on

Goldin v. Kelly (N.Y. App. Div. 2010)

https://www.courtlistener.com/?q=cluster_id%3A5947192&type=o&order_by=score desc&stat_Precedential=on

Harrison v. Bezio (N.Y. App. Div. 2010)

https://www.courtlistener.com/?q=cluster_id%3A5948279&type=o&order_by=score desc&stat_Precedential=on

In re Clor (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A6012898&type=o&order_by=score desc&stat_Precedential=on

People v. Johns (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A6013129&type=o&order_by=score desc&stat_Precedential=on

People v. McCrae (N.Y. App. Div. 2011)

https://www.courtlistener.com/?q=cluster_id%3A5970993&type=o&order_by=score desc&stat_Precedential=on

People v. Russ (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A5990821&type=o&order_by=score desc&stat_Precedential=on

People v. Badman (N.Y. App. Div. 2012)

https://www.courtlistener.com/?q=cluster_id%3A5991871&type=o&order_by=score desc&stat_Precedential=on

In re Foley (N.Y. App. Div. 1998)

https://www.courtlistener.com/?q=cluster_id%3A6161808&type=o&order_by=score desc&stat_Precedential=on

People v. Jones (N.Y. App. Div. 1998)

https://www.courtlistener.com/?q=cluster_id%3A6163163&type=o&order_by=score desc&stat_Precedential=on

People v. Healey (N.Y. App. Div. 2000)

https://www.courtlistener.com/?q=cluster_id%3A6181359&type=o&order_by=score desc&stat_Precedential=on

In re Merante (N.Y. App. Div. 2015)

https://www.courtlistener.com/?q=cluster_id%3A6184542&type=o&order_by=score desc&stat_Precedential=on

@mlissner or @flooie Would you know what the process was for removing these clusters from the database? This way, we can identify all the IDs to remove from the Opinion Index and also consider the deletion method used so it can trigger an automatic deletion next time.

mlissner commented 2 months ago

Oof! Why do these come up first in the search results? Any idea?

I don't remember why we removed content around Jan. 15th, but maybe Bill does, or maybe we can check our Slack/Github/Email logs around then?

Is it possible that a queryset.objects.delete() wouldn't trigger signals?

albertisfu commented 2 months ago

> Oof! Why do these come up first in the search results? Any idea?

Well, that query only filters by the default status, Published, so the results don't have scores. I believe the ordering has more to do with the order in which documents are matched across segments/shards. Either it's a weird coincidence that they're shown first in the results, or there are many deleted clusters spread randomly throughout the index.

> I don't remember why we removed content around Jan. 15th, but maybe Bill does, or maybe we can check our Slack/Github/Email logs around then?

Yeah, it could have been anytime from January 15th until now. I reviewed the code to look for methods that remove clusters from the DB, but I didn't find anything. I'm wondering if that could have been done directly at the DB level?

> Is it possible that a queryset.objects.delete() wouldn't trigger signals?

I just confirmed that using a queryset like: OpinionCluster.objects.filter(pk__in=[20,19]).delete()

It does trigger signals correctly.

The same is true when deleting a single instance, which also triggers signals:

opinion = OpinionCluster.objects.get(pk=18)
opinion.delete()

mlissner commented 2 months ago

> could have been done directly at the DB level?

It's...possible, but extremely unlikely. I almost never delete with SQL, because it freaks me out. Too much power and not enough language support.

It sounds like we won't know the cause. Is there a way to fix this? I guess we'll have to check all of the millions of items in the index to see if they're in the DB?

albertisfu commented 2 months ago

> It sounds like we won't know the cause. Is there a way to fix this? I guess we'll have to check all of the millions of items in the index to see if they're in the DB?

Yeah, that's the way to fix it. We can pull IDs from the index in batches of ~1000 items or so to avoid making too many requests. Then, in Django, we can filter by those IDs, also in batches, check which ones were not found, and remove them from the index.
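
A minimal sketch of that batched check, using only the standard library. The names `batched`, `find_orphaned_ids`, and `fetch_existing` are hypothetical and not part of the CourtListener codebase. How the IDs are read from the index and how orphans are removed from it are left out. The sketch only shows the comparison step, assuming the index IDs arrive as a plain iterable of integers.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Set


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def find_orphaned_ids(
    index_ids: Iterable[int],
    fetch_existing: Callable[[List[int]], Set[int]],
    batch_size: int = 1000,
) -> List[int]:
    """Return the IDs present in the index but missing from the database.

    `fetch_existing` is a hypothetical callable that receives one batch of
    IDs and returns the subset that still exists in the database. In Django
    it could plausibly be written as:
        set(OpinionCluster.objects.filter(pk__in=batch)
            .values_list("pk", flat=True))
    """
    orphans: List[int] = []
    for batch in batched(index_ids, batch_size):
        existing = fetch_existing(batch)  # one DB query per batch
        orphans.extend(pk for pk in batch if pk not in existing)
    return orphans


if __name__ == "__main__":
    # Stand-in for the database: only these cluster IDs still exist.
    db_ids = {1, 2, 4}
    # Stand-in for the IDs found in the search index.
    index_ids = [1, 2, 3, 4, 5]
    orphans = find_orphaned_ids(
        index_ids, lambda batch: db_ids.intersection(batch), batch_size=2
    )
    print(orphans)  # IDs to remove from the index: [3, 5]
```

Passing the database lookup in as a callable keeps the batching logic testable without a database, and the batch size can be tuned to balance the number of queries against the size of each `pk__in` filter.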

mlissner commented 2 months ago

Bleh. That sounds unpleasant, but we'd better do it. Let's set this as down the road, though, because I want to get to alerts as soon as possible and this isn't particularly harmful to users.