Open albertisfu opened 2 months ago
Oof! Why do these come up first in the search results? Any idea?
I don't remember why we removed content around Jan. 15th, but maybe Bill does, or maybe we can check our Slack/Github/Email logs around then?
Is it possible that a queryset.objects.delete() wouldn't trigger signals?
> Oof! Why do these come up first in the search results? Any idea?
Well, that query only filters by the default status, Published, so the results don't have scores. I believe it's more about the order in which they're matched in segments/shards. It could be a weird coincidence that they're shown first in the results, or there may be many deleted clusters spread randomly throughout the index.
> I don't remember why we removed content around Jan. 15th, but maybe Bill does, or maybe we can check our Slack/Github/Email logs around then?
Yeah, it could have been anytime from January 15th until now. I reviewed the code to look for methods that remove clusters from the DB, but I didn't find anything. I'm wondering if that could have been done directly at the DB level?
> Is it possible that a queryset.objects.delete() wouldn't trigger signals?
I just confirmed that deleting through a queryset, like:
OpinionCluster.objects.filter(pk__in=[20,19]).delete()
does trigger signals correctly.
Deleting a single instance also triggers signals:
opinion = OpinionCluster.objects.get(pk=18)
opinion.delete()
> could have been done directly at the DB level?
It's...possible, but extremely unlikely. I almost never delete with SQL, because it freaks me out. Too much power and not enough language support.
It sounds like we won't know the cause. Is there a way to fix this? I guess we'll have to check all of the millions of items in the index to see if they're in the DB?
> It sounds like we won't know the cause. Is there a way to fix this? I guess we'll have to check all of the millions of items in the index to see if they're in the DB?
Yeah, that's the way to fix it. We can pull IDs from the index in batches of ~1000 items or so to avoid making too many requests. Then, in Django, filter by those IDs, also in batches, check which ones were not found in the DB, and remove those from the index.
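A rough sketch of that batching logic, in case it helps the discussion. The names `batched`, `find_orphaned_ids`, and `fetch_existing` are made up for illustration and are not existing CourtListener code. The DB lookup is passed in as a callable so the sketch runs without Django or Elasticsearch; in practice it could be something like `set(OpinionCluster.objects.filter(pk__in=batch).values_list("pk", flat=True))`, and the indexed IDs would come from scrolling the ES index.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Set


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` IDs."""
    iterator = iter(ids)
    while batch := list(islice(iterator, size)):
        yield batch


def find_orphaned_ids(
    indexed_ids: Iterable[int],
    fetch_existing: Callable[[List[int]], Set[int]],
    batch_size: int = 1000,
) -> Iterator[int]:
    """Yield IDs that are in the index but missing from the DB.

    `fetch_existing` is a hypothetical hook: given a batch of IDs, it
    returns the subset that still exists in the database.
    """
    for batch in batched(indexed_ids, batch_size):
        existing = fetch_existing(batch)  # one DB query per batch
        yield from (pk for pk in batch if pk not in existing)


# Toy data standing in for the index and the DB.
indexed = [1, 2, 3, 4, 5]
in_db = {1, 3, 5}
orphans = list(
    find_orphaned_ids(indexed, lambda batch: in_db.intersection(batch), batch_size=2)
)
print(orphans)  # [2, 4]
```

The orphaned IDs could then be removed from the index, again in batches, so the number of requests stays proportional to the number of batches rather than the number of documents.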
Bleh. That sounds unpleasant, but we'd better do it. Let's set this as down the road though, because I want to get to alerts as soon as possible and this isn't particularly harmful to users.
After completing #3897 and doing a check in the ES Opinions Search, I discovered something unusual: some OpinionClusters appear in the results, but clicking on the OpinionCluster triggers a 404 error. I confirmed that these have been removed from the database.
Steps to reproduce:
Here are some examples:
cluster_ids, they are indexed as well.
McGuire v. Third Avenue Railroad (N.Y. App. Div. 1896)
https://www.courtlistener.com/?q=cluster_id%3A5348161&type=o&order_by=score desc&stat_Precedential=on
People v. Jordan (N.Y. App. Div. 2010)
https://www.courtlistener.com/?q=cluster_id%3A5947072&type=o&order_by=score desc&stat_Precedential=on
Goldin v. Kelly (N.Y. App. Div. 2010)
https://www.courtlistener.com/?q=cluster_id%3A5947192&type=o&order_by=score desc&stat_Precedential=on
Harrison v. Bezio (N.Y. App. Div. 2010)
https://www.courtlistener.com/?q=cluster_id%3A5948279&type=o&order_by=score desc&stat_Precedential=on
In re Clor (N.Y. App. Div. 2012)
https://www.courtlistener.com/?q=cluster_id%3A6012898&type=o&order_by=score desc&stat_Precedential=on
People v. Johns (N.Y. App. Div. 2012)
https://www.courtlistener.com/?q=cluster_id%3A6013129&type=o&order_by=score desc&stat_Precedential=on
People v. McCrae (N.Y. App. Div. 2011)
https://www.courtlistener.com/?q=cluster_id%3A5970993&type=o&order_by=score desc&stat_Precedential=on
People v. Russ (N.Y. App. Div. 2012)
https://www.courtlistener.com/?q=cluster_id%3A5990821&type=o&order_by=score desc&stat_Precedential=on
People v. Badman (N.Y. App. Div. 2012)
https://www.courtlistener.com/?q=cluster_id%3A5991871&type=o&order_by=score desc&stat_Precedential=on
In re Foley (N.Y. App. Div. 1998)
https://www.courtlistener.com/?q=cluster_id%3A6161808&type=o&order_by=score desc&stat_Precedential=on
People v. Jones (N.Y. App. Div. 1998)
https://www.courtlistener.com/?q=cluster_id%3A6163163&type=o&order_by=score desc&stat_Precedential=on
People v. Healey (N.Y. App. Div. 2000)
https://www.courtlistener.com/?q=cluster_id%3A6181359&type=o&order_by=score desc&stat_Precedential=on
In re Merante (N.Y. App. Div. 2015)
https://www.courtlistener.com/?q=cluster_id%3A6184542&type=o&order_by=score desc&stat_Precedential=on
@mlissner or @flooie Would you know what the process was for removing these clusters from the database? This way, we can identify all the IDs to remove from the Opinion Index and also consider the deletion method used so it can trigger an automatic deletion next time.