Closed sentry-io[bot] closed 1 year ago
RDS performance insights tells me that this is our number one query in terms of DB usage, so we should really deal with this. Here's the query:
SELECT set_config(?, ?, ?);
SELECT set_config(?, ?, ?);
SELECT COUNT(*) AS "__count"
FROM "search_recapdocument"
WHERE ("search_recapdocument"."is_available"
AND "search_recapdocument"."pacer_doc_id" IN (?)
AND "search_recapdocument"."docket_entry_id" IN
(SELECT V0."id"
FROM "search_docketentry" V0
WHERE V0."docket_id" IN
(SELECT U0."id"
FROM "search_docket" U0
WHERE U0."court_id" = ?)))
An example of how it's called is:
SELECT set_config('pghistory.context_id', '58c89a5b-91ad-4127-8612-89fd03d00a46', TRUE);
SELECT set_config('pghistory.context_metadata', '{"user": 7742, "url": "/api/rest/v3/recap-query/"}', TRUE);
SELECT COUNT(*) AS "__count"
FROM "search_recapdocument"
WHERE ("search_recapdocument"."is_available"
AND "search_recapdocument"."pacer_doc_id" IN ('') -- PROBLEM!!
AND "search_recapdocument"."docket_entry_id" IN
(SELECT V0."id"
FROM "search_docketentry" V0
WHERE V0."docket_id" IN
(SELECT U0."id"
FROM "search_docket" U0
WHERE U0."court_id" = 'deb')))
It keeps getting worse! This same API request is also responsible for the second worst SQL query:
SELECT set_config('pghistory.context_id', '64fa6996-dae6-4f13-9649-eb4300a36deb', TRUE);
SELECT set_config('pghistory.context_metadata', '{"user": 7742, "url": "/api/rest/v3/recap-query/"}', TRUE);
SELECT "search_recapdocument"."id",
"search_recapdocument"."filepath_local",
"search_recapdocument"."pacer_doc_id"
FROM "search_recapdocument"
WHERE ("search_recapdocument"."is_available"
AND "search_recapdocument"."pacer_doc_id" IN ('')
AND "search_recapdocument"."docket_entry_id" IN
(SELECT V0."id"
FROM "search_docketentry" V0
WHERE V0."docket_id" IN
(SELECT U0."id"
FROM "search_docket" U0
WHERE U0."court_id" = 'sdb')))
ORDER BY "search_recapdocument"."id" DESC
LIMIT 21
I reviewed almost all the events related to the issue on Sentry trying to find a pattern for this bug. Here are my findings about this bug:
The sentry issue shows the statement timeout happens when the request has a bunch of commas as a query parameter but after reviewing all the events I was able to identify four different cases that cause this timeout:
I found the following distribution of cases in the list of Events from Sentry:
Most of the requests are coming from bankruptcy courts
I also reviewed the code of the extension and found that the current implementation creates the list of doc_ids from the anchor tags in the docket report in the findAndStorePacerDocIds
method and uses the isDocumentUrl
helper to check the format of the links before trying to extract the doc_id
and pushing it to the list.
In cases iii and iv seems like the extension is able to identify some links for documents but fails to extract the doc_id and pushes an empty value to the list. This list with empty values turns into a bunch of commas because we join all the values using pacer_doc_ids.join(",")
before sending the request to the CL API.
We can check if the extension was able to extract the doc_id
before adding it to the list so We can avoid including empty values in the request but I think We also need to check why the extension is failing to extract the doc_id
from the URLs (maybe the links have a different format now)
Case i we'll deal with later.
Any ideas about case ii? It's the least common; maybe we live with it if it's hard to reproduce.
Cases iii and iv seem like the same issue: Some links aren't getting parsed properly, I guess.
I'm surprised case iv is a problem, since it has filters, but it still has the problem of "Why is this happening?"
Do you know the next step for figuring out cases iii and iv?
You're right about case ii. It's the least common, and It doesn't give us much information about the case that's causing the issue.
I listed cases iii and iv as two different items because case iv can help us identify dockets with this bug since the extension was able to parse some doc_ids.
I'm planning to use the CL API and the doc_ids
from the events identified as case iv to create a list of cases with this issue and review their docket report trying to find the reason the extension is failing to parse the links.
Great. Sounds perfect.
I searched all thedoc_ids
from case IV events using the CL API and found the following cases:
I was finally able to reproduce one of the cases that causes the timeout exception.
At first, I thought the extension wasn't parsing properly URLs from the Docket report and that was causing the issue, but after reviewing a few of the cases listed in my previous comment I realized that wasn't correct. The extension always sends a list with the doc_ids
from the docket report. Then, I started looking for other reports that contain links to documents and realized the extension is failing to parse links from the Claim Register Report ( This can explain why all the events were coming from bankruptcy courts)
I checked the format of the links on the Claims Report and found the following formats:
I noticed the isDocumentUrl
helper method returns true
for both URLs, but The extension fails to extract the doc_id
from the show_doc.pl
link because it expects a /doc1
URL instead.
I found this in the Juriscraper repository, We could use the workaround to rewrite the show_doc.pl
links as /doc1
URLs.
Let me know what you think.
Nice find. That workaround is...interesting. I guess there are two things to think about:
Providing the [R] in the page, which is what this API call is trying to do.
Parsing the data on the server so we can make use of it.
On the server side, we're just not bothering. You can see that Juriscraper knows about four kinds of links, here:
But if you look a few lines down, you'll see that it only gets the pacer_doc_id
from the doc1 URLs type.
On the client side, it's sort of not the end of the world if we don't put the [R] icons in place. I mean, it's a bummer, but, well, PACER is making life hard. We might be able to do your workaround, I don't know, but it feels pretty icky to be making a bunch (how many?) calls to that workaround URL when the page is loaded.
I think for now, because this report isn't very popular, we should just keep it simple. Can we put some code in place that blocks the lookup from happening if it lacks pacer_doc_id values, regardless of whether it comes from the docket report or claims report or somewhere else?
I'm thinking that should be pretty easy and we can create a new issue for fixing claims reports later, when somebody cares about it not working properly?
Sure! I will add some checks to avoid parsing the show_doc.pl
links.
Sentry has highlighted that occasionally (233 times so far), the extension sends a GET request to the API like:
https://www.courtlistener.com/api/rest/v3/recap-query/?docket_entry__docket__court=njb&pacer_doc_id__in=%2C%2C%2C%2C%2C%2C%2C%2C%2C%2C
In case that's not obvious, it's just a bunch of commas at the end:
,,,,,,,,
.Kind of a weird one! We should figure out why it's doing this, and make it stop.
Sentry Issue: COURTLISTENER-3TP