RECAP extension sends empty GET requests and makes DB sad

sentry-io[bot] commented 1 year ago

Sentry has highlighted that occasionally (233 times so far), the extension sends a GET request to the API like:

https://www.courtlistener.com/api/rest/v3/recap-query/?docket_entry__docket__court=njb&pacer_doc_id__in=%2C%2C%2C%2C%2C%2C%2C%2C%2C%2C

In case that's not obvious, it's just a bunch of commas at the end: ,,,,,,,,.

Kind of a weird one! We should figure out why it's doing this, and make it stop.

Sentry Issue: COURTLISTENER-3TP

QueryCanceled: canceling statement due to statement timeout

  File "django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)

OperationalError: canceling statement due to statement timeout

(23 additional frame(s) were not displayed)
...
  File "django/db/backends/utils.py", line 80, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "pghistory/runtime.py", line 50, in _inject_history_context
    return execute(sql, params, many, context)
  File "django/db/backends/utils.py", line 84, in _execute
    with self.db.wrap_database_errors:
  File "django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)

mlissner commented 1 year ago

RDS performance insights tells me that this is our number one query in terms of DB usage, so we should really deal with this. Here's the query:

SELECT set_config(?, ?, ?);

SELECT set_config(?, ?, ?);

SELECT COUNT(*) AS "__count"
FROM "search_recapdocument"
WHERE ("search_recapdocument"."is_available"
       AND "search_recapdocument"."pacer_doc_id" IN (?)
       AND "search_recapdocument"."docket_entry_id" IN
         (SELECT V0."id"
          FROM "search_docketentry" V0
          WHERE V0."docket_id" IN
              (SELECT U0."id"
               FROM "search_docket" U0
               WHERE U0."court_id" = ?)))

An example of how it's called is:

SELECT set_config('pghistory.context_id', '58c89a5b-91ad-4127-8612-89fd03d00a46', TRUE);

SELECT set_config('pghistory.context_metadata', '{"user": 7742, "url": "/api/rest/v3/recap-query/"}', TRUE);

SELECT COUNT(*) AS "__count"
FROM "search_recapdocument"
WHERE ("search_recapdocument"."is_available"
       AND "search_recapdocument"."pacer_doc_id" IN ('')  -- PROBLEM!!
       AND "search_recapdocument"."docket_entry_id" IN
         (SELECT V0."id"
          FROM "search_docketentry" V0
          WHERE V0."docket_id" IN
              (SELECT U0."id"
               FROM "search_docket" U0
               WHERE U0."court_id" = 'deb')))

mlissner commented 1 year ago

It keeps getting worse! This same API request is also responsible for the second worst SQL query:

SELECT set_config('pghistory.context_id', '64fa6996-dae6-4f13-9649-eb4300a36deb', TRUE);

SELECT set_config('pghistory.context_metadata', '{"user": 7742, "url": "/api/rest/v3/recap-query/"}', TRUE);

SELECT "search_recapdocument"."id",
       "search_recapdocument"."filepath_local",
       "search_recapdocument"."pacer_doc_id"
FROM "search_recapdocument"
WHERE ("search_recapdocument"."is_available"
       AND "search_recapdocument"."pacer_doc_id" IN ('')
       AND "search_recapdocument"."docket_entry_id" IN
         (SELECT V0."id"
          FROM "search_docketentry" V0
          WHERE V0."docket_id" IN
              (SELECT U0."id"
               FROM "search_docket" U0
               WHERE U0."court_id" = 'sdb')))
ORDER BY "search_recapdocument"."id" DESC
LIMIT 21

ERosendo commented 1 year ago

I reviewed almost all the events related to the issue on Sentry trying to find a pattern for this bug. Here are my findings about this bug:

The sentry issue shows the statement timeout happens when the request has a bunch of commas as a query parameter but after reviewing all the events I was able to identify four different cases that cause this timeout:
1. The list of doc_ids is really long (this issue is documented in #218)
2. The list of doc_ids is not included in the request (180602dde854429db87f38be9c79ab0f)
3. The request has a bunch of commas ( like these events: e676ff0979094f99a9f2fe3141143198, 7cf09641b50349ec979f412b6a70f25e and 154add04b499474ea1f755bc166aeb59)
4. The request has a few doc_ids and also has empty elements (commas) ( 579d74281ee449a58944ab39638a3f3f)
I found the following distribution of cases in the list of Events from Sentry:
- Case i : 127 events
- Case ii : 5 events
- Case iii : 246 events
- Case iv : 34 events
Most of the requests are coming from bankruptcy courts

I also reviewed the code of the extension and found that the current implementation creates the list of doc_ids from the anchor tags in the docket report in the findAndStorePacerDocIds method and uses the isDocumentUrl helper to check the format of the links before trying to extract the doc_id and pushing it to the list.

In cases iii and iv seems like the extension is able to identify some links for documents but fails to extract the doc_id and pushes an empty value to the list. This list with empty values turns into a bunch of commas because we join all the values using pacer_doc_ids.join(",") before sending the request to the CL API.

We can check if the extension was able to extract the doc_id before adding it to the list so We can avoid including empty values in the request but I think We also need to check why the extension is failing to extract the doc_id from the URLs (maybe the links have a different format now)

mlissner commented 1 year ago

Case i we'll deal with later.

Any ideas about case ii? It's the least common; maybe we live with it if it's hard to reproduce.

Cases iii and iv seem like the same issue: Some links aren't getting parsed properly, I guess.

I'm surprised case iv is a problem, since it has filters, but it still has the problem of "Why is this happening?"

Do you know the next step for figuring out cases iii and iv?

ERosendo commented 1 year ago

You're right about case ii. It's the least common, and It doesn't give us much information about the case that's causing the issue.

I listed cases iii and iv as two different items because case iv can help us identify dockets with this bug since the extension was able to parse some doc_ids.

I'm planning to use the CL API and the doc_ids from the events identified as case iv to create a list of cases with this issue and review their docket report trying to find the reason the extension is failing to parse the links.

mlissner commented 1 year ago

Great. Sounds perfect.

ERosendo commented 1 year ago

I searched all thedoc_ids from case IV events using the CL API and found the following cases:

ERosendo commented 1 year ago

I was finally able to reproduce one of the cases that causes the timeout exception.

At first, I thought the extension wasn't parsing properly URLs from the Docket report and that was causing the issue, but after reviewing a few of the cases listed in my previous comment I realized that wasn't correct. The extension always sends a list with the doc_ids from the docket report. Then, I started looking for other reports that contain links to documents and realized the extension is failing to parse links from the Claim Register Report ( This can explain why all the events were coming from bankruptcy courts)

I checked the format of the links on the Claims Report and found the following formats:

https://ecf.pamb.uscourts.gov/doc1/154116015440 ( We can find links similar to this one in the docket report)
https://ecf.pamb.uscourts.gov/cgi-bin/show_doc.pl?caseid=260973&claim_id=15342915&claim_num=1-1&magic_num=MAGIC (this is the link that's causing the issue)

I noticed the isDocumentUrl helper method returns true for both URLs, but The extension fails to extract the doc_id from the show_doc.pl link because it expects a /doc1 URL instead.

I found this in the Juriscraper repository, We could use the workaround to rewrite the show_doc.pl links as /doc1 URLs.

Let me know what you think.

mlissner commented 1 year ago

Nice find. That workaround is...interesting. I guess there are two things to think about:

Providing the [R] in the page, which is what this API call is trying to do.
Parsing the data on the server so we can make use of it.

On the server side, we're just not bothering. You can see that Juriscraper knows about four kinds of links, here:

https://github.com/freelawproject/juriscraper/blob/main/juriscraper/pacer/claims_register.py#L269-L287

But if you look a few lines down, you'll see that it only gets the pacer_doc_id from the doc1 URLs type.

On the client side, it's sort of not the end of the world if we don't put the [R] icons in place. I mean, it's a bummer, but, well, PACER is making life hard. We might be able to do your workaround, I don't know, but it feels pretty icky to be making a bunch (how many?) calls to that workaround URL when the page is loaded.

I think for now, because this report isn't very popular, we should just keep it simple. Can we put some code in place that blocks the lookup from happening if it lacks pacer_doc_id values, regardless of whether it comes from the docket report or claims report or somewhere else?

I'm thinking that should be pretty easy and we can create a new issue for fixing claims reports later, when somebody cares about it not working properly?

ERosendo commented 1 year ago

Sure! I will add some checks to avoid parsing the show_doc.pl links.

freelawproject / recap

RECAP extension sends empty GET requests and makes DB sad #341