freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Missing pacer free documents #4143

Closed quevon24 closed 2 months ago

quevon24 commented 3 months ago

Bill recently realized that some free PACER documents (opinions) are missing, even though the scraper continues to run every day.

I've been digging to see why this is happening.

Manually inspecting "mad", I found these missing documents:

5/18/2024 – 6/17/2024: 7 documents missing
3/18/2024 – 4/17/2024: 5 documents missing
2/18/2024 – 3/17/2024: ~11 documents missing
2/18/2024 – 3/17/2024: ~22 documents missing

I have been testing the CourtListener command and running it directly in juriscraper, and it works perfectly; it does not lose documents.

I already checked a database dump and couldn't find any failed rows related to the results above. The only way a PACERFreeDocumentRow can be deleted is if the entry has already been created in CourtListener, so I thought the missing documents might be in CourtListener (although it would be strange if RECAP didn't mark them), but I couldn't find them in the system.

I also verified that the command correctly merges the docket entry when the docket already exists.

In my local environment I was seeing a MaxRetriesExceededError exception, which is related to the Celery task; I had to rerun the command to get all the documents into my local instance. I've already been investigating Sentry, but I couldn't find any logs related to this, so maybe the problem is not there.
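
For reference, this is roughly how that exception comes about (a generic Celery sketch, not the actual CourtListener task; the task and helper names are placeholders):

```python
from celery import Celery

app = Celery("example")


def download(row_id):
    # Placeholder for the real PACER fetch; pretend it fails transiently.
    raise ConnectionError("simulated transient failure")


@app.task(bind=True, max_retries=3)
def fetch_free_document(self, row_id):
    try:
        download(row_id)
    except ConnectionError as exc:
        # Each failure schedules a retry; once max_retries is exhausted,
        # self.retry() raises celery.exceptions.MaxRetriesExceededError
        # and the document for this row is never fetched.
        raise self.retry(exc=exc, countdown=5)
```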

The only thing I can think of is that those documents were for some reason added after the scraper ran, but I didn't see any timestamp or similar field to verify this.

I'm still thinking about why this happened; any suggestions are welcome.

mlissner commented 3 months ago

What would happen if the court blocked us for a day or for part of a night?

quevon24 commented 3 months ago

> What would happen if the court blocked us for a day or for part of a night?

It is a good suggestion, and I already checked what you mentioned. Assuming we were blocked, I would expect to receive a 403 or some other error status code. We use the raise_for_status() function, which would throw an exception when trying to generate the report. I checked in Sentry but did not see any error related to this, so it's unlikely to be what happened.
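
This is the behavior I'm relying on (a minimal sketch using plain requests; the URL is a placeholder, not the actual report code):

```python
import requests

# Placeholder URL; the real report request goes through the PACER session.
response = requests.get("https://ecf.example.uscourts.gov/written_opinions")

# If the court blocked us (e.g. with a 403), this raises requests.HTTPError,
# which would surface as an event in Sentry when the report is generated.
response.raise_for_status()
```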

mlissner commented 3 months ago

Have you seen the messages in Sentry about "Unable to get free documents between X and Y" or similar? It's an error we get pretty regularly?

quevon24 commented 3 months ago

> Have you seen the messages in Sentry about "Unable to get free documents between X and Y" or similar? It's an error we get pretty regularly?

Yes, I had already seen that error. Ramiro shared the logs with me so I can check whether the problem is there.

quevon24 commented 3 months ago

To add some additional information: it seems it is possible for them to add more opinions for the same day after the fact. Yesterday morning (2024-06-28) I could see 4 opinions for "wawd"; today (2024-06-29) I checked the same date (2024-06-28) and there are 49 opinions instead of 6. In addition, it seems that for that specific court the scraper is failing a lot to obtain the free documents. I am still working on finding the issue, because it does not seem to be happening with all the courts.

mlissner commented 3 months ago

Sounds like you may have found the heart of it! Seems like courts must have some strange process for marking things as free opinions.

quevon24 commented 3 months ago

Today (2024-07-01) they marked a new document as a free opinion: 50 instead of 49.

quevon24 commented 2 months ago

I think I finally found the error. The problem seems to be related to a header added because of ProxyPacerSession: when you check the response headers while running the command in CL, they include 'Transfer-Encoding': 'chunked', which means the server sends the response in pieces. The problem is that, even when using iter_content(), not all the pieces are received. Perhaps the server does not handle that well, and that is why the response is cut off?
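
Roughly what I mean (a sketch with plain requests; the URL is a placeholder for the real report request):

```python
import requests

response = requests.get(
    "https://ecf.example.uscourts.gov/written_opinions", stream=True
)

# Present when the command runs through ProxyPacerSession in CL,
# absent when using PacerSession directly in juriscraper.
print(response.headers.get("Transfer-Encoding"))  # e.g. "chunked"

# The body is assembled chunk by chunk; if the connection is cut before the
# last chunk arrives, we end up with a partial report even with a 200 status.
body = b"".join(chunk for chunk in response.iter_content(chunk_size=8192))
```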

In juriscraper we do not have that class. If I run a script there directly with the PacerSession class, the error does not occur (it only happens if I throttle the connection down to absurd, dial-up-like speeds), the response comes back complete without problems, and it does not contain the Transfer-Encoding header.

I already tried swapping ProxyPacerSession for PacerSession in CourtListener, and it seems to return the complete response without problems (no more Transfer-Encoding header). The catch is that we cannot do this (Alberto already told me about the problem that the IP changes for each pod, and we require the same IP for all requests). So now I have to find out why that header is being added to the response (something we send is making the server respond like this).

I will update this as soon as I find the answer or something of interest.

quevon24 commented 2 months ago

Reviewing the logs of the webhook-sentry container, I observed that for the courts that fail it always shows this message:

WARNING Socket idle read time out reached:

Eduardo suggested that I increase the read timeout using an environment variable. It took me a little while to figure out the correct format, but it seems this finally solved the problem.

I suppose this happens because some of the court servers are slower than others to return the chunks; when the timeout is reached, the content is cut off at the last piece received, even though the proxy still returns a 200 code.

For one of the courts, filtering by a single day, it takes almost 27 seconds to return a page with only 6 opinions. The proxy's default read timeout is 10 seconds.

For the moment I set that value to 30 seconds on the cl-webhook-sentry service in docker compose, using this format:

READTIMEOUT=30s

I found the correct format for the environment variable here: https://github.com/juggernaut/webhook-sentry/blob/master/proxy/config.go#L40
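
In case it helps, this is roughly what that looks like in the compose file (a sketch; everything except the environment entry stands in for the existing service definition):

```yaml
services:
  cl-webhook-sentry:
    # ...existing image, ports, and other settings stay as they are...
    environment:
      # Raise the proxy's idle read timeout from its 10s default; the value
      # is a Go duration string, per webhook-sentry's config.go.
      - READTIMEOUT=30s
```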

I'll try to find the value that best fits our requirements, or we could simply set it to a higher value like 60s and monitor the logs to tweak it.

mlissner commented 2 months ago

Woah, nice one. Yeah, let's set this to a higher value, maybe 60s, and see if that fixes it. Do we have a way to view, say, a week's worth of logs to see in a week if 60s is enough?

quevon24 commented 2 months ago

> Woah, nice one. Yeah, let's set this to a higher value, maybe 60s, and see if that fixes it. Do we have a way to view, say, a week's worth of logs to see in a week if 60s is enough?

I think we only need to watch the logs of the cl-webhook-sentry service; we will see something like:

2024-07-06 13:54:36 [2024-07-06T19:54:36Z] 00000000-0000-0000-0000-000000000001 172.18.0.11:46200 GET https://xyz 200 906ms
2024-07-06 13:54:46 [2024-07-06T19:54:46Z] 00000000-0000-0000-0000-000000000002 WARNING Socket idle read time out reached: 
2024-07-06 13:54:46 [2024-07-06T19:54:46Z] 00000000-0000-0000-0000-000000000002 172.18.0.11:46210 POST https://xyz/?nonce_val 200 60000ms

It will still fail, but we will know that 60s was not enough, and then we can increase it little by little.

Also, with the last changes I merged, we will now get an "IndexError: list index out of range" in Sentry.

Now I am going to try to determine whether it is possible to find the gap in each of the courts that failed.

quevon24 commented 2 months ago

With Ramiro's help we set the READTIMEOUT variable to 60s and ran the command so that I could review the logs of the webhook-sentry service.

The "Socket idle read time out reached" problem no longer occurred; the longest response took 35968ms (~36s), so it seems 60s is enough.

With this, the command should no longer have problems collecting the latest free documents.

The next step is to complete the command to sweep each week.

quevon24 commented 1 month ago

The READTIMEOUT variable required an update; the value changed from 60s to 90s.

The value was changed because some courts (3 or 4) were taking more than 60s to return the response.

quevon24 commented 1 month ago

This is the latest error found in the webhook-sentry container:

WARNING Error occured reading response from target: context deadline exceeded

This relates directly to Go: the message indicates that an operation exceeded the time limit specified in a context. Contexts are used to manage the lifetime of operations, handle cancellations, and pass information across function calls, and a context can have a deadline that specifies when it should expire.

In our specific case, the operation being performed (a network request) may exceed the time limit specified in the context.

What I described above happens in this part of the code: https://github.com/juggernaut/webhook-sentry/blob/master/proxy/proxy.go#L208

The maximum context time is set to 60s by default here: https://github.com/juggernaut/webhook-sentry/blob/master/proxy/config.go#L39

Some POST requests exceeded 60000ms, and increasing the READTIMEOUT value did not help, because the limitation is now the context timeout.

The CONNECTIONLIFETIME environment variable was set to 90s (I mentioned above that it defaults to 60s); it is what limits the execution time of the context.
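
Putting the two changes together, the proxy environment at this point should be roughly (same format as the READTIMEOUT example above):

```
READTIMEOUT=90s
CONNECTIONLIFETIME=90s
```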

Tomorrow, after the command runs, I will add an update on whether this change was enough.

mlissner commented 1 month ago

Getting some Go expertise in here. Good stuff!

quevon24 commented 1 month ago

It seems the problem has been solved; there were no Sentry events today.

Taking a look at the logs, the longest response was from nvd, and it took 80822ms.

mlissner commented 1 month ago

Great. @albertisfu is running his major iquery backscrape, so let's wait for that to finish and then do our backscrape of this content nice and slowly. :)