freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

RECAP PDF gone before extracting contents #3783

Open sentry-io[bot] opened 6 months ago

sentry-io[bot] commented 6 months ago

Bit of a weird one. We tried to extract the contents of a PDF, but the PDF isn't in S3. Meanwhile, in the admin panel, we know the sha, page count, and other details of the PDF, so we definitely had it.

The website makes it look like we have it, but we don't.

Sentry Issue: COURTLISTENER-6Q0

NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
(22 additional frame(s) were not displayed)
...
  File "cl/scrapers/tasks.py", line 245, in extract_recap_pdf
    return async_to_sync(extract_recap_pdf_base)(
  File "cl/scrapers/tasks.py", line 277, in extract_recap_pdf_base
    response = await microservice(
  File "cl/lib/microservice_utils.py", line 84, in microservice
    req = client.build_request(

Filed by @mlissner
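
The traceback surfaces the missing file as a raw `NoSuchKey` from inside the extraction task. Below is a minimal sketch of checking for the file first and failing with a clearer error. It assumes a storage object that exposes an `exists(name)` method (Django's storage API provides one). `MissingPDFError`, `extract_if_present`, `InMemoryStorage`, and the `extract` callable are hypothetical names for illustration, not CourtListener code.

```python
from typing import Callable, Protocol, Set


class SupportsExists(Protocol):
    """Anything with an exists(name) method, e.g. a Django storage backend."""

    def exists(self, name: str) -> bool: ...


class MissingPDFError(Exception):
    """Hypothetical error: the database has a path, but storage has no file."""


class InMemoryStorage:
    """Stand-in for S3 so the sketch runs without network access."""

    def __init__(self, keys: Set[str]) -> None:
        self._keys = set(keys)

    def exists(self, name: str) -> bool:
        return name in self._keys


def extract_if_present(
    storage: SupportsExists,
    filepath: str,
    extract: Callable[[str], str],
) -> str:
    """Run `extract` only when the file is really in storage."""
    if not storage.exists(filepath):
        # Fail early with the key in the message instead of a bare NoSuchKey.
        raise MissingPDFError(f"PDF not found in storage: {filepath}")
    return extract(filepath)
```

A caller could catch `MissingPDFError` and flag the record for cleanup rather than letting the task crash.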

mlissner commented 6 months ago

A few screenshots. This is the admin page, before I cleaned it up:

image

And here's the front end, FWIW:

image

When you try to get the document at this link:

https://storage.courtlistener.com/recap/gov.uscourts.cafc.20249/gov.uscourts.cafc.20249.32.3.pdf

It throws this error:

image

mlissner commented 6 months ago

This has been happening for a while; see: https://github.com/freelawproject/courtlistener/issues/3010
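
Since this keeps recurring, one way to find other affected documents would be to audit for records whose metadata says we have a PDF while storage says otherwise. This is a rough sketch under stated assumptions: `DocRecord` and `find_missing_files` are hypothetical names, the fields only mirror the details mentioned above (sha, page count, file path), and `exists` can be any callable that reports whether a key is present in storage.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DocRecord:
    """Hypothetical stand-in for a database row describing a RECAP PDF."""

    pk: int
    filepath: str  # storage key; empty if we never had the file
    sha1: str  # non-empty means the file was hashed at some point
    page_count: int


def find_missing_files(
    records: Iterable[DocRecord],
    exists: Callable[[str], bool],
) -> List[DocRecord]:
    """Return records that claim to have a PDF that storage cannot find."""
    missing = []
    for rec in records:
        claims_file = bool(rec.filepath and rec.sha1)
        if claims_file and not exists(rec.filepath):
            missing.append(rec)
    return missing


if __name__ == "__main__":
    # A set stands in for the bucket; the keys here are made up.
    fake_bucket = {"recap/example/present.pdf"}
    rows = [
        DocRecord(1, "recap/example/present.pdf", "abc123", 3),
        DocRecord(2, "recap/example/gone.pdf", "def456", 5),
    ]
    for rec in find_missing_files(rows, fake_bucket.__contains__):
        print(f"record {rec.pk}: {rec.filepath} is missing from storage")
```

Passing the existence check in as a callable keeps the audit independent of the storage backend, so it can be tried against a plain set before pointing it at real storage.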