RECAP PDF gone before extracting contents

sentry-io[bot] commented 6 months ago

Bit of a weird one. We tried to extract the contents of a PDF, but the PDF isn't in S3. Meanwhile, the admin panel shows the SHA, page count, and other details of the PDF, so we definitely had it at some point.

The website makes it look like we have it, but we don't.
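To find other documents in the same state, something like the following might help. It's only a rough sketch: `PdfRecord` and `find_orphaned_records` are hypothetical names, not anything in the codebase, and `key_exists` is a placeholder for whatever check the storage backend offers.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class PdfRecord:
    """Hypothetical stand-in for a database row describing a RECAP PDF."""

    pk: int
    filepath: str  # storage key; empty if we never had the file
    sha1: str
    page_count: Optional[int]


def find_orphaned_records(
    records: Iterable[PdfRecord],
    key_exists: Callable[[str], bool],
) -> List[PdfRecord]:
    """Return records whose metadata says we have a file but storage does not."""
    orphaned = []
    for record in records:
        if not record.filepath:
            # No file was ever recorded, so there is nothing to be missing.
            continue
        if not key_exists(record.filepath):
            orphaned.append(record)
    return orphaned


# Example with an in-memory set standing in for the bucket contents.
present_keys = {"recap/example.1.pdf"}
rows = [
    PdfRecord(1, "recap/example.1.pdf", "aaa", 3),
    PdfRecord(2, "recap/example.2.pdf", "bbb", 5),
    PdfRecord(3, "", "", None),
]
missing = find_orphaned_records(rows, lambda key: key in present_keys)
```

In a Django project, `key_exists` could plausibly be the storage backend's `exists(name)` method; whether that fits here depends on how the storage is configured.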

Sentry Issue: COURTLISTENER-6Q0

NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
(22 additional frame(s) were not displayed)
...
  File "cl/scrapers/tasks.py", line 245, in extract_recap_pdf
    return async_to_sync(extract_recap_pdf_base)(
  File "cl/scrapers/tasks.py", line 277, in extract_recap_pdf_base
    response = await microservice(
  File "cl/lib/microservice_utils.py", line 84, in microservice
    req = client.build_request(

Filed by @mlissner

mlissner commented 6 months ago

A few screenshots. This is the admin page, before I cleaned it up:

And here's the front end, FWIW:

When you try to get the document at this link:

https://storage.courtlistener.com/recap/gov.uscourts.cafc.20249/gov.uscourts.cafc.20249.32.3.pdf

It throws this error:

mlissner commented 6 months ago

This has been happening for a while; see: https://github.com/freelawproject/courtlistener/issues/3010

freelawproject / courtlistener

RECAP PDF gone before extracting contents #3783