freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

recap.email: ParserError: Document is empty #862

Open sentry-io[bot] opened 10 months ago

sentry-io[bot] commented 10 months ago

@grossir, @flooie: We'll want to prioritize this one and get it analyzed as quickly as possible.


Sentry Issue: COURTLISTENER-635

ParserError: Document is empty
(7 additional frame(s) were not displayed)
...
  File "cl/recap/tasks.py", line 2250, in process_recap_email
    appellate_doc_num = get_document_number_for_appellate(
  File "cl/corpus_importer/tasks.py", line 1718, in get_document_number_for_appellate
    document_number = get_document_number_from_confirmation_page(
  File "cl/corpus_importer/tasks.py", line 1679, in get_document_number_from_confirmation_page
    doc_num_report.query(pacer_doc_id)

grossir commented 10 months ago

The document in question was parsed successfully in another processing queue

From the Sentry issue, the failing document is from court ca4 and has pacer_doc_id '00409653890'. On the CourtListener API, we get 3 matches on RecapProcessingQueue for this pacer_doc_id and court: 2 were processed successfully and 1 failed. The failing queue's id matches the one reported on Sentry: <ProcessingQueue: 12151650>

We also have their corresponding Email Processing Queues, all with the same message_id mj7ca7f39vp36ldrtnlv6qjp1nm5rg5emj1l0rg1 and all created within 3 minutes of each other.

From all these queues, one document has been processed successfully (CL, API), which matches the document linked in the original email

It has the document_number field that the Sentry issue reports we failed to get on the get_document_number_from_confirmation_page call

I think the real question is why we got 3 queues for the same original message_id
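To make the duplicate question concrete, here is a minimal sketch of how queue records could be grouped by message_id to flag repeated deliveries. It assumes each record is a plain dict with `id`, `message_id`, and `date_created` keys; the function name `find_duplicate_deliveries` and the record shape are hypothetical and not taken from the CourtListener code.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicate_deliveries(queues, window=timedelta(minutes=3)):
    """Return {message_id: [queue ids]} for messages queued more than once.

    `queues` is an iterable of dicts with hypothetical keys
    "id", "message_id" and "date_created" (a datetime).
    """
    by_message = defaultdict(list)
    for queue in queues:
        by_message[queue["message_id"]].append(queue)

    duplicates = {}
    for message_id, group in by_message.items():
        if len(group) < 2:
            continue  # a single delivery is the expected case
        created = sorted(q["date_created"] for q in group)
        # Only flag groups whose first and last records are close in time.
        if created[-1] - created[0] <= window:
            duplicates[message_id] = [q["id"] for q in group]
    return duplicates
```

Running this over the three queues described above would return one entry keyed by the shared message_id, which is the pattern that needs explaining.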

mlissner commented 10 months ago

Is it possible that we got this three times because three accounts are subscribed? I.e., it's the same message but sent to three different @recap.email addresses?

grossir commented 10 months ago

I think that's not the case, since all 3 queues have the same message_id

The Email Processing Queues all have the same "destination_emails", which is a single @recap.email address. In fact, the only significant difference is the date_created field. I cannot see the uploader field, since the API doesn't show it and I do not have access to the production DB.

In the email file itself, which is the same for all 3 queues, there is only one @recap.email address (grep recap mj7ca7f39vp36ldrtnlv6qjp1nm5rg5emj1l0rg1), and it is the same as the one in destination_emails above. I have pasted the email contents as text on the Sentry issue if you want to take a look.

mlissner commented 10 months ago

It's possible the lambda that hits the API did retries, but I can't imagine why it would.

But the underlying problem seems to be that the magic link is used when we call get_document_number_from_confirmation_page. Is that right?

grossir commented 10 months ago

I think the magic number is used earlier, in download_pacer_pdf_and_save_to_pq.

The confirmation page download happens in juriscraper, and there is no mention of magic numbers there.

Anyway, it shouldn't get to the get_document_number_from_confirmation_page call, since it first tries to get the document number from doctor's /utils/document-number/pdf/ endpoint, and that actually works for this case.
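The lookup order described here can be sketched as a simple fallback. This is an illustration only, not the actual code in cl/corpus_importer/tasks.py: `resolve_document_number` and its two callable parameters are hypothetical stand-ins for the doctor PDF lookup and the confirmation page lookup.

```python
from typing import Callable, Optional


def resolve_document_number(
    from_pdf: Callable[[], Optional[str]],
    from_confirmation_page: Callable[[], Optional[str]],
) -> Optional[str]:
    """Try the PDF-based lookup first; use the confirmation page only as a fallback.

    Both arguments are hypothetical zero-argument callables that return a
    document number, or None (or an empty string) when nothing was found.
    """
    document_number = from_pdf()
    if document_number:
        # The PDF lookup succeeded, so the confirmation page is never requested.
        return document_number
    return from_confirmation_page()
```

Under this ordering, a confirmation page failure can only surface when the PDF lookup returned nothing, which is why reaching that call is unexpected for a document whose PDF lookup works.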

mlissner commented 10 months ago

Hm, could it be that Celery ran the tasks out of order?