freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

RECAP email parsing error #697

Open sentry-io[bot] opened 1 year ago

sentry-io[bot] commented 1 year ago

Just gotta stay on top of these in case this is a new format we haven't seen yet.


Sentry Issue: COURTLISTENER-44Z

ValueError: empty separator
(3 additional frame(s) were not displayed)
...
  File "cl/recap/tasks.py", line 2165, in process_recap_email
    data, body = open_and_validate_email_notification(self, epq)
  File "cl/recap/tasks.py", line 2062, in open_and_validate_email_notification
    data = report.data

grossir commented 5 months ago

The URL leads to a "The issue you were looking for was not found." page in Sentry.

grossir commented 5 months ago

The email that causes this error is pvfopi0rd9vie91iu3fq6k6d5v77nq1j61mvtg01 for court azd. It seems to be a single-event issue; I have found no other instance in Sentry.

I have pasted the full traceback at the end.

The problem is that the email has unusual formatting, which causes the juriscraper/pacer/email/_get_case_name_plain function to pick an empty string as the case name, even though the case name exists.

So, when the code splits the subject by case name, subject.split(case_name), to get the short description, str.split('') raises a ValueError.
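A minimal reproduction of that failure, using only the standard library. The subject line and the split_subject helper are made up for illustration and are not juriscraper's code; the guard is just one possible way to avoid the exception.

```python
# Hypothetical subject line, not taken from the real email.
subject = "Activity in Case 2:23-cv-00969-MTL Crews v. DeSantis Order"


def split_subject(subject: str, case_name: str) -> list[str]:
    """Split the subject around the case name, tolerating an empty case name."""
    if not case_name:
        # str.split("") raises ValueError, so return the subject unsplit.
        return [subject]
    return subject.split(case_name)


# Splitting on an empty string is what raises the error seen in Sentry.
try:
    subject.split("")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

parts = split_subject(subject, "Crews v. DeSantis")
unsplit = split_subject(subject, "")
```

With a guard like this, an empty case name leaves the subject in one piece instead of raising, so a len(...) > 1 check on the result simply fails to find a match.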

juriscraper parses this as a text/plain email; I forced it to parse it as text/html and it doesn't error, but it still fails to pick up the case name.

The email looks like this: [image]

but is formatted internally like this (grep -A4 -B2 Name: pvfopi0rd9vie91iu3fq6k6d5v77nq1j61mvtg01.eml):

The following transaction was entered on 6/30/2023 at 4:15 PM MST and filed=
 on 6/30/2023
Case Name:
Crews v. DeSantis
Case Number:
2:23-cv-00969-MTL<https://ecf.azd.uscourts.gov/cgi-bin/DktRpt.pl?1336577>
Filer:
--
<td style=3D"padding:.75pt .75pt .75pt .75pt">
<p class=3D"MsoNormal"><strong><span style=3D"font-family:&quot;Calibri&quo=
t;,sans-serif">Case Name:</span></strong>
<o:p></o:p></p>
</td>
<td style=3D"padding:.75pt .75pt .75pt .75pt">
<p class=3D"MsoNormal">Crews v. DeSantis<o:p></o:p></p>

The newline between "Case Name:" and the value makes regex = r"Case Name:(.*)" pick up an empty string. This seems like a bug in itself; it should be regex = r"Case Name:(.+)". But that would not fix this specific ValueError.
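A small sketch of how the two patterns behave on text shaped like the plain-text excerpt above. The plain_body string is a trimmed stand-in for the real email, and the third, whitespace-tolerant pattern is only an assumption about one possible fix, not something juriscraper currently does.

```python
import re

# Trimmed stand-in for the plain-text part shown in the grep output.
plain_body = "Case Name:\nCrews v. DeSantis\nCase Number:\n2:23-cv-00969-MTL\n"

# Pattern quoted in this comment: "." does not match a newline, so the
# group captures the (empty) remainder of the "Case Name:" line.
current = re.search(r"Case Name:(.*)", plain_body)
current_name = current.group(1).strip()

# Stricter pattern: needs at least one character on the same line,
# so on this input there is no match at all.
strict = re.search(r"Case Name:(.+)", plain_body)

# Hypothetical tolerant pattern: skip whitespace, including the newline,
# before capturing the value on the following line.
tolerant = re.search(r"Case Name:\s*(.+)", plain_body)
tolerant_name = tolerant.group(1).strip()
```

Here current_name is empty, strict is None, and tolerant_name is the case name. The tolerant pattern has a trade-off: if the value were really missing, it would capture whatever the next non-empty line is, so a caller would likely still need to validate the result. How a missing match is handled downstream is outside this sketch.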

Full traceback:

ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 report.data

File ~/venvs/courtlistener/lib/python3.12/site-packages/juriscraper/pacer/email.py:76, in NotificationEmail.data(self)
     69 base = {
     70     "court_id": self.court_id,
     71 }
     72 if self.content_type == "text/plain":
     73     parsed = {
     74         "appellate": self._is_appellate(),
     75         "contains_attachments": self._contains_attachments_plain(),
---> 76         "dockets": self._get_dockets(),
     77         "email_recipients": self._get_email_recipients_plain(),
     78     }
     79 else:
     80     parsed = {
     81         "appellate": self._is_appellate(),
     82         "contains_attachments": self._contains_attachments(),
     83         "dockets": self._get_dockets(),
     84         "email_recipients": self._get_email_recipients(),
     85     }

File ~/venvs/courtlistener/lib/python3.12/site-packages/juriscraper/pacer/email.py:387, in NotificationEmail._get_dockets(self)
    381 if self.content_type == "text/plain":
    382     docket_number = self._get_docket_number_plain()
    383     docket = {
    384         "case_name": self._get_case_name_plain(),
    385         "docket_number": docket_number,
    386         "date_filed": None,
--> 387         "docket_entries": self._get_docket_entries(),
    388     }
    389     dockets.append(docket)
    390     # Cache the docket number for its later use.

File ~/venvs/courtlistener/lib/python3.12/site-packages/juriscraper/pacer/email.py:471, in NotificationEmail._get_docket_entries(self, current_node)
    464         case_url = self._get_case_anchor(current_node)
    466 if description is not None:
    467     entries = [
    468         {
    469             "date_filed": self._get_date_filed(),
    470             "description": description,
--> 471             "short_description": self._get_short_description(),
    472             "document_url": document_url,
    473             "document_number": document_number,
    474             "pacer_doc_id": None,
    475             "pacer_case_id": None,
    476             "pacer_seq_no": None,
    477             "pacer_magic_num": None,
    478         }
    479     ]
    480     if document_url is not None:
    481         entries[0]["pacer_doc_id"] = get_pacer_doc_id_from_doc1_url(
    482             document_url
    483         )

File ~/venvs/courtlistener/lib/python3.12/site-packages/juriscraper/pacer/email.py:618, in NotificationEmail._get_short_description(self)
    613 subject = clean_string(self.subject)
    614 for case_name in self.case_names:
    615     # cases_names is a list of strings that can contain one or multiple
    616     # elements in multi-docket NEF where the case_name referenced in the
    617     # subject might change. This find the right case_name match.
--> 618     subject_split_case_name = subject.split(case_name)
    619     if len(subject_split_case_name) > 1:
    620         break

ValueError: empty separator

grossir commented 5 months ago

By the way, I wrote a wiki entry on how to find these

https://github.com/freelawproject/courtlistener/wiki/Finding-recap.email-errors-with-deleted-Sentry-issues

mlissner commented 5 months ago

Thanks @grossir. Are you working on the fix or are you saying something more is needed to deal with this?

mlissner commented 5 months ago

Thanks for the wiki page too. Super helpful. As we/you write more of these, we can also think about ways of linking them to the full wiki, so people can find them more easily.