freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License
357 stars, 106 forks

Rhode Island failures #1059

Closed sentry-io[bot] closed 2 months ago

sentry-io[bot] commented 3 months ago

IntegrityError: null value in column "syllabus" of relation "search_opinioncluster" violates not-null constraint

Sentry Issue: COURTLISTENER-7T4

NotNullViolation: null value in column "syllabus" of relation "search_opinioncluster" violates not-null constraint
DETAIL:  Failing row contains (9997092, , 2024-07-03 14:27:38.154395+00, 2024-07-03 14:27:38.154405+00, 2024-06-27, state-of-rhode-island-v-michael-prete, , State of Rhode Island v. Michael Prete, , , C, , , , , null, 0, Published, null, f, 68913048, null, null, null, f, , , , , , , , , , ).
  File "django/db/backends/utils.py", line 105, in _execute
    return self.cursor.execute(sql, params)
  File "psycopg/cursor.py", line 732, in execute
    raise ex.with_traceback(None)

IntegrityError: null value in column "syllabus" of relation "search_opinioncluster" violates not-null constraint
DETAIL:  Failing row contains (9997092, , 2024-07-03 14:27:38.154395+00, 2024-07-03 14:27:38.154405+00, 2024-06-27, state-of-rhode-island-v-michael-prete, , State of Rhode Island v. Michael Prete, , , C, , , , , null, 0, Published, null, f, 68913048, null, null, null, f, , , , , , , , , , ).
(16 additional frame(s) were not displayed)
...
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 390, in handle
    self.parse_and_scrape_site(mod, options)
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 354, in parse_and_scrape_site
    self.scrape_court(site, options["full_crawl"])
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 325, in scrape_court
    save_everything(
  File "cl/scrapers/management/commands/cl_scrape_opinions.py", line 162, in save_everything
    cluster.save(index=False)  # Index only when the opinion is associated.
  File "cl/search/models.py", line 2868, in save
    super().save(update_fields=update_fields, *args, **kwargs)

grossir commented 3 months ago

I think the easiest fix is in CourtListener, by doing this: summary=item.get("summary", "") or ""

https://github.com/freelawproject/courtlistener/blob/dee41a789e8ec7b14ce0adf5ee0d491d71743b32/cl/scrapers/management/commands/cl_scrape_opinions.py#L117-L121

We could also take the keys that come from juriscraper, since "summaries" going to syllabus and "summary" going to "summary" is confusing.

grossir commented 3 months ago

So, the RI server sends a JSON response whose null values are turned into None when parsed. We pass those values on to CourtListener, and that's why it fails. This is part of our lack of output validation, but we can also catch it on the CourtListener side with the "or" trick from my previous comment.
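
For anyone reading later, here is a minimal sketch of why the default in .get() does not help and why the "or" trick does. The payload and the none_to_empty helper are hypothetical, made up for illustration; they are not the actual RI response or code from juriscraper or CourtListener.

```python
import json

# Hypothetical payload: the key is present, but its value is JSON null.
raw = '{"name": "State of Rhode Island v. Michael Prete", "summary": null}'
item = json.loads(raw)  # JSON null becomes Python None

# The default only applies when the key is missing, so this is still None.
without_guard = item.get("summary", "")

# "or" falls back to "" for any falsy value, including None.
with_guard = item.get("summary", "") or ""


def none_to_empty(record, text_fields):
    """Return a copy of record with None replaced by "" for the given keys.

    Illustrative helper only: one way the scraper side could normalize
    text fields before handing items to the consumer.
    """
    return {
        key: ("" if key in text_fields and value is None else value)
        for key, value in record.items()
    }


cleaned = none_to_empty(item, {"summary"})
```

The "or" guard is the smaller change, but it has to be repeated for every text field on the consuming side; normalizing once on the producing side would cover all of them.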

flooie commented 2 months ago

A PR to address RI has been merged.