freelawproject / juriscraper

An API to scrape American court websites for metadata.
https://free.law/juriscraper/
BSD 2-Clause "Simplified" License

Fill `conn` gap #987

Closed grossir closed 1 week ago

grossir commented 7 months ago

Part of #929

Between May 25, 2022 and January 09, 2023 we have 0 documents.

We are missing documents because the scraper looks for the string "Published in the Law Journal", but during that period the format was "Published in the Connecticut Law Journal". This was fixed on Jan 26, 2024, in commit 708dd14f, by removing the statement `if "Published in the Law Journal" not in row.text_content(): continue`, so now we only have to implement the backscraper.
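To illustrate why the removed check dropped every row in that period, here is a minimal sketch using plain strings. The row texts and the helper name `old_filter_keeps` are hypothetical and not taken from the scraper; only the marker string and the `not in` check come from the description above.

```python
# Marker string the removed check looked for (from the issue description).
OLD_MARKER = "Published in the Law Journal"


def old_filter_keeps(row_text: str) -> bool:
    """Mimic the removed check: keep a row only if it contains the exact marker."""
    return OLD_MARKER in row_text


# Hypothetical row texts, made up for illustration only.
rows = [
    "Published in the Law Journal of 5/24/22",
    "Published in the Connecticut Law Journal of 6/7/22",
]

# The second row is skipped: "Connecticut" sits inside the phrase,
# so the exact marker is never a substring of it.
kept = [row for row in rows if old_filter_keeps(row)]
print(kept)
```

Because the substring test is exact, any wording change inside the phrase silently drops the row, which matches the gap described above.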

grossir commented 6 months ago

Commands to fill the gaps:

docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.conn --backscrape-start=2022 --backscrape-end=2023

grossir commented 3 months ago

The PR in progress also tries to solve an index error on `conn`, as seen in Sentry Issue: COURTLISTENER-7TF

sentry-io[bot] commented 3 months ago

Sentry Issue: COURTLISTENER-834

Filed by @grossir

The recently merged backscraper is running, but now it fails at the `get_binary_content` stage.

When run on standalone juriscraper, it works...


Adding new item:
    case_dates: 2024-07-01
    case_names: "Woodbridge Newton Neighborhood Environmental Trust v. Connecticut Siting Council"
    download_urls: "http://www.jud.ct.gov/external/supapp/Cases/AROcr/CR349/CR349.37.pdf"
    precedential_statuses: "Published"
    blocked_statuses: False
    date_filed_is_approximate: True
    docket_numbers: "SC20816"
    case_name_shorts: ""
Showing extracted document data (500 chars):
 b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n185 0 obj\r<</Linearized 1/L 155750/O 187/E 12301/N 19/T 151929/H [ 629 473]>>\rendobj\r             \r\nxref\r\n185 16\r\n0000000016 00000 n\r\n0000001273 00000 n\r\n0000001452 00000 n\r\n0000002522 00000 n\r\n0000002636 00000 n\r\n0000002740 00000 n\r\n0000005390 00000 n\r\n0000005427 00000 n\r\n0000005539 00000 n\r\n0000005623 00000 n\r\n0000009557 00000 n\r\n0000009965 00000 n\r\n0000010470 00000 n\r\n0000010965 00000 n\r\n0000001102 00000 n\r\n0000000629 00000 n\r\ntrailer\r\n<</Size 201/Root 186 0 R/Info 143 0 R/ID['

grossir commented 1 week ago

The backscrape fails at random points: while connecting to the results page, or in the middle of downloading the opinion documents. I actually had to rerun the command 4 or 5 times to get it done, but it worked!
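Rerunning by hand could in principle be replaced with a retry wrapper. The following is only a sketch under assumptions: `retry` is a hypothetical helper, not part of juriscraper or CourtListener, and the operation passed to it stands in for whatever flaky step is being repeated (for example a page fetch or a document download).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Waits base_delay * 2**(attempt - 1) seconds between failed attempts.
    The last failure is re-raised unchanged.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # loop always returns or raises


if __name__ == "__main__":
    # Hypothetical flaky step: fails twice, then succeeds.
    state = {"calls": 0}

    def flaky_step() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("simulated random failure")
        return "done"

    print(retry(flaky_step, attempts=5, base_delay=0.1))
```

A wrapper like this only helps if the repeated step is safe to run more than once; whether that holds for each stage of the backscrape is an assumption here.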

Between May 25, 2022 and January 09, 2023 we now have 69 documents (up from 0).