Closed by grossir 1 week ago
Commands to fill the gaps:
docker exec -it cl-django python /opt/courtlistener/manage.py cl_back_scrape_opinions --courts juriscraper.opinions.united_states.state.conn --backscrape-start=2022 --backscrape-end=2023
The PR in progress also tries to solve an index error on conn, as seen in the Sentry issues below:
Sentry Issue: COURTLISTENER-7TF
Sentry Issue: COURTLISTENER-834
Filed by @grossir
The recently merged backscraper is running, but now it fails at the get_binary_content stage.
When run on standalone juriscraper, it works:
Adding new item:
case_dates: 2024-07-01
case_names: "Woodbridge Newton Neighborhood Environmental Trust v. Connecticut Siting Council"
download_urls: "http://www.jud.ct.gov/external/supapp/Cases/AROcr/CR349/CR349.37.pdf"
precedential_statuses: "Published"
blocked_statuses: False
date_filed_is_approximate: True
docket_numbers: "SC20816"
case_name_shorts: ""
Showing extracted document data (500 chars):
b'%PDF-1.7\r%\xe2\xe3\xcf\xd3\r\n185 0 obj\r<</Linearized 1/L 155750/O 187/E 12301/N 19/T 151929/H [ 629 473]>>\rendobj\r \r\nxref\r\n185 16\r\n0000000016 00000 n\r\n0000001273 00000 n\r\n0000001452 00000 n\r\n0000002522 00000 n\r\n0000002636 00000 n\r\n0000002740 00000 n\r\n0000005390 00000 n\r\n0000005427 00000 n\r\n0000005539 00000 n\r\n0000005623 00000 n\r\n0000009557 00000 n\r\n0000009965 00000 n\r\n0000010470 00000 n\r\n0000010965 00000 n\r\n0000001102 00000 n\r\n0000000629 00000 n\r\ntrailer\r\n<</Size 201/Root 186 0 R/Info 143 0 R/ID['
The backscrape fails at random points: while connecting to the results page, or in the middle of downloading the opinion documents. I had to rerun the command 4 or 5 times to get it done, but it worked!
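Since the failures look transient, rerunning by hand could in principle be replaced by a small retry wrapper. This is only a rough sketch: `run_with_retries` and `flaky_download` are hypothetical names made up for illustration, not part of CourtListener or juriscraper, and the simulated failure stands in for whatever error the real request raised.

```python
import time


def run_with_retries(task, attempts=5, delay_seconds=0.0):
    """Call task() until it succeeds or the attempts run out.

    Hypothetical helper: re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as error:  # broad on purpose: failures happened at varied stages
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # pause before the next try
    raise last_error


# Simulated flaky step (assumption): fails twice, then succeeds.
state = {"calls": 0}


def flaky_download():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated dropped connection")
    return b"%PDF-1.7"


content = run_with_retries(flaky_download, attempts=5)
print(state["calls"], content)
```

A fixed attempt count keeps a genuinely broken scraper from looping forever; a delay between attempts may also be kinder to the court's server.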
Between May 25, 2022 and January 09, 2023 we now have 69 documents (up from 0).
Part of #929
Between May 25, 2022 and January 09, 2023 we have 0 documents.
We are missing documents because the scraper looks for the string "Published in the Law Journal", but in that year the format looks like "Published in the Connecticut Law Journal". This was fixed in a Jan 26, 2024 commit (708dd14f) by removing this statement:
if "Published in the Law Journal" not in row.text_content(): continue
So we only have to implement the backscraper now.
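To illustrate why the old check skipped those rows, here is a minimal sketch of the substring test. The two row strings are invented examples and `passes_old_filter` is a hypothetical helper, not juriscraper code; only the marker string comes from the statement quoted above.

```python
OLD_MARKER = "Published in the Law Journal"


def passes_old_filter(row_text):
    """Return True if the row would have survived the removed check."""
    return OLD_MARKER in row_text


# Invented row texts (assumption): one in each wording.
newer_row = "SC20816 Published in the Law Journal"
older_row = "SC20500 Published in the Connecticut Law Journal"

print(passes_old_filter(newer_row))  # True
print(passes_old_filter(older_row))  # False: "Connecticut" breaks the exact substring
```

Because the match was an exact substring, the extra word was enough to drop every row in the older wording.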