CodeForPhilly / pbf-scraping

Project for Philadelphia Bail Fund to scrape new criminal filings from municipal court
https://codeforphilly.github.io/pbf-scraping

Docket Scraping #1: A script to download the Docket & Court Summary PDFs #12

Closed alteredbritt closed 3 years ago

alteredbritt commented 3 years ago

Need: to scrape the dockets from the PA Court docket search site

Requirements:
- a scraping script (preferably Python) to download the PDFs
- takes the Docket # output of the New Criminal Filings scraping script as input to get the PDFs
- runs daily
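The linking requirement could be sketched roughly as follows. This is an illustration, not the project's code: it assumes docket numbers arrive as plain strings from the New Criminal Filings scraper and look like `CP-46-CR-0006239-2015`. The regex, the `plan_downloads` name, and the dated folder layout are all hypothetical.

```python
import re
from datetime import date
from pathlib import Path

# Assumed docket number shape, e.g. "MC-51-CR-0012345-2020" (an assumption,
# not an official format specification).
DOCKET_RE = re.compile(r"^(MC|CP)-\d{2}-[A-Z]{2}-\d{7}-\d{4}$")


def plan_downloads(docket_numbers, out_dir="dockets", run_date=None):
    """Map each valid, de-duplicated docket number to a dated PDF path."""
    run_date = run_date or date.today()
    day_dir = Path(out_dir) / run_date.isoformat()
    plan = {}
    for raw in docket_numbers:
        number = raw.strip().upper()
        # Skip anything that does not look like a docket number.
        if DOCKET_RE.match(number):
            plan[number] = day_dir / f"{number}_docket.pdf"
    return plan
```

A daily job (cron or similar) could call `plan_downloads` on that day's scraped docket numbers and then fetch one PDF per entry, so each run's files land in their own dated folder.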

Site link: https://ujsportal.pacourts.us/DocketSheets/MC.aspx

Discussion Notes:
- using a headless browser to mimic a human clicking through
- Code for Philly teams doing similar work: PLSE Expungement record parsing
- will eventually be stored in a data lake for analysis
- will update with more soon as I review my notes again!

machow commented 3 years ago

Here's a script to download a docket, based on https://github.com/CLSPhila/RecordLib/blob/87dcc657e39b5a95a7d62cf168e9d0d96c7c26a2/scripts/download_dockets.py#L36

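# Install dependencies first (run in a shell, not in Python):
#   pip install git+https://github.com/CLSPhila/django-docketsearch.git
#   pip install django requests lxml aiohttp

import requests
from ujs_search.services import searchujs

# Search by participant name (shown for reference; not used below)
r_search = searchujs.search_by_name("Kathleen", "Kane", court="CP")

# Look up a single docket by its docket number
r_link = searchujs.search_by_docket("CP-46-CR-0006239-2015")

# Download the docket sheet PDF from the first search result
url = r_link[0]["docket_sheet_url"]
r_pdf = requests.get(url, headers={"User-Agent": "ParsingThing"})
r_pdf.raise_for_status()  # fail early rather than saving an error page as a PDF

with open("example_docket.pdf", "wb") as f:
    f.write(r_pdf.content)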
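For a daily run over many dockets, the single-docket script above might be wrapped in a loop along these lines. This is a hedged sketch under stated assumptions: `save_dockets`, `fetch_pdf`, and the `lookup` parameter are hypothetical names, `lookup` is assumed to return results shaped like the script above (a list of dicts with a `"docket_sheet_url"` key), and the standard library's `urllib` is swapped in for `requests` so the sketch has no third-party dependencies.

```python
import urllib.request
from pathlib import Path


def fetch_pdf(url):
    """Download one URL and return its bytes; urlopen raises on HTTP errors."""
    req = urllib.request.Request(url, headers={"User-Agent": "ParsingThing"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()


def save_dockets(docket_numbers, lookup, fetch=fetch_pdf, out_dir="."):
    """Save one PDF per docket number; return (number, error) pairs that failed.

    `lookup` is a hypothetical callable: docket number in, list of result
    dicts out, each assumed to carry a "docket_sheet_url" key.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for number in docket_numbers:
        try:
            results = lookup(number)
            url = results[0]["docket_sheet_url"]
            (out / f"{number}.pdf").write_bytes(fetch(url))
        except Exception as exc:
            # Record the failure and keep going so one bad docket
            # does not stop the rest of the batch.
            failed.append((number, repr(exc)))
    return failed
```

Passing `searchujs.search_by_docket` as `lookup` would be the natural fit if it returns the shape used above; the returned `failed` list gives the next run something to retry.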
adamrlinder commented 3 years ago

I think this is closable now that Hruday has a method for doing this