0_parse.py only scrapes 3 pages

adamrlinder commented 3 years ago

0_parse.py, which scrapes New Criminal Filings and is the entry point of the whole docket downloading workflow, has a hardcoded limit: it scrapes only the first 3 pages of cases from the website. As a result, we have missed data over the last several months. This should be fixed ASAP.

The script needs to be updated to determine how many pages of cases there are and scrape all of them.
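
One possible shape for this, as a rough sketch rather than the actual fix: keep requesting pages until one comes back with no cases. The `fetch_page` callable and the `max_pages` safety cap are hypothetical names, not anything in 0_parse.py, and the sketch assumes the site returns an empty result for a page past the end, which has not been verified.

```python
from typing import Callable, Iterator, List


def scrape_all_pages(
    fetch_page: Callable[[int], List[dict]],  # hypothetical: returns the cases on one page
    max_pages: int = 1000,                    # hypothetical safety cap against endless loops
) -> Iterator[dict]:
    """Yield cases from page 1 onward, stopping at the first empty page."""
    for page in range(1, max_pages + 1):
        cases = fetch_page(page)
        if not cases:
            # Assumption: an empty page means we have gone past the last page.
            break
        yield from cases
```

This avoids hardcoding a page count, at the cost of one extra request for the empty page at the end.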

jshin313 commented 3 years ago

The following code gets the number of pages available by counting the page links on the New Criminal Filings website:

    import requests
    from bs4 import BeautifulSoup

    # Determine page count
    source = requests.get(PAGE_URL, params={"search": record_date}).text
    soup = BeautifulSoup(source, "html.parser")  # name the parser explicitly
    ul = soup.find_all("ul", {"class": "pagination"})[0]

    # Remove the last entry since that's just the link to the next (">>") button
    pages = ul.find_all("li", recursive=False)[:-1]

    num_pages = len(pages)
    end_page = num_pages

I don't know if it's needed, but it's another option.

adamrlinder commented 3 years ago

Merged in a fix. Closing.

CodeForPhilly / pbf-scraping

0_parse.py only scrapes 3 pages #50