Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

PA municipal scraper #240

Open josh-chamberlain opened 9 months ago

josh-chamberlain commented 9 months ago

Context

Related to data source request 102

Pennsylvania publishes municipal, county, and state budgets. It's possible to find individual municipal budgets, which include police budgets, but cumbersome to get a bunch at once. Let's make a scraper which can be run to iterate through the interface and collect them all. Each municipality has its own police force.

Source 1. Municipal and police budget: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm

*(screenshot)*

Source 2. Police details: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=MuniPolice_Excel

*(screenshot)*

Source 3. Municipal demographics: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=CountyMuniDemo_Excel

*(screenshot)*

Requirements

| column | source |
| --- | --- |
| Municipality | 1, 2, 3 (use this one to link them together) |
| MUNI_ID | 2, 3 |
| county | 2 |
| Population | 3 |
| Police Service Type | 2 |
| Full Time Officers | 2 |
| Part Time Officers | 2 |
| Contract to Others? | 2 |
| Contracted to Municipalities | 2 |
| Contracted from Municipality | 2 |
| Regional Name | 2 |
| Regional Municipalities | 2 |
| 2019 Police Expenditures | 1 |
| 2019 Total Expenditures | 1 |
| 2020 Police Expenditures | 1 |
| 2020 Total Expenditures | 1 |
| 2021 Police Expenditures | 1 |
| 2021 Total Expenditures | 1 |
| 2022 Police Expenditures | 1 |
| 2022 Total Expenditures | 1 |
| 2023 Police Expenditures | 1 |
| 2023 Total Expenditures | 1 |
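Since Municipality is the linking column across all three sources, the join could be sketched roughly as below (Node.js, matching the Puppeteer-based scraper discussed later in this thread; the function name and sample rows are purely illustrative, not real data):

```javascript
// Hypothetical sketch: merge rows from the three sources on the shared
// "Municipality" column. Later sources overwrite duplicate fields.
function mergeByMunicipality(...sources) {
  const merged = new Map();
  for (const rows of sources) {
    for (const row of rows) {
      const key = row.Municipality;
      merged.set(key, { ...(merged.get(key) ?? {}), ...row });
    }
  }
  return [...merged.values()];
}

// One illustrative row per source:
const source1 = [{ Municipality: "ASPINWALL BORO", "2019 Police Expenditures": "$624,961.00" }];
const source2 = [{ Municipality: "ASPINWALL BORO", MUNI_ID: "20063", "Police Service Type": "Own Municipal Police Force" }];
const source3 = [{ Municipality: "ASPINWALL BORO", Population: 2916 }];
const rows = mergeByMunicipality(source1, source2, source3);
```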

Example

Here's a sample from a manually generated document from ~2020:

| Municipality | MUNI_ID | Population | Police Service Type | Full Time Officers | Part Time Officers | Contract to Others? | Contracted to Municipalities | Contracted from Municipality | Regional Name | Regional Municipalities | 2019 Police Expenditures | 2019 Total Expenditures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ALEPPO TWP | 20034 | 1825 | Police Service contracted from another municipality | 0 | 0 |  |  | OHIO TWP |  |  | $104,266.00 | $2,253,234.00 |
| ASPINWALL BORO | 20063 | 2916 | Own Municipal Police Force | 7 | 2 | NO |  |  |  |  | $624,961.00 | $3,932,034.00 |
| AVALON BORO | 20093 | 4762 | Own Municipal Police Force | 6 | 7 | NO |  |  |  |  | $1,223,255.00 | $5,713,865.00 |
| BALDWIN BORO | 20123 | 21510 | Own Municipal Police Force | 24 | 0 | NO |  |  |  |  | $3,918,035.00 | $17,874,484.00 |
| BALDWIN TWP | 20154 | 1985 | Own Municipal Police Force | 6 | 0 | NO |  |  |  |  | $775,007.00 | $2,009,493.00 |
| BELL ACRES BORO | 20183 | 1505 | Own Municipal Police Force | 5 | 4 | NO |  |  |  |  | $491,571.00 | $2,284,663.00 |
maxachis commented 9 months ago

Looking at the web sources, the problem can be broken down into three parts:

  1. Iterate through all possible selection options for each page. Where possible, use a "select all" option.
  2. Traverse the generated tables (which are often paginated) and extract the relevant data.
  3. Put the relevant data into CSV format.

Of these, 3 is fairly straightforward; 1 and 2 will carry most of the complexity. Selenium IDE would likely be one option for components requiring a user interface, but there might be easier ways to do it if UX interaction isn't necessary (perhaps by automating the form submissions?). I'd need to take a look at the most up-to-date libraries for scraping.

Additionally, it's worth noting whether this information is provided through alternative means, such as an RSS feed.
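For the CSV step, a minimal serializer (field quoting per RFC 4180) might look like this sketch in Node.js; `toCsv` and its column list are illustrative names, not part of the actual scraper:

```javascript
// Hypothetical sketch of step 3: serialize row objects to CSV text.
// Fields containing quotes, commas, or newlines are quoted, with
// embedded quotes doubled per RFC 4180.
function toCsv(rows, columns) {
  const escape = (v) => {
    const s = String(v ?? "");
    return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  const lines = [columns.map(escape).join(",")];
  for (const row of rows) {
    lines.push(columns.map((c) => escape(row[c])).join(","));
  }
  return lines.join("\n");
}
```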

maxachis commented 9 months ago

Gonna try approaching it with Puppeteer.

josh-chamberlain commented 9 months ago

@maxachis did you find the line for budgets? Can't tell if you deleted your comment or if it's just not showing.

  1. head to https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm
  2. i picked allegheny county, braddock hills, and 2021
  3. page 3 of the report, revenues & expenditures, there's a public safety section—police $533,540
  4. for wilkinsburg, 2021, same thing $3,179,327
maxachis commented 9 months ago

> @maxachis did you find the line for budgets? Can't tell if you deleted your comment or if it's just not showing.
>
> 1. head to https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm
> 2. i picked allegheny county, braddock hills, and 2021
> 3. page 3 of the report, revenues & expenditures, there's a public safety section—police $533,540
> 4. for wilkinsburg, 2021, same thing $3,179,327

I was able to find it. I was looking at the wrong URL 🙃. I deleted the message because it was just me making an error.

I've got code that, with a few more tweaks, I should be able to use to start pulling the data. I'll start off by just pulling the download urls for everything, rather than downloading the files directly. After that, I'll work on creating another script that will pull the data and process the relevant information.

The primary question for me is where all the data should ultimately go. I should be able to produce the CSVs of the data without too much trouble, but where they'll ultimately be placed is another question.

josh-chamberlain commented 9 months ago

@maxachis ok, great! Thank you for working on this. For now, let's just have them go in the same directory as the scraper, in a "data" folder or something. We can use GitHub Actions to run them, if we ever need to, or just manually update and make a PR.
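If we do end up running it on GitHub Actions, a scheduled workflow could look roughly like this sketch (the script name, paths, and schedule are all assumptions here, not the repo's actual layout):

```yaml
# Hypothetical workflow sketch; since the source data only changes
# yearly, a monthly schedule is more than enough.
name: pa-municipal-scraper
on:
  schedule:
    - cron: "0 6 1 * *"   # 06:00 UTC on the 1st of each month
  workflow_dispatch:       # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: node scraper.js --output data/
```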

maxachis commented 9 months ago

Additionally, I note that a lot of the Municipal Finance data contains a wealth of information, including sources of income relating to the police, such as Fines and Forfeits and "Public Safety Charges for Service". That's outside the scope of this task, but it seems like it could be valuable intel, and it would be possible to extend or modify this scraper to gather that information.

josh-chamberlain commented 9 months ago

@maxachis yeah, agreed—ton of good stuff there. if you make an issue with your ideas to extend this scraper, someone could do the enhancement!

how's this going, btw? need anything?

maxachis commented 9 months ago

So far progressing! I've developed alpha-level code that can iterate through and download the options for Source 1, a script for finding the relevant data in the downloaded excel files, and a SQLite database for caching the attempts (so a person can retain progress if the scraper halts mid-process) and storing some of the relevant information. Still need to sort out kinks, and eventually expand to the other Sources. Helpfully, those other Sources don't seem as demanding as Source 1.

You can track the current status of my scraper here: https://github.com/maxachis/pa_municipal_scraper

maxachis commented 9 months ago

Scraper for Source 1 is progressing apace. I now have it running continuously.

The primary bottleneck is the network speed of the Municipal website, which is slow to respond. I'm able to work around this by utilizing Node.js's concurrent processing, essentially having multiple webscrapers operating at a time (currently 10). In theory, I could increase the number of webscrapers, but each webscraper costs memory, and I only have so much computer. Plus, and perhaps I'm being overly cautious, I'm not sure how many concurrent requests this government website can handle. In theory, even 100 of my scrapers shouldn't pose a problem for it, but I don't know how brittle the backend is.
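The concurrency pattern described above, N scrapers pulling from a shared work queue, can be sketched with plain promises (no Puppeteer needed for the skeleton; `runPool` is an illustrative name, and each task would wrap one page-scrape in practice):

```javascript
// Hypothetical sketch: run async tasks with at most `workers` in flight.
async function runPool(tasks, workers) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // the event loop is single-threaded, so this claim is race-free
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: workers }, worker));
  return results;
}
```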

Still, I can currently process around 50 entries per minute, and that's probably a conservative estimate. At that rate, assuming no interruptions (which is an assumption), processing would be done within 14 hours of continuous operation, AKA I could easily have it done by the next Friday meeting, and likely much sooner. I've currently got a little over 5,800 of approximately 41,000 possible entries scraped, including the majority of Allegheny County (I say majority only because some entries are not available).

I've done spot checking to validate I'm pulling the correct data, but it's possible there are other errors I'll only discover later, which would of course necessitate rerunning some or all of the code.

One note on data integrity: Some of these financial reports do not indicate any police expenditures. By my count, approximately 26% of what I've processed overall (and happily only 3% of Allegheny county) do not report any police expenditures. Why this is, I don't know.

Sources 2 and 3 should be considerably easier. I should be able to pull both through a single request, and after that it's just a matter of parsing the Excel files.

maxachis commented 9 months ago

I ran into an instance where the rate at which I was able to download data slowed considerably between Saturday and Sunday. While I'm not sure how plausible it is that they detected that bots were snorkeling up their data, it did raise the question of how to ethically scrape the website. From what I can tell, a decent rule of thumb is to only pull data around as fast as a human user could. Unfortunately, that does mean that pulling additional data will take longer than I had planned.

Fortunately, I do have the majority of the Allegheny County information, and can move forward with that, while still running the scraper continuously to trickle in the remaining data. But this does bring up a few additional questions:

  1. What do we consider to be an acceptable rate with which to pull data from these websites? Currently, at my rate-limited, uh, rate, I pull in about 1-2 entries per minute. Obviously, I'd like to go faster, but don't want to upset anyone managing the Muni website, nor degrade their service.
  2. Might it be prudent to email the webmasters, announce our goals and interests, and request either a data dump or an API to more efficiently scrape the data?
  3. Would we want to pre-emptively pull more data than what we're pulling now, in anticipation of potential future requests?
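A "human-speed" throttle along the lines of question 1 could be as simple as a fixed delay between sequential requests; the 40-second default below corresponds to the 1-2 entries per minute mentioned above and is purely illustrative, as are the function names:

```javascript
// Hypothetical sketch: fetch one URL at a time with a fixed pause between
// requests, keeping load on the Muni website near what one human would cause.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetchAll(urls, fetchOne, delayMs = 40_000) {
  const out = [];
  for (const url of urls) {
    out.push(await fetchOne(url));
    await sleep(delayMs); // ~1.5 entries/minute at the default delay
  }
  return out;
}
```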
maxachis commented 9 months ago

Created a draft of this information. Best results are for Allegheny County, with substantial gaps in the Municipal Finance for a number of other counties. results.csv

josh-chamberlain commented 9 months ago

@maxachis thanks for sharing your thought process here.

  1. Most important is not taking the site down or disrupting access; defaulting to human-speed is sensible. Since this data is updated yearly, starting from Allegheny County and getting the rest at a slow trickle seems OK to me. I can anticipate practical limitations to doing this on a personal computer, but we should be well within the free tier of GitHub Actions if we want to set it up that way.
  2. The original requestor did this, and I have heard rumors of a past data dump. This time they have not been responsive.
  3. If you can see anything that you want to use, or that seems obviously useful, you could try that. In general, getting extra data in anticipation of future requests has not been worth the effort.

The results look great! My suggestion would be to submit the results with the code, so that the scraper's README contains:

maxachis commented 8 months ago

@josh-chamberlain Where should the results of this data be stored? HuggingFace? Airtable? Somewhere else?

josh-chamberlain commented 8 months ago

@maxachis unless the files are too big, I think we should just keep them in this repository. Self-contained, fewer moving parts. Thoughts?

maxachis commented 8 months ago

> @maxachis unless the files are too big, I think we should just keep them in this repository. Self-contained, fewer moving parts. Thoughts?

This is doable. Note that the recommended GitHub repo size is less than 5 gigabytes. I'm not sure what the repository size currently is, but regardless, we should be able to store the data here without adding too much. Assuming we were able to get data for every single municipality for all 15-ish years (which is a substantial if), the total would amount to around 40,000 rows. Probably not wise to download to your iPod Nano, but should be doable for the repo.

josh-chamberlain commented 5 months ago

@maxachis i remember your scraper working well; want to submit it and call this closed?

maxachis commented 5 months ago

@josh-chamberlain Can do! May take me a second while I work through other parts, though, unless you want me to put this at the head of the queue.

josh-chamberlain commented 5 months ago

@maxachis great! This isn't urgent, but it is a nice utility for anyone in the state. Even in the state you used it in, which might be "incomplete", it could be committed to the scrapers repo and used in the future; there's certainly worse/broken code there.