Open josh-chamberlain opened 9 months ago
Looking at the web sources, the problem can be broken down into three parts
Of these, 3 is fairly straightforward; 1 and 2 will carry most of the complexity. Selenium IDE would likely be one option for components requiring a user interface, but there might be easier ways to do it if UI interaction isn't necessary (perhaps by automating the form submissions?). I'd need to take a look at the most up-to-date libraries for scraping.
Additionally, if this information is provided through alternative means, such as an RSS feed, that's worth noting.
@maxachis did you find the line for budgets? Can't tell if you deleted your comment or if it's just not showing.
- head to https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm
- i picked allegheny county, braddock hills, and 2021
- page 3 of the report, revenues & expenditures, there's a public safety section—police $533,540
- for wilkinsburg, 2021, same thing $3,179,327
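Once a report is open, picking out that figure is mostly a matter of finding the "Police" row and parsing the dollar string. A minimal Python sketch, assuming the public safety section has already been read into (label, amount) pairs; the row layout and the function names are hypothetical, and only the two amounts come from the steps above:

```python
def parse_dollars(text):
    """Turn a string such as '$533,540' into the integer 533540.

    Assumes whole-dollar amounts; anything else returns None.
    """
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned.isdigit() else None


def find_police_amount(rows):
    """Return the amount on the first row labeled 'Police', or None.

    `rows` is a hypothetical list of (label, amount_text) pairs standing in
    for the public safety section of a report; the real layout may differ.
    """
    for label, amount_text in rows:
        if label.strip().lower() == "police":
            return parse_dollars(amount_text)
    return None


# Values quoted in this thread for 2021.
braddock_hills = [("Public Safety", ""), ("Police", "$533,540")]
wilkinsburg = [("Public Safety", ""), ("Police", "$3,179,327")]

print(find_police_amount(braddock_hills))
print(find_police_amount(wilkinsburg))
```

Returning `None` for a missing row keeps "no police line reported" distinct from a reported amount of zero.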
I was able to find it. I was looking at the wrong url 🙃. Deleted the message because it was just me making an error.
I've got code that, with a few more tweaks, I should be able to use to start pulling the data. I'll start off by just pulling the download urls for everything, rather than downloading the files directly. After that, I'll work on creating another script that will pull the data and process the relevant information.
Primary question for me is where all the data should ultimately go? I should be able to produce the csvs of the data without too much trouble, but where they'll be ultimately placed is another question.
@maxachis ok, great! Thank you for working on this. For now, let's just have them go in the same directory as the scraper, in a "data" folder or something. We can use GitHub Actions to run them, if we ever need to, or just manually update and make a PR.
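A minimal Python sketch of that layout, assuming one CSV written into a "data" folder next to the scraper. The file name and the column names are placeholders, not something decided in this thread:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical column names; the real output may use different ones.
FIELDS = ["county", "municipality", "year", "police_expenditure"]


def write_results(rows, scraper_dir, filename="police_budgets.csv"):
    """Write rows (a list of dicts) to <scraper_dir>/data/<filename>."""
    data_dir = Path(scraper_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first run
    out_path = data_dir / filename
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


# Usage: a temporary directory stands in for the scraper's own directory.
with tempfile.TemporaryDirectory() as scraper_dir:
    path = write_results(
        [{"county": "Allegheny", "municipality": "Braddock Hills",
          "year": 2021, "police_expenditure": 533540}],
        scraper_dir,
    )
    print(path.relative_to(scraper_dir))
```

Because the output is a plain file in the repo, a manual run followed by a PR and a scheduled GitHub Actions run would both update it the same way.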
Additionally, I note that a lot of the Municipal Finance data includes a wealth of information, including on sources of income relating to the police, such as Fines and Forfeits, and "Public Safety Charges for Service". That's outside the scope of this task, but it seems like it could be valuable intel, and it would be possible to extend or modify this scraper to gather that information.
@maxachis yeah, agreed—ton of good stuff there. if you make an issue with your ideas to extend this scraper, someone could do the enhancement!
how's this going, btw? need anything?
So far progressing! I've developed alpha-level code that can iterate through and download the options for Source 1, a script for finding the relevant data in the downloaded excel files, and a SQLite database for caching the attempts (so a person can retain progress if the scraper halts mid-process) and storing some of the relevant information. Still need to sort out kinks, and eventually expand to the other Sources. Helpfully, those other Sources don't seem as demanding as Source 1.
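For anyone curious how the caching idea works, here is a rough Python sketch using the standard `sqlite3` module. It is not the schema in my repo: the table name, the columns, and the `done` status value are made up for illustration.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache; pass a file path to persist across runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS attempts (
               county TEXT NOT NULL,
               municipality TEXT NOT NULL,
               year INTEGER NOT NULL,
               status TEXT NOT NULL,
               PRIMARY KEY (county, municipality, year)
           )"""
    )
    return conn


def record_attempt(conn, county, municipality, year, status):
    """Store the outcome of one attempt, overwriting any earlier outcome."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO attempts VALUES (?, ?, ?, ?)",
            (county, municipality, year, status),
        )


def pending(conn, candidates):
    """Return the (county, municipality, year) tuples not yet marked done."""
    done = set(conn.execute(
        "SELECT county, municipality, year FROM attempts WHERE status = 'done'"
    ))
    return [c for c in candidates if c not in done]


conn = open_cache()
record_attempt(conn, "Allegheny", "Braddock Hills", 2021, "done")
todo = pending(conn, [("Allegheny", "Braddock Hills", 2021),
                      ("Allegheny", "Wilkinsburg", 2021)])
print(todo)
```

After a crash, calling `pending` against the same database file skips everything already marked done, so the run resumes where it stopped.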
You can track the current status of my scraper here: https://github.com/maxachis/pa_municipal_scraper
Scraper for Source 1 is progressing apace. I now have it running continuously.
The primary bottleneck is the network speed of the Municipal website, which is slow to respond. I'm able to work around this by utilizing Node.js's concurrent processing, essentially having multiple webscrapers operating at a time (currently 10). In theory, I could increase the number of webscrapers, but each webscraper costs memory, and I only have so much computer. Plus, and perhaps I'm being overly cautious, I'm not sure how many concurrent requests this government website can handle. In theory, even 100 of my scrapers shouldn't pose a problem for it, but I don't know how brittle the backend is.
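The "N scrapers at a time" idea is a bounded worker pool. The actual scraper runs on Node.js; the following is only a Python `asyncio` sketch of the same pattern, with a simulated fetch standing in for the real page interaction (`scrape_all` and `simulated_fetch` are hypothetical names):

```python
import asyncio


async def scrape_all(entries, fetch, max_workers=10):
    """Run fetch(entry) for every entry, at most max_workers at a time."""
    semaphore = asyncio.Semaphore(max_workers)

    async def worker(entry):
        async with semaphore:  # blocks while max_workers fetches are in flight
            return await fetch(entry)

    # gather returns results in the same order as the input entries
    return await asyncio.gather(*(worker(e) for e in entries))


# Stand-in for a slow page load; tracks how many fetches overlap.
stats = {"active": 0, "peak": 0}


async def simulated_fetch(entry):
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # pretend the site is slow to respond
    stats["active"] -= 1
    return f"report-{entry}"


results = asyncio.run(scrape_all(list(range(25)), simulated_fetch, max_workers=10))
print(len(results), "fetched; peak concurrency:", stats["peak"])
```

Raising `max_workers` is the one knob for going faster, which makes it easy to stay conservative about the load placed on the site.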
Still, I can currently process around 50 entries per minute, and that's probably a conservative estimate. At that rate, assuming no interruptions (which is an assumption), processing would be done within 14 hours of continuous operation, AKA I could easily have it done by the next Friday meeting, and likely much sooner. I've currently got a little over 5,800 of approximately 41,000 possible entries scraped, including the majority of Allegheny County (I say majority only because some entries are not available).
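The arithmetic behind that estimate, as a quick Python check using only the figures above (the function name is just for illustration):

```python
def hours_needed(entries, entries_per_minute):
    """Hours of continuous scraping needed at a steady rate."""
    return entries / entries_per_minute / 60


TOTAL_ENTRIES = 41_000  # approximate number of possible entries
SCRAPED = 5_800         # entries already collected
RATE = 50               # entries per minute

full_run = hours_needed(TOTAL_ENTRIES, RATE)
remaining = hours_needed(TOTAL_ENTRIES - SCRAPED, RATE)

print(f"full run:  {full_run:.1f} h")   # 13.7 h
print(f"remaining: {remaining:.1f} h")  # 11.7 h
```

So the 14-hour figure covers a full run from scratch; with what is already scraped, a little under 12 hours remain at this rate.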
I've done spot checking to validate I'm pulling the correct data, but it's possible there are other errors I'll only discover later, which would of course necessitate rerunning some or all of the code.
One note on data integrity: Some of these financial reports do not indicate any police expenditures. By my count, approximately 26% of what I've processed overall (and happily only 3% of Allegheny county) do not report any police expenditures. Why this is, I don't know.
Sources 2 and 3 should be considerably easier. I should be able to pull both through a single request, and after that it's just a matter of parsing the Excel files.
I ran into an instance where the rate at which I was able to download data slowed considerably between Saturday and Sunday. While I'm not sure how plausible it is that they detected that bots were hoovering up their data, it did raise the question of how to ethically scrape the website. From what I can tell, a decent rule of thumb is to pull data only about as fast as a human user could. Unfortunately, that does mean that pulling additional data will take longer than I had planned.
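One way to enforce a human-like pace is a minimum interval between requests. A rough Python sketch follows; the 5-second interval is my own placeholder rather than a measured human speed, and the injectable clock and sleep are there only so the behavior can be checked without actually waiting:

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # pause only for the time still owed
                now = self._clock()
        self._last = now


# Demonstration with a fake clock, so nothing really sleeps.
fake_now = [0.0]
pauses = []


def fake_sleep(seconds):
    pauses.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(5.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()      # first request goes out immediately
throttle.wait()      # second request waits the full interval
fake_now[0] += 2.0   # two seconds of processing between requests
throttle.wait()      # third request waits only the remainder
print(pauses)
```

In the real scraper, `throttle.wait()` would be called right before each request, with the default clock and sleep.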
Fortunately, I do have the majority of the Allegheny County information, and can move forward with that, while still running the scraper continuously to trickle in the remaining data. But this does bring up a few additional questions:
Created a draft of this information. Best results are for Allegheny County, with substantial gaps in the Municipal Finance for a number of other counties. results.csv
@maxachis thanks for sharing your thought process here.
The results look great! My suggestion would be to submit the results with the code, so that the scraper's README contains:
@josh-chamberlain Where should the results of this data be stored? HuggingFace? Airtable? Somewhere else?
@maxachis unless the files are too big, I think we should just keep them in this repository. Self-contained, fewer moving parts. Thoughts?
This is doable. Note that the recommended GitHub repo size is less than 5 GB. Unclear what the repository size is currently, but regardless, we should be able to store the data here without adding too much. Assuming we were able to get data for every single municipality for all 15-ish years (which is a substantial if), the total number of rows would be around 40,000. Probably not wise to download to your iPod Nano, but should be doable for the repo.
@maxachis i remember your scraper working well; want to submit it and call this closed?
@josh-chamberlain Can do! May take me a second while I work through other parts, though, unless you want me to put this at the head of the queue.
@maxachis great! This isn't urgent, but it is a nice utility for anyone in the state. Even in the state you used it, which might be "incomplete", it could be committed to the scrapers repo and used in the future; there's certainly worse/broken code there.
Context
Related to data source request 102
Pennsylvania publishes municipal, county, and state budgets. It's possible to find individual municipal budgets, which include police budgets, but cumbersome to get a bunch at once. Let's make a scraper which can be run to iterate through the interface and collect them all. Each municipality has its own police force.
Source 1. Municipal and police budget: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=mAfrForm
Source 2. Police details: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=MuniPolice_Excel
Source 3. Municipal demographics: https://munstats.pa.gov/Reports/ReportInformation2.aspx?report=CountyMuniDemo_Excel
Requirements
Example
Here's a sample from a manually generated document from ~2020: