Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Scraper index #208

Status: Closed (EvilDrPurple closed 11 months ago)

EvilDrPurple commented 1 year ago

For Police-Data-Accessibility-Project/data-source-identification#11

I have a few comments and questions before merging:

  1. I opted for a different style than the one mentioned in the original issue: instead of a markdown table or CSV, I structured the index using collapsible menus. You can check the markdown file below to see how it looks. I did this because I thought it would be more readable, and it would be easier to find particular scrapers for each state than scrolling through potentially hundreds of table rows. If you think a table or CSV would be better, we can convert it. We could also add sections to group by county and municipality. Example of the current layout:

    In this repo
    CA
    San Bernardino County Officer Involved Shootings  (Scraper info)
  2. I wasn't quite sure how to go about integrating and testing with Airtable and Actions to pull directly from the database, so for now I just downloaded a CSV of the dataset and used that. Some help with this would be appreciated :)

  3. I also wasn't sure where to put the script and the resulting file, so both are in the repo root at the moment. Some guidance on where to place them would be good.

  4. There are only two scrapers in the database that gave a link to this repo (and the links seem to go to strange locations in it), so I'm guessing this means the database just needs to be updated to add the scraper_url to the missing scrapers
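
For illustration only, the collapsible per-state layout described in item 1 could be generated along these lines. This is a minimal sketch, not the script from this PR: the function name `render_collapsible_index`, the tuple layout, and the example URL are all assumptions made for the example.

```python
from collections import defaultdict


def render_collapsible_index(scrapers):
    """Group (state, name, url) tuples by state and emit one <details> block per state.

    The tuple layout is a hypothetical input format chosen for this sketch.
    """
    by_state = defaultdict(list)
    for state, name, url in scrapers:
        by_state[state].append((name, url))

    lines = []
    for state in sorted(by_state):
        lines.append("<details>")
        lines.append(f"<summary>{state}</summary>")
        lines.append("")  # blank line so the markdown list inside the block renders
        for name, url in sorted(by_state[state]):
            lines.append(f"- [{name}]({url})")
        lines.append("")
        lines.append("</details>")
    return "\n".join(lines)


# Placeholder URL; the real scraper links live in the database.
example = [
    ("CA", "San Bernardino County Officer Involved Shootings", "https://example.com/scraper"),
]
print(render_collapsible_index(example))
```

Grouping by county or municipality, as suggested in item 1, would mean nesting another `<details>` level inside each state block.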

josh-chamberlain commented 1 year ago

Thanks for looking at this! I can respond to some of your initial questions tomorrow and we'll move it along.

EvilDrPurple commented 1 year ago

Alright, I moved the files and changed the markdown to a table instead. I ended up scrapping the collapsible menus in favor of a couple of section links at the top. I did look at the mirror previously, but for some reason some columns were not being returned in the CSV. In particular, the scraper_url column is not part of the mirrored data, and the script needs it to function. Is this a bug that needs to be fixed in that repo?
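
To make the dependency on scraper_url concrete, here is a rough sketch of turning a CSV export into a markdown table. It is not the script from this PR. Only the `scraper_url` column name comes from the discussion; the `name` and `state` columns, the function name, and the table headings are assumptions.

```python
import csv
import io


def csv_to_markdown_table(csv_text):
    """Build a markdown table from CSV text, keeping only rows that have a scraper_url.

    Assumes hypothetical `name` and `state` columns alongside `scraper_url`.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # Fail early if the export is missing the column, as happened with the mirror.
    if "scraper_url" not in (reader.fieldnames or []):
        raise ValueError("CSV has no scraper_url column")

    rows = [row for row in reader if (row.get("scraper_url") or "").strip()]

    lines = ["| State | Name | Scraper |", "| --- | --- | --- |"]
    for row in sorted(rows, key=lambda r: (r["state"], r["name"])):
        lines.append(f"| {row['state']} | {row['name']} | [link]({row['scraper_url']}) |")
    return "\n".join(lines)
```

Raising an error when the column is absent makes an incomplete export obvious instead of silently producing an empty index.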

josh-chamberlain commented 1 year ago

@EvilDrPurple Thanks for making these changes! Dang, yes, that mirror repo is not as complete as I had thought. Someone is working on a Postgres database and a better mirror now; would you mind holding tight on this for a week or two? If you win the lottery and stop paying attention to this project, we can still merge your work so you get commit credit!

EvilDrPurple commented 1 year ago

@josh-chamberlain Yeah sure, not a problem!

josh-chamberlain commented 1 year ago

An update: we have the database set up. Rather than running this from the database directly, we'll publish an API endpoint which makes data sources available. Here's the issue.

EvilDrPurple commented 1 year ago

@josh-chamberlain I discovered scraper_list.md. Will the scraper index be replacing this or are we keeping it? It appears to have been manually maintained a long time ago.

josh-chamberlain commented 1 year ago

@EvilDrPurple I'd say get rid of it as part of your PR—but we may want to make sure the data sources being scraped are represented in our database. If you want, you could import a CSV with these URLs or check to see if they're in the db. The new scraper index will permanently make the db the source of truth for where scrapers are, and then the index you're making will get its info from the db. No more manual updating.

mbodeantor commented 1 year ago

@EvilDrPurple I changed the source for data sources to the prod API. Could you double check my work?

EvilDrPurple commented 1 year ago

@mbodeantor I was able to get it working, but I had to add load_dotenv() back for it to work on my end. Is your environment set up in some way that lets it work without it?

Also, I wasn't sure of your thoughts, but do you think we should remove the quotes and brackets so the state, county, and municipality look a little nicer?

mbodeantor commented 1 year ago

@EvilDrPurple I just run export PDAP_API_KEY="key" from the command line for all the variables locally. They are stored in GitHub as environment variables, so there's no need for the .env file on that side.
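
One way to support both setups (exported variables and a local .env file) is sketched below. It assumes the python-dotenv package is what provides `load_dotenv()`; the helper name `get_api_key` is hypothetical and not part of the PR.

```python
import os


def get_api_key(var_name="PDAP_API_KEY"):
    """Return the API key from the environment, loading a .env file first if possible."""
    try:
        # python-dotenv is optional here; by default load_dotenv() does not
        # override variables that are already set in the environment.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # no python-dotenv installed: rely on exported variables only

    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"{var_name} is not set; export it or add it to a .env file")
    return key
```

With this shape, a contributor with a .env file and a CI run with variables already set in the environment go through the same code path.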

mbodeantor commented 1 year ago

For sure, let's remove the list brackets and just display the strings inside
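
A possible way to do this is sketched below. It assumes the API returns these fields as list-like strings such as `["CA"]` or `['Allegheny', 'Butler']`; that format is a guess based on the "quotes and brackets" described above, and `display_value` is a hypothetical helper name.

```python
import ast


def display_value(raw):
    """Turn a list-like string into a plain comma-separated string for display."""
    if raw is None:
        return ""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python/JSON-style literal, e.g. a plain name: show it unchanged.
        return raw
    if isinstance(parsed, (list, tuple)):
        return ", ".join(str(item) for item in parsed)
    return raw


print(display_value('["CA"]'))
print(display_value("['Allegheny', 'Butler']"))
print(display_value("San Bernardino"))
```

Parsing the value rather than stripping characters avoids mangling names that themselves contain quotes or brackets.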

EvilDrPurple commented 1 year ago

@mbodeantor should be all ready to go!

mbodeantor commented 12 months ago

@josh-chamberlain I think you need to resolve your requested changes for this to close