Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Sitemap scraper #195

Status: Closed (nfmcclure closed 1 year ago)

nfmcclure commented 1 year ago

Added files, readme, and sample URL list for sitemap scraper.

josh-chamberlain commented 1 year ago

Thanks for the submission @nfmcclure! I'm testing this now but I'm stuck:

(venv) $ scrapy crawl sitemapspider
Scrapy 2.7.1 - no active project

Unknown command: crawl

Anything else I should do after setting up the venv?

nfmcclure commented 1 year ago

Yup, sorry. Two changes were needed:

  1. Added scrapy to requirements.txt.
  2. Change directory to sitemap before running scrapy crawl sitemapspider.

Let me know if that helps.
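
For context on the second change: Scrapy only exposes project commands such as crawl when it finds a scrapy.cfg file in the current directory or one of its parents, which is why the run above reported "no active project". Below is a minimal sketch of that lookup using only the standard library. The helper name find_scrapy_cfg is hypothetical, and the sketch assumes the sitemap directory is where this project's scrapy.cfg lives.

```python
from pathlib import Path
from typing import Optional


def find_scrapy_cfg(start: Path) -> Optional[Path]:
    """Return the nearest scrapy.cfg at or above `start`, or None if there is none.

    Hypothetical helper: it mirrors the idea of Scrapy's project discovery,
    it is not Scrapy's own code.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    cfg = find_scrapy_cfg(Path.cwd())
    if cfg is None:
        print("No scrapy.cfg found: 'scrapy crawl' will not be available here.")
    else:
        print(f"Scrapy project root: {cfg.parent}")
```

Running this from the directory where scrapy crawl fails should show whether the shell is simply outside the project directory.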

Edit: OK, re-tested in a completely new venv and fixed some requirements.

josh-chamberlain commented 1 year ago

[Image: Screen Shot 2022-11-10 at 11 59 08 AM]

This is weird, but still not working! I can try more intensively later 🤷

nfmcclure commented 1 year ago

That is strange! I haven't run into that before. But it seems this can happen, as discussed here:

https://stackoverflow.com/questions/45345377/python-module-not-found-even-though-requirement-already-satisfied-in-pip

josh-chamberlain commented 1 year ago

@nfmcclure cool. From the venv, uninstalling and reinstalling xmltodict (which wasn't installed outside the venv) worked, and I was able to get this to run! I'll wait for @thejqs to approve, but LGTM.
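
When pip reports a requirement as satisfied but the import still fails, one common cause is that pip and the running interpreter belong to different environments. The sketch below is a hedged diagnostic, not something from this PR: module_status is a hypothetical helper that reports which interpreter is running, whether it looks like a venv, and whether a given top-level module (xmltodict here) can be found by that interpreter.

```python
import importlib.util
import sys


def module_status(name: str) -> dict:
    """Describe whether top-level module `name` is importable by this interpreter.

    Hypothetical helper for illustration; pass a top-level module name only.
    """
    spec = importlib.util.find_spec(name)  # None when the module cannot be found
    return {
        "module": name,
        "interpreter": sys.executable,
        # Inside a venv, sys.prefix points at the venv and differs from base_prefix.
        "in_venv": sys.prefix != sys.base_prefix,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    for key, value in module_status("xmltodict").items():
        print(f"{key}: {value}")
```

If "found" is False while pip says the package is installed, running the install as python -m pip install xmltodict with the same interpreter shown in the output ties the install to that environment.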

josh-chamberlain commented 1 year ago

Related to https://github.com/Police-Data-Accessibility-Project/data-source-identification/issues/11

josh-chamberlain commented 1 year ago

I used this for some Southeast Arkansas agencies with mixed success:

attempt 1: sample_host_sites.txt 20230131_122648_output.csv

attempt 2: sample_host_sites.txt 20230131_123505_output.csv