codeforsanjose / OpenDSJ-2018

Inform voters about 2018 San José, California and local candidates' campaign finance.
MIT License

Selenium Webscraper (ipynb & py files commit) #67

anniejstein closed 5 years ago

anniejstein commented 5 years ago

The scraper grabs both Excel and PDF files from the SouthTech hosting site. It automatically creates a directory for the collected files, then creates additional directories to group them by candidate/platform (including type of election, data, and title). Expect scraping to take roughly 2-3 hours (possibly more).
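For anyone reading along, the directory grouping could look roughly like the sketch below. This is not the code in the commit; `safe_name`, `filing_dir`, and the candidate / election type / title nesting order are hypothetical choices made for illustration.

```python
import re
from pathlib import Path


def safe_name(text: str) -> str:
    """Replace runs of characters that are awkward in file names with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", text.strip())
    return cleaned.strip("_") or "unnamed"


def filing_dir(root: Path, candidate: str, election_type: str, title: str) -> Path:
    """Create (if needed) and return root/candidate/election_type/title.

    Hypothetical layout: the real scraper may order or name levels differently.
    """
    target = root / safe_name(candidate) / safe_name(election_type) / safe_name(title)
    target.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return target
```

Calling `filing_dir(Path("downloads"), "Jane Doe", "Primary", "Form 460 (Q1)")` would create `downloads/Jane_Doe/Primary/Form_460_Q1`, so repeated runs reuse the same folders instead of failing on existing ones.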

PDFs and Excel files are currently not labeled. The label information can be scraped eventually, but the files need to be renamed after the download is complete. Still to do: figure out a way to reduce the scraper's run time (improve speed). I would also like to make the scraper more adaptable to different SouthTech hosting sites (I left comments in the Jupyter notebook with detailed suggestions).
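One possible way to handle the rename-after-download step is sketched below, assuming the browser writes into a known download directory and marks in-progress files with a temporary suffix (Chrome uses `.crdownload`, Firefox uses `.part`). The helper names `wait_for_new_file` and `rename_download` are hypothetical and not part of the notebook.

```python
import time
from pathlib import Path

# Suffixes browsers give to files that are still downloading
PARTIAL_SUFFIXES = {".crdownload", ".part", ".tmp"}


def wait_for_new_file(download_dir: Path, before: set,
                      timeout: float = 60.0, poll: float = 0.5) -> Path:
    """Poll until a finished file that was not in `before` appears."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new = {p for p in download_dir.iterdir() if p.is_file()} - before
        partial = [p for p in new if p.suffix in PARTIAL_SUFFIXES]
        finished = [p for p in new if p.suffix not in PARTIAL_SUFFIXES]
        if finished and not partial:
            # Newest finished file wins if several appeared at once
            return max(finished, key=lambda p: p.stat().st_mtime)
        time.sleep(poll)
    raise TimeoutError(f"no completed download in {download_dir}")


def rename_download(path: Path, label: str) -> Path:
    """Rename `path` to `label` plus its original extension, avoiding clashes."""
    target = path.with_name(f"{label}{path.suffix}")
    counter = 1
    while target.exists():
        target = path.with_name(f"{label}_{counter}{path.suffix}")
        counter += 1
    path.rename(target)
    return target
```

In use, snapshot the directory with `before = set(download_dir.iterdir())` just before clicking a download link, then call `wait_for_new_file(download_dir, before)` and pass the result to `rename_download` with a label built from the scraped candidate and filing details.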