MrDiggles2 / cru-scrape

Scraper of CRU sites

Add a script to spit out all combinations of URL and year required #11

Closed MrDiggles2 closed 3 weeks ago

MrDiggles2 commented 1 month ago

For every site in public.sites, use start_year and end_year to get all combinations of site ID, URL and year to scrape.

Only "Olympic" years should be pulled. If a site's timeframe is too short to contain an Olympic year, we should pull the closest year.

"Olympic" years (years where year % 4 == 0) are used both to cut down the amount of scraping required and to track changes due to election cycles.

The script should work something like this:

$> poetry run main get-all-combinations

2000 a458af45-bc38-4fa7-b4b7-5f1d4979fd2d http://www.forestry.auburn.edu/
2004 a458af45-bc38-4fa7-b4b7-5f1d4979fd2d http://www.forestry.auburn.edu/
2008 a458af45-bc38-4fa7-b4b7-5f1d4979fd2d http://www.forestry.auburn.edu/
2016 87ed2093-223c-4796-8f65-4d66f5e2aeb6 http://www.clemson.edu/cafls/departments/fec/index.html
2020 87ed2093-223c-4796-8f65-4d66f5e2aeb6 http://www.clemson.edu/cafls/departments/fec/index.html
...
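
A minimal sketch of how the year selection and the output lines might be produced, not the actual implementation. It assumes each row from public.sites arrives as a `(site_id, url, start_year, end_year)` tuple (the tuple shape and the function names `olympic_years` and `combination_lines` are hypothetical), and it reads "closest year" as the multiple of 4 nearest to the range, with ties going to the earlier year. The year ranges in the sample rows are made up for illustration.

```python
from typing import Iterable, Iterator

# Hypothetical row shape for public.sites: (site_id, url, start_year, end_year).
SiteRow = tuple[str, str, int, int]


def olympic_years(start_year: int, end_year: int) -> list[int]:
    """Return the years in [start_year, end_year] that are divisible by 4.

    Assumption: if the range contains no such year, fall back to the single
    multiple of 4 nearest to the range (ties go to the earlier year).
    """
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")

    first = start_year + (-start_year % 4)  # first multiple of 4 >= start_year
    years = list(range(first, end_year + 1, 4))
    if years:
        return years

    # No Olympic year inside the range: pick the nearest one outside it.
    before = start_year - (start_year % 4)  # last multiple of 4 < start_year
    after = first                           # first multiple of 4 > end_year
    return [before if start_year - before <= after - end_year else after]


def combination_lines(sites: Iterable[SiteRow]) -> Iterator[str]:
    """Yield one 'year site_id url' line per selected year of each site."""
    for site_id, url, start_year, end_year in sites:
        for year in olympic_years(start_year, end_year):
            yield f"{year} {site_id} {url}"


# Illustrative rows only; the start/end years are invented for this example.
rows: list[SiteRow] = [
    ("a458af45-bc38-4fa7-b4b7-5f1d4979fd2d",
     "http://www.forestry.auburn.edu/", 1999, 2010),
    ("87ed2093-223c-4796-8f65-4d66f5e2aeb6",
     "http://www.clemson.edu/cafls/departments/fec/index.html", 2016, 2020),
]

for line in combination_lines(rows):
    print(line)
```

Keeping the year selection in its own function means the fallback rule can be changed later (for example, to pick the closest year inside the range instead) without touching the code that formats the output.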

See src/commands/upload-organizations.py and src/utils/psql.py for examples of how to connect to the DB.

MrDiggles2 commented 1 month ago

This will be the basis of the tasks enqueued into https://github.com/MrDiggles2/cru-scrape/issues/19