MattTriano / analytics_data_where_house

An analytics engineering sandbox focusing on real estate prices in Cook County, IL
https://docs.analytics-data-where-house.dev/
GNU Affero General Public License v3.0

Develop a census data intake pipeline using the newly refactored flow #111

Open MattTriano opened 1 year ago

MattTriano commented 1 year ago

Try to match the task grouping from issue #107

MattTriano commented 1 year ago

Regarding implementation of the metadata collector, I think I can achieve this by recursively scraping the "file system" descended from the root node (https://www2.census.gov/). In the data structure on my end, I think I'll create a tree where the id of each node instance is the URL and the node class has methods to:

The census_metadata table will need columns:

I don't want to automatically pull down all of the census data as that's a gigantic volume of data, and there are a lot of different file_types (most of which won't be handled until a need for the data or content arises).
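A minimal sketch of the URL-keyed tree described above, assuming directory URLs end in a trailing slash. The names `UrlNode`, `expand`, `walk`, and the `list_links` callable (which would fetch a page and return its hrefs) are hypothetical and only illustrate the idea; they are not the methods actually planned for the node class. It records URLs only and downloads no data files.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterator
from urllib.parse import urljoin


@dataclass
class UrlNode:
    """One directory or file in the scraped tree; the URL is the node id."""

    url: str
    children: list[UrlNode] = field(default_factory=list)

    @property
    def is_dir(self) -> bool:
        # Assumption: directory listings are the URLs ending in "/".
        return self.url.endswith("/")

    def expand(self, list_links: Callable[[str], list[str]]) -> None:
        """Recursively attach child nodes, depth first."""
        if not self.is_dir:
            return
        for href in list_links(self.url):
            child_url = urljoin(self.url, href)
            # Skip parent/self links so the recursion only descends.
            if child_url == self.url or not child_url.startswith(self.url):
                continue
            child = UrlNode(child_url)
            self.children.append(child)
            child.expand(list_links)

    def walk(self) -> Iterator[UrlNode]:
        """Yield this node and every descendant in depth-first order."""
        yield self
        for child in self.children:
            yield from child.walk()
```

Passing `list_links` in as a callable keeps the tree logic separate from the HTTP and HTML-parsing details, so the traversal can be exercised against a fake listing before it is pointed at the real server.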

MattTriano commented 1 year ago

My scraper for www2.census.gov metadata has been running since yesterday (I put in a 0.75 * random(0,1) + 0.5 second sleep between requests to avoid just hammering their server) and I've collected over 1.15M URL endpoints, and it looks like I'm still a long way from scraping everything. Still, it shouldn't take anywhere near this long in the future.
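For reference, the delay described above could look something like this, assuming `random(0,1)` means a uniform draw from [0, 1). The function names are made up for illustration.

```python
import random
import time


def polite_delay() -> float:
    """Return a wait time in seconds, uniform over [0.5, 1.25)."""
    return 0.75 * random.random() + 0.5


def sleep_between_requests() -> None:
    """Pause before the next request so the server isn't hammered."""
    time.sleep(polite_delay())
```

At an average of roughly 0.875 seconds per request, the sleeps alone account for about 280 hours across 1.15M requests, which is consistent with a run lasting many days.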

I might want to use the metadata side only to short-circuit unnecessary data updates, and use the census APIs (here's a json dict of endpoints) as the data source rather than the endpoints from the metadata-side URLs. I'll have to see how the data is formatted and organized, but using the APIs might allow me to sidestep a lot of explicit combining of files from the metadata-side endpoints.
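The short-circuit idea boils down to a timestamp comparison. A rough sketch, assuming a last-modified timestamp is available both from the source and from the most recent local pull; `needs_update` and its parameter names are hypothetical:

```python
from datetime import datetime
from typing import Optional


def needs_update(
    source_modified: datetime, local_modified: Optional[datetime]
) -> bool:
    """Return True when the source is newer than the local copy.

    A missing local timestamp means the data set was never pulled,
    so it always needs an update.
    """
    if local_modified is None:
        return True
    return source_modified > local_modified
```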

MattTriano commented 1 year ago

Oh hey, the API json data menu (https://api.census.gov/data.json, .html and .xml also work) includes a modified field that contains datelike values. That would probably be much easier to work with than the stuff I built yesterday (although that work still has a lot of value as it will produce a full list of the pdfs documenting these data sets).
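A rough sketch of pulling those `modified` values out of the already-parsed menu. It assumes the JSON has a top-level `dataset` list whose entries carry `identifier` and `modified` keys, and that each `modified` value starts with a `YYYY-MM-DD` date; both are assumptions to check against the real payload, and `modified_by_identifier` is a made-up name.

```python
from datetime import date
from typing import Any


def modified_by_identifier(menu: dict[str, Any]) -> dict[str, date]:
    """Map each data set identifier to the date in its `modified` field."""
    result: dict[str, date] = {}
    for entry in menu.get("dataset", []):
        identifier = entry.get("identifier")
        raw = entry.get("modified")
        if not identifier or not raw:
            continue
        try:
            # Keep only the leading YYYY-MM-DD part; ignore any time suffix.
            result[identifier] = date.fromisoformat(str(raw)[:10])
        except ValueError:
            # Skip values that are not actually date-like.
            continue
    return result
```

The input would be the result of `json.load` on the downloaded menu. Entries whose `modified` value can't be parsed are skipped rather than raising an error, so one odd record doesn't stop the whole pass.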

MattTriano commented 1 year ago

The scraper ran through the week and I cut it off this morning. It still has the following URLs left to scrape (it has been scraping in a depth-first search pattern):

[ 'https://www2.census.gov/2020Census', 'https://www2.census.gov/EEO_2006_2010', 'https://www2.census.gov/EEO_2014_2018', 'https://www2.census.gov/EEO_Disability_2008-2010', 'https://www2.census.gov/Econ2001_And_Earlier', 'https://www2.census.gov/about', 'https://www2.census.gov/acs', 'https://www2.census.gov/acs2002', 'https://www2.census.gov/acs2003', 'https://www2.census.gov/acs2004', 'https://www2.census.gov/acs2005', 'https://www2.census.gov/acs2005_2007_3yr', 'https://www2.census.gov/acs2005_2009_5yr', 'https://www2.census.gov/acs2006', 'https://www2.census.gov/acs2006_2008_3yr', 'https://www2.census.gov/acs2007_1yr', 'https://www2.census.gov/acs2007_2009_3yr', 'https://www2.census.gov/acs2007_3yr', 'https://www2.census.gov/acs2008_1yr', 'https://www2.census.gov/acs2008_3yr', 'https://www2.census.gov/acs2009_1yr', 'https://www2.census.gov/acs2009_3yr', 'https://www2.census.gov/acs2009_5yr', 'https://www2.census.gov/acs2010_1yr', 'https://www2.census.gov/acs2010_3yr', 'https://www2.census.gov/acs2010_5yr', 'https://www2.census.gov/acs2010_SPT_AIAN', 'https://www2.census.gov/acs2011_1yr', 'https://www2.census.gov/acs2011_3yr', 'https://www2.census.gov/acs2011_5yr', 'https://www2.census.gov/acs2012_1yr', 'https://www2.census.gov/acs2012_3yr', 'https://www2.census.gov/acs2012_5yr', 'https://www2.census.gov/acs2013_1yr', 'https://www2.census.gov/acs2013_3yr', 'https://www2.census.gov/acs2013_5yr', 'https://www2.census.gov/acs_latest_data', 'https://www2.census.gov/acs_special_tabs', 'https://www2.census.gov/adrm', 'https://www2.census.gov/cac', 'https://www2.census.gov/census_1940', 'https://www2.census.gov/census_1980', 'https://www2.census.gov/census_1990', 'https://www2.census.gov/census_2000', 'https://www2.census.gov/census_2010', 'https://www2.census.gov/ces', 'https://www2.census.gov/data', 'https://www2.census.gov/decennial', 'https://www2.census.gov/desen002', 'https://www2.census.gov/dssd', 'https://www2.census.gov/econ', 
'https://www2.census.gov/econ1977', 'https://www2.census.gov/econ1982', 'https://www2.census.gov/econ1987', 'https://www2.census.gov/econ1992', 'https://www2.census.gov/econ1997', 'https://www2.census.gov/econ2002', 'https://www2.census.gov/econ2003', 'https://www2.census.gov/econ2004', 'https://www2.census.gov/econ2005', 'https://www2.census.gov/econ2006', 'https://www2.census.gov/econ2007', 'https://www2.census.gov/econ2008', 'https://www2.census.gov/econ2009', 'https://www2.census.gov/econ2010', 'https://www2.census.gov/econ2011', 'https://www2.census.gov/econ2012', 'https://www2.census.gov/econ2013', 'https://www2.census.gov/econ2014', 'https://www2.census.gov/econ2015', 'https://www2.census.gov/econ2016', 'https://www2.census.gov/econ2017', 'https://www2.census.gov/foia', 'https://www2.census.gov/geo/maps/DC2010', 'https://www2.census.gov/geo/maps/DC2020/ACO20', 'https://www2.census.gov/geo/maps/DC2020/AIANWall2020', 'https://www2.census.gov/geo/maps/DC2020/DC20BLK', 'https://www2.census.gov/geo/maps/DC2020/IFAC', 'https://www2.census.gov/geo/maps/DC2020/MCS', 'https://www2.census.gov/geo/maps/DC2020/PL20', 'https://www2.census.gov/geo/maps/DC2020/PL20Proto', 'https://www2.census.gov/geo/maps/DC2020/PSAPV', 'https://www2.census.gov/geo/maps/DC2020/PUMA', 'https://www2.census.gov/geo/maps/DC2020/PopCenter', 'https://www2.census.gov/geo/maps/DC2020/PopDist_Nighttime', 'https://www2.census.gov/geo/maps/DC2020/SLD_RefMap', 'https://www2.census.gov/geo/maps/DC2020/SR20', 'https://www2.census.gov/geo/maps/DC2020/TEA' ]