d4bl / COVID19_tracker_data_extraction

Data on Black communities is often not collected when it is needed the most. We have compiled a list of the states that have shared data on COVID-19 infections and deaths by race and those that have not. This project extracts that data from state websites to track disparities in COVID-19 deaths and cases for Black people.

Scraper doesn't run due to Census data unavailability #157

Open sydeaka opened 3 years ago

sydeaka commented 3 years ago

@nkrishnaswami

When I tried to run the scraper this evening, I got an error indicating that the 2018 US Census Excel file linked below no longer exists. The file also fails to load when I paste the URL directly into a web browser.

https://www2.census.gov/programs-surveys/popest/geographies/2018/all-geocodes-v2018.xlsx

Unfortunately this means we must halt daily scraper runs until this is resolved.

Do we have a local copy saved? Or, alternatively, could we modify the scraper so that it continues to pull the data while ignoring the unavailable Census file?

The error message is provided below.

2020-10-14 21:35:36,120 INFO covid19_scrapers.web_cache:  Connecting web cache to DB: work/web_cache.db
Traceback (most recent call last):
  File "run_scrapers.py", line 189, in <module>
    main()
  File "run_scrapers.py", line 165, in main
    registry_args=dict(enable_beta_scrapers=opts.enable_beta_scrapers),
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/__init__.py", line 61, in make_scraper_registry
    census_api = CensusApi(census_api_key)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/census/census_api.py", line 31, in __init__
    self.fips = FipsLookup()
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/census/fips_lookup.py", line 22, in __init__
    df = pd.read_excel(get_content_as_file(self.CODES_URL), skiprows=4)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/utils/http.py", line 100, in get_content_as_file
    return BytesIO(get_content(url, **kwargs))
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/utils/http.py", line 94, in get_content
    r = get_cached_url(url, **kwargs)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/utils/http.py", line 59, in get_cached_url
    return UTILS_WEB_CACHE.fetch(url, **kwargs)
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/web_cache.py", line 263, in fetch
    response.raise_for_status()
  File "/Users/poisson/Documents/GitHub/COVID19_tracker_data_extraction/covid19_data_test_003/lib/python3.7/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://www2.census.gov/programs-surveys/popest/geographies/2018/all-geocodes-v2018.xlsx
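
On the question of continuing despite the unavailable file, here is a minimal sketch (not the repo's actual code) of a loader that prefers the live Census URL but falls back to a checked-in snapshot when the fetch fails. The `load_geocodes` name and `LOCAL_CODES_PATH` location are hypothetical; the URL and `skiprows=4` come from the traceback above.

```python
# Sketch: fall back to a local snapshot of the geocodes table when the
# Census server errors out. LOCAL_CODES_PATH is a hypothetical location.
import io
import logging
from pathlib import Path

import pandas as pd
import requests

CODES_URL = ('https://www2.census.gov/programs-surveys/popest/'
             'geographies/2018/all-geocodes-v2018.xlsx')
LOCAL_CODES_PATH = Path('data/all-geocodes-v2018.xlsx')  # hypothetical


def load_geocodes() -> pd.DataFrame:
    """Load the 2018 geocodes table, preferring the live URL and
    falling back to the local snapshot when the fetch fails."""
    try:
        resp = requests.get(CODES_URL, timeout=30)
        resp.raise_for_status()
        # Refresh the local snapshot so future outages are covered too.
        LOCAL_CODES_PATH.parent.mkdir(parents=True, exist_ok=True)
        LOCAL_CODES_PATH.write_bytes(resp.content)
        source = io.BytesIO(resp.content)
    except requests.RequestException as exc:
        logging.warning('Census fetch failed (%s); using local snapshot', exc)
        if not LOCAL_CODES_PATH.exists():
            raise  # no snapshot available; surface the original error
        source = LOCAL_CODES_PATH
    return pd.read_excel(source, skiprows=4)
```

If `FipsLookup.__init__` called a loader along these lines instead of fetching unconditionally, a Census outage would degrade to the last good snapshot rather than aborting the entire run.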
sydeaka commented 3 years ago

Update: A few moments after I created the issue, subsequent refreshes of the website revealed a message saying that the system was down for maintenance, which likely explains why the file was unavailable. Shortly afterward, the file came back online and the scraper run resumed without incident.

I will leave this issue open so that we can work toward a solution that caches the 2018 data table and stores it in the repo for later reference.
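
For the snapshot itself, a one-off script along these lines would do; the output path under workflow/python/ is an assumption, not an existing directory in the repo.

```python
# Hypothetical one-off script to snapshot the 2018 geocodes table into
# the repo; the output path is illustrative.
from pathlib import Path

import requests

URL = ('https://www2.census.gov/programs-surveys/popest/'
       'geographies/2018/all-geocodes-v2018.xlsx')
OUT = Path('workflow/python/data/all-geocodes-v2018.xlsx')

resp = requests.get(URL, timeout=30)
resp.raise_for_status()  # fail loudly rather than commit an error page
OUT.parent.mkdir(parents=True, exist_ok=True)
OUT.write_bytes(resp.content)
print(f'Saved {len(resp.content)} bytes to {OUT}')
```

Committing the resulting .xlsx would also let a fallback loader like the sketch above work on a fresh clone, even during a Census outage.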