d4bl / COVID19_tracker_data_extraction

Data is often not collected by Black communities when it is needed the most. We have compiled a list of all of the states that have shared data on COVID-19 infections and deaths by race and those that have not. This effort extracts that data from websites to track disparities in COVID-19 deaths and cases for Black people.

South Carolina scraper fails with timeout and/or data structure change #124

Open nkrishnaswami opened 3 years ago

nkrishnaswami commented 3 years ago

The scraper fails with the following error:

  Waiting timed out in 60 seconds
  1. GoToURL(url=https://public.tableau.com/views/EpiProfile/DemoStory?:embed=y&:showVizHome=no)
  2. WaitFor(locators=[('xpath', '//canvas')], condition=Condition.NUMBER_OF_ELEMENTS,number_of_elements=32, timeout=60)
  3. GetRequest(key=cases, find_by=find_tableau_request)
  4. ClearRequests()
  5. FindElement(method=xpath, xpath=//*[@id="tabZoneId4"]/div/div/div/span[2]/div/span/span/span[2]/div[2]/div,ignore_missing=False, context_key=None)
  6. ClickOn(last_element=True, saved_element_name=None)
  7. WaitFor(locators=[('xpath', "//span[contains(text(), 'Deaths')]")], condition=Condition.NUMBER_OF_ELEMENTS,number_of_elements=6, timeout=60)
  8. GetRequest(key=deaths, find_by=<lambda>)
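
The message above says only that the wait timed out, not how many matching elements were actually on the page, which makes it hard to tell a slow load from a layout change. A minimal sketch of a polling wait that reports the last observed count might look like the following. `wait_for_count` and its parameters are hypothetical and are not part of the repository's `WaitFor` step; `count_elements` stands in for whatever call counts the `//canvas` matches.

```python
import time
from typing import Callable


def wait_for_count(count_elements: Callable[[], int],
                   expected: int,
                   timeout: float = 60.0,
                   poll_interval: float = 0.5) -> int:
    """Poll until count_elements() returns `expected`, or raise TimeoutError.

    Hypothetical helper: the error message includes the last observed count
    so a timeout can be told apart from a changed page structure.
    """
    deadline = time.monotonic() + timeout
    last_seen = count_elements()
    while last_seen != expected:
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'Waiting timed out in {timeout:g} seconds: '
                f'expected {expected} elements, last saw {last_seen}')
        time.sleep(poll_interval)
        last_seen = count_elements()
    return last_seen
```

If the last observed count is stable but differs from the expected one (say 31 instead of 32), that would point at a dashboard layout change rather than a slow load.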

nkrishnaswami commented 3 years ago

New error:

ERROR covid19_scrapers.scraper:  'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/mnt/c/cygwin64/home/nkrishna/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/scraper.py", line 55, in run
    rows = self._scrape(start_date=start_date, end_date=end_date,
  File "/mnt/c/cygwin64/home/nkrishna/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/states/south_carolina.py", line 36, in _scrape
    date = self.parse_date(results.page_source)
  File "/mnt/c/cygwin64/home/nkrishna/COVID19_tracker_data_extraction/workflow/python/covid19_scrapers/states/south_carolina.py", line 26, in parse_date
    raw_date_str = page_source.find('em', string=date_pattern).text