dsfsi / covid19za

Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa
https://dsfsi.github.io/covid19za-dash/
MIT License

NICD Provincial Data Scraper #767

Closed lrossouw closed 3 years ago

lrossouw commented 3 years ago

I will probably build a web scraper for the NICD data, most likely in R.

I see it working as follows, running as an hourly cron job:

  1. Pull the covid19za repo and ensure it has all the latest commits.
  2. Read in the relevant CSVs.
  3. Check for a new page with a URL of the form https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-%d-%B-%Y/, e.g. https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-6-nov-2020/. The check is done for the day after the last date in the CSV (to avoid missing days).
  4. Scrape that page, capturing the three tables (cases, tests, deaths & recoveries).
  5. Run sanity checks.
  6. Subject to passing the checks, update the CSVs in the local repo and push directly to this repository. I'm not keen to automate the whole PR process as well.
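
As a rough illustration of step 3, here is a minimal sketch of how the candidate URL for the next day could be built. It is written in Python rather than R and is not the scraper's actual code. The helper names `candidate_url` and `next_url` are hypothetical, and the lowercase three-letter month names are an assumption taken from the example URL above, not a literal rendering of `%d-%B-%Y`.

```python
from datetime import date, timedelta

# Assumption: lowercase three-letter month names, as in the example URL.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
BASE = "https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-"


def candidate_url(day: date) -> str:
    """Build the NICD page URL to probe for a given day (hypothetical helper)."""
    # day.day is not zero-padded, matching "6-nov-2020" in the example.
    return f"{BASE}{day.day}-{MONTHS[day.month - 1]}-{day.year}/"


def next_url(last_csv_date: date) -> str:
    """URL for the day after the last date already captured in the CSV."""
    return candidate_url(last_csv_date + timedelta(days=1))


# If the CSV ends on 5 Nov 2020, the next page to look for is 6 Nov 2020.
print(next_url(date(2020, 11, 5)))
```

Using an explicit month list instead of `strftime` keeps the result independent of the machine's locale, which matters for an unattended cron job.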

Anyone have any concerns?

Sanity checks to stop the process on any of:

Thoughts?

lrossouw commented 3 years ago

I've done this. An example of an automated commit in my fork: https://github.com/lrossouw/covid19za/commit/af1390e16693170dea833dfa0181d088ddd24a27

I've also just deleted a couple of weeks' data on my fork to see how well it does at updating the data.

lrossouw commented 3 years ago

My test above was successful, processing 2 weeks of data. It stopped 3 times due to NICD site changes, but never committed incorrect information. The data was identical apart from the source URL.

lrossouw commented 3 years ago

Closed by b0adcaf2ee170745b7536b7f4c5a549c803b886b

vukosim commented 3 years ago

Thanks @lrossouw, this is so awesome. We can then reduce the chance of errors.

lrossouw commented 3 years ago

NP. It should post within 15 minutes or so of the page going up on NICD's site.

dennisvnel commented 3 years ago

This is really awesome. Well done.

lrossouw commented 3 years ago

Thanks. I can also mention that if the process fails on a particular day because NICD has changed the URL or the format of the page, someone can still capture the data manually. The scraper will then notice this and move on to the next day.
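
The reason a manual capture unblocks the scraper can be sketched as follows. This is an illustrative Python sketch, not the actual R code. The function name `next_date_to_scrape`, the `date` column with `DD-MM-YYYY` values, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime, timedelta


def next_date_to_scrape(csv_text: str):
    """Return the day after the latest date found in the CSV text."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Assumption: a "date" column formatted as DD-MM-YYYY.
    dates = [datetime.strptime(r["date"], "%d-%m-%Y").date() for r in rows]
    return max(dates) + timedelta(days=1)


# Hypothetical data: the scrape of 07-11-2020 failed, so the CSV ends at 06-11-2020.
before = "date,total\n05-11-2020,100\n06-11-2020,110\n"
# Someone then captures 07-11-2020 by hand.
after = before + "07-11-2020,125\n"

print(next_date_to_scrape(before))  # 2020-11-07
print(next_date_to_scrape(after))   # 2020-11-08
```

Because the next date is derived from the CSV on every run rather than stored separately, a manually added row is picked up automatically.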