NICD Provincial Data Scraper

lrossouw commented 3 years ago

I plan to build a web scraper for the NICD data, probably in R.

I see it working as follows, running as an hourly cron job:

1. Pull the covid19za repo and make sure it has all the latest commits.
2. Read in the relevant CSVs.
3. Check for new pages in the format https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-%d-%B-%Y/. E.g. https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-6-nov-2020/. The check starts from the day after the last date in the CSV, so that no days are missed.
4. Scrape that page, capturing the three tables (cases, tests, deaths & recoveries).
5. Run the sanity checks.
6. Subject to passing the checks, update the CSVs in the local repo and push directly to this repository. I'm not keen to automate the whole PR process too.
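
The URL check in the steps above can be sketched as follows. The scraper discussed here is planned in R, so this Python version is only an illustration. The names `candidate_url`, `BASE` and `MONTHS` are hypothetical. The unpadded day and lowercase abbreviated month are assumptions taken from the example URL (6-nov-2020); a literal `%d-%B-%Y` would instead give a zero-padded day and a full month name, so the real pages may need more than one candidate pattern.

```python
from datetime import date, timedelta

# Assumption: URL prefix copied from the example in this thread.
BASE = "https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-"

# Fixed table rather than strftime("%b"), which depends on the locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def candidate_url(last_csv_date: date) -> str:
    """Build the URL to probe for the day after the last date in the CSV."""
    target = last_csv_date + timedelta(days=1)
    # Assumption: unpadded day and lowercase short month, as in "6-nov-2020".
    return f"{BASE}{target.day}-{MONTHS[target.month - 1]}-{target.year}/"


print(candidate_url(date(2020, 11, 5)))
```

If the page does not exist yet, the hourly run would simply try the same URL again next time.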

Anyone have any concerns?

Sanity checks to stop the process on any of:

Exceptions
Non-numeric data
Numbers that are not strictly increasing
Unexpected province names
Unrealistic increases? More than 10% on the cumulative total per day?
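
A minimal sketch of how these checks might look, in Python for illustration only (the actual scraper is planned in R). The function `check_row`, the dictionary layout and the `max_growth` parameter are hypothetical; the 10% threshold is the figure floated above, not a settled value. Any failed check raises an exception, which stops the run before anything is committed.

```python
PROVINCES = {
    "Eastern Cape", "Free State", "Gauteng", "KwaZulu-Natal", "Limpopo",
    "Mpumalanga", "Northern Cape", "North West", "Western Cape",
}


def check_row(previous: dict, scraped: dict, max_growth: float = 0.10) -> dict:
    """Validate one day's scraped cumulative counts against the previous day.

    previous: province -> last known cumulative count (int)
    scraped:  province -> raw cell text from the page (str)
    Returns the cleaned counts, or raises ValueError on any failed check.
    """
    # Province name check: the scraped names must match exactly.
    if set(scraped) != PROVINCES:
        raise ValueError(f"unexpected provinces: {set(scraped) ^ PROVINCES}")

    cleaned = {}
    for province, raw in scraped.items():
        # Non-numeric check, after dropping thousands separators.
        text = raw.replace(" ", "").replace(",", "")
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"{province}: non-numeric value {raw!r}")
        value = int(text)

        # Strictly increasing check, as proposed in the thread.
        if value <= previous[province]:
            raise ValueError(f"{province}: {value} is not above {previous[province]}")

        # Unrealistic increase check on the cumulative figure.
        if value > previous[province] * (1 + max_growth):
            raise ValueError(f"{province}: increase above {max_growth:.0%}")

        cleaned[province] = value
    return cleaned
```

Raising on the first problem keeps the behaviour simple: either the whole day passes or nothing is written.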

Thoughts?

lrossouw commented 3 years ago

I've done this. An example of an automated commit in my fork: https://github.com/lrossouw/covid19za/commit/af1390e16693170dea833dfa0181d088ddd24a27

I've also just deleted a couple of weeks' data on my fork to see how well it does in updating data.

lrossouw commented 3 years ago

My test above was successful, processing 2 weeks' data. It stopped 3 times due to NICD site changes, but never committed incorrect information. The re-scraped data was identical to what was there before, other than the source URL.

lrossouw commented 3 years ago

Closed by b0adcaf2ee170745b7536b7f4c5a549c803b886b

vukosim commented 3 years ago

Thanks @lrossouw, this is so awesome. This should reduce the chance of errors.

lrossouw commented 3 years ago

NP. It should post within 15min or so of the page going up on NICD's site.

dennisvnel commented 3 years ago

This is really awesome. Well done.

lrossouw commented 3 years ago

Thanks. I should also mention that if the process fails on a particular day because NICD has changed the URL or the format of the page, someone can still capture the data manually. The scraper will then notice this and move on to the next day.
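
This behaviour follows from always starting at the last date in the CSV plus one day. A rough Python illustration (the scraper itself is in R): the function `next_date_to_scrape` is hypothetical, and the `date` column name and the `%d-%m-%Y` format are assumptions about the CSV layout.

```python
import csv
import io
from datetime import date, datetime, timedelta


def next_date_to_scrape(csv_text: str) -> date:
    """Return the day after the latest date found in the CSV.

    A manually captured row counts like any other row, so the scraper
    moves past a day that it failed to scrape itself.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    # Assumption: a "date" column formatted as dd-mm-yyyy.
    # max() raises ValueError on an empty file, which stops the run.
    latest = max(datetime.strptime(r["date"], "%d-%m-%Y").date() for r in rows)
    return latest + timedelta(days=1)


# The 06-11-2020 row stands in for a manual capture after a failed scrape.
example = "date,total\n05-11-2020,100\n06-11-2020,120\n"
print(next_date_to_scrape(example))
```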

dsfsi / covid19za

NICD Provincial Data Scraper #767