Closed lrossouw closed 3 years ago
I've done this. An example of an automated commit in my fork: https://github.com/lrossouw/covid19za/commit/af1390e16693170dea833dfa0181d088ddd24a27
I've also just deleted a couple of weeks' data on my fork to see how well it does at updating data.
My test above successfully processed 2 weeks of data. It stopped 3 times due to NICD site changes, but not once did it commit incorrect information. The information was identical other than the source URL.
Closed by b0adcaf2ee170745b7536b7f4c5a549c803b886b
Thanks @lrossouw this is so awesome. We can then reduce chances of error.
NP. It should post within 15min or so of the page going up on NICD's site.
This is really awesome. Well done.
Thanks. I should also mention that if the process fails on a particular day because NICD changes the URL or the format of the page, someone can still capture the data manually. The scraper will then notice this and move on to the next day.
I will probably build a web scraper for the NICD data, likely using R.
I see it working as follows, running in an hourly cron job: check for a page at
https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-%d-%B-%Y/
e.g. https://www.nicd.ac.za/latest-confirmed-cases-of-covid-19-in-south-africa-6-nov-2020/. This check will be done for the last date in the CSV + 1 (to avoid missing days). Anyone have any concerns?
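The date-to-URL step could be sketched roughly as below (in Python rather than R, purely for illustration). Note the template in the thread uses %d-%B-%Y while the example URL shows an unpadded day and an abbreviated lowercase month ("6-nov-2020"); this sketch follows the example URL, and the function name is my own, not from the repo.

```python
import datetime

# Assumed URL shape, matching the example link in the thread.
URL_TEMPLATE = (
    "https://www.nicd.ac.za/"
    "latest-confirmed-cases-of-covid-19-in-south-africa-{day}-{month}-{year}/"
)

def next_page_url(last_csv_date: datetime.date) -> str:
    """Build the URL for the day after the last date already in the CSV."""
    target = last_csv_date + datetime.timedelta(days=1)
    return URL_TEMPLATE.format(
        day=target.day,                       # unpadded day, e.g. "6"
        month=target.strftime("%b").lower(),  # abbreviated month, e.g. "nov"
        year=target.year,
    )

# If the CSV's last row is 5 Nov 2020, the scraper would probe the 6 Nov page.
print(next_page_url(datetime.date(2020, 11, 5)))
```

An hourly cron would call this, fetch the URL, and simply do nothing if the page does not exist yet.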
Sanity checks to stop the process on any of:
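The specific checks aren't listed in this excerpt; as a purely hypothetical illustration of the kind of guard meant here (thresholds and the cumulative-total framing are my assumptions, not the actual checks):

```python
def passes_sanity_checks(prev_total: int, new_total: int) -> bool:
    """Illustrative guards only; the real checks in the issue are not shown here."""
    if new_total < prev_total:
        # Cumulative case counts should never decrease.
        return False
    if new_total > prev_total * 2:
        # An implausibly large jump likely means the page was parsed incorrectly.
        return False
    return True
```

On any failed check the scraper would stop without committing, leaving that day to be captured manually as described above.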
Thoughts?