dsfsi / covid19za

Coronavirus COVID-19 (2019-nCoV) Data Repository and Dashboard for South Africa
https://dsfsi.github.io/covid19za-dash/
MIT License
255 stars, 201 forks

[Feature] Additional data sources #927

Closed rudigiesler closed 2 years ago

rudigiesler commented 2 years ago

Is your feature request related to a problem? Please describe. The COVID-19 case data on the ContactNDoH WhatsApp line is currently updated manually each day, and we want to automate it. We've tried to get access to an official API from the NDoH but haven't succeeded yet, so we're investigating scraping the data from official sources. For that we need: total cases; new/latest cases (past day), both total and per province; total full recoveries; total deaths; total vaccines administered; and a timestamp of when the data was last updated.

The total and new/latest cases are stored in covid19za_provincial_cumulative_timeline_confirmed.csv, which seems to be pulled either from the HTML of the NICD's website or by gis_nicd_scraper, but a brief search of the repo didn't show me where that is defined. The same information appears at https://sacoronavirus.co.za/covid-19-daily-cases/ , which has an embedded dashboard that gets its data from https://gis.nicd.ac.za/hosting/rest/services/WARDS_MN/MapServer/0/query , a JSON API from which we can easily pull data. It provides data all the way down to ward level, but it only supplies current totals and the latest totals for the last day, not historical data. I didn't see this data source being used in this repo, and I'm wondering if there's a reason for that.
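As a rough illustration of pulling from that JSON API, here is a minimal sketch. It assumes the endpoint behaves like a standard ArcGIS REST `query` operation (parameters `where`, `outFields`, `returnGeometry`, `f`) and returns a `features` list whose items carry an `attributes` dict. The function names are made up for this example, and the layer's actual field names are not known here, so none are hard-coded.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the issue; it may no longer be available.
QUERY_URL = "https://gis.nicd.ac.za/hosting/rest/services/WARDS_MN/MapServer/0/query"


def build_query_url(where="1=1", out_fields="*"):
    """Build a query URL using standard ArcGIS REST query parameters."""
    params = {
        "where": where,             # "1=1" selects every row
        "outFields": out_fields,    # "*" requests all attribute fields
        "returnGeometry": "false",  # ward polygons are not needed for counts
        "f": "json",                # ask for a JSON response
    }
    return f"{QUERY_URL}?{urlencode(params)}"


def extract_rows(payload):
    """Return the attribute dict of each feature in an ArcGIS-style response."""
    return [feature["attributes"] for feature in payload.get("features", [])]


def fetch_rows(where="1=1"):
    """Fetch and parse rows over the network (assumes the endpoint still responds)."""
    with urlopen(build_query_url(where), timeout=30) as response:
        return extract_rows(json.load(response))
```

Calling `fetch_rows()` would return one dict per ward; large layers may be paginated by the server, which this sketch does not handle.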

For full recoveries, deaths, and vaccines administered, there are totals at the bottom of the homepage https://sacoronavirus.co.za/ , which are quite easily scraped. Unfortunately these are just served as HTML, so I couldn't find the source the numbers are pulled from. In this repo, it seems that they are fetched from the daily images using OCR.

Describe the solution you'd like Initially I was going to create a background task that scrapes the data from the above-mentioned sources, stores it in a database, and exposes an API for it, along with some basic checks to ensure accuracy (ensure that totals are always increasing; poll every hour and, if the data has changed, append it with a timestamp; etc.).
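The checks described above could look roughly like the sketch below. It is only an illustration under assumptions: the in-memory `history` list stands in for the database, and the function names and the `total_cases` style keys are hypothetical, not taken from any existing code.

```python
from datetime import datetime, timezone


def is_non_decreasing(previous, current):
    """True if every cumulative total present in both snapshots has not gone down."""
    return all(current[key] >= previous[key] for key in previous if key in current)


def append_if_changed(history, snapshot, now=None):
    """Append (timestamp, snapshot) to history only when the data has changed.

    Returns True if a row was appended and False if the snapshot was unchanged.
    Raises ValueError if a cumulative total decreased, so bad scrapes are not stored.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if history:
        _, last = history[-1]
        if snapshot == last:
            return False  # nothing new since the previous poll
        if not is_non_decreasing(last, snapshot):
            raise ValueError(f"total decreased: {last} -> {snapshot}")
    history.append((now, dict(snapshot)))
    return True
```

An hourly job would call `append_if_changed(history, scraped_totals)` after each scrape; rejecting a decreasing total is one possible policy, and flagging it for manual review would be another.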

If there's a way to avoid duplicating effort, that would be great. I'd like to understand whether there are reasons for scraping from the sources used in this repo rather than the sources I've listed above.

We'll probably want to host the scraper and database on our servers, to ensure that we can fix things quickly if they break, but it will be open source code, and an open API that this repo could scrape.

Describe alternatives you've considered

Additional context You can find the current manually updated content on the WhatsApp line here: https://wa.me/27600123456?text=cases

rudigiesler commented 2 years ago

I've put together scraping and APIs for the 3 sources:

For the images, https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/sacoronavirus_images/

For the counters on the homepage: https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/sacoronavirus_counters/

For the NICD GIS: once we had started scraping this source and built up a history, it became clear that it's not updated very regularly and is not very reliable (the latest field is often just 0). So we won't be using it to get more detailed breakdowns, but it will continue to be scraped and stored. It's available at https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/wardcase/ (along with /province, /district, /subdistrict, and /ward), and there's also a flat/denormalized view at https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/wardcase/flat

https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/v2/covidcases/contactndoh/ will give you the latest image and counter data, and if we have the day before that, the daily counts. This is what we use to generate the message on ContactNDoH.
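To illustrate the "daily counts" idea, here is a minimal sketch of deriving them from two consecutive cumulative snapshots. The function name and the `cases`/`deaths` keys are hypothetical; the real endpoint's response fields are not shown in this thread.

```python
def daily_counts(latest, previous):
    """Return per-field differences between two cumulative snapshots.

    Returns None when there is no snapshot for the day before, mirroring the
    "if we have the day before that" condition described above.
    """
    if previous is None:
        return None
    # Only compare fields that exist in both snapshots.
    return {key: latest[key] - previous[key] for key in latest if key in previous}


today = {"cases": 120, "deaths": 7}
yesterday = {"cases": 100, "deaths": 5}
print(daily_counts(today, yesterday))  # {'cases': 20, 'deaths': 2}
```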

Autogenerated docs are available at https://evds-healthcheck-django-prd.covid19-k8s.prd-p6t.org/docs .

rudigiesler commented 6 months ago

Just an update: this scraping has broken, but since we no longer use it for any of our services, we don't want to commit resources to fixing it. The data at this API will therefore no longer receive updates.