ecohealthalliance / One-Health-Database-Africa

Extracting WHO AFRO report data from WHO dashboard and PDFs #1

Open noamross opened 2 years ago

noamross commented 2 years ago

@sebaum requested that we support extracting data from WHO AFRO's health emergency reports. These reports are found in two places:

PDF scraping will be a more intensive task, so we're tabling it until we see whether WHO AFRO can provide the information in the PDFs in another format.

In the meantime, the scope of this issue is to pull the data available from the GIS dashboard, incorporating the code as a reusable function in this repository. The data are not so large that they need to be cached and updated, so pulling everything at once should be fine.

The WHO AFRO GIS dashboard is an app built on the ArcGIS web platform. Such apps generally call ArcGIS server endpoints, which share a common API that allows extracting data in more formats than the app itself requests. Looking at the app's network calls, this server is at https://services.arcgis.com/5T5nSi527N4F7luB/, and the layers called are:

I believe the first layer contains the data we are interested in, but it's worth checking for discrepancies with the other layers and the most recent PDF reports.

At the page at each of these endpoints you can construct a query to fetch data. To fetch all data in the layers

This will generate a query URL you can reuse.
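As a rough illustration of what such a reusable query URL might look like, here is a minimal Python sketch. The parameter names (`where`, `outFields`, `f`, `resultOffset`) are standard ArcGIS REST query parameters. The layer path `EXAMPLE_LAYER/FeatureServer/0` and the function name `build_query_url` are placeholders I made up for illustration, not the actual layers.

```python
from urllib.parse import urlencode

# Hypothetical layer URL: "EXAMPLE_LAYER" and the layer index 0 are placeholders,
# not the real WHO AFRO layer names.
LAYER_URL = (
    "https://services.arcgis.com/5T5nSi527N4F7luB/"
    "arcgis/rest/services/EXAMPLE_LAYER/FeatureServer/0"
)


def build_query_url(layer_url, where="1=1", out_fields="*", offset=0):
    """Build a query URL that asks for all fields of all matching features as JSON."""
    params = {
        "where": where,          # "1=1" matches every record
        "outFields": out_fields,  # "*" requests every attribute column
        "f": "json",             # response format
        "resultOffset": offset,  # number of records to skip
    }
    return f"{layer_url}/query?{urlencode(params)}"


print(build_query_url(LAYER_URL))
```

The printed URL can be pasted into a browser to inspect the response before wiring it into a function.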

Only 2000 features will be fetched at once, so repeated requests with different "Result Offset" values are probably needed to get all the records.
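The paging described above could be sketched roughly as follows in Python. This assumes the layer supports pagination via the standard `resultOffset` and `resultRecordCount` query parameters and returns features under a `features` key with an `attributes` object each. The function names `fetch_page` and `fetch_all` are hypothetical, and the layer URL is supplied by the caller. The loop simply keeps requesting pages until one comes back empty.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 2000  # per-request feature limit mentioned above


def fetch_page(layer_url, offset, page_size=PAGE_SIZE):
    """Fetch one page of features from an ArcGIS REST query endpoint."""
    params = urlencode({
        "where": "1=1",
        "outFields": "*",
        "f": "json",
        "resultOffset": offset,
        "resultRecordCount": page_size,
    })
    with urlopen(f"{layer_url}/query?{params}") as resp:
        return json.load(resp)


def fetch_all(layer_url, fetch=fetch_page, page_size=PAGE_SIZE):
    """Collect attribute records page by page until an empty page is returned."""
    records = []
    offset = 0
    while True:
        page = fetch(layer_url, offset, page_size)
        if "error" in page:
            # ArcGIS reports failures inside the JSON body
            raise RuntimeError(f"query failed: {page['error']}")
        features = page.get("features", [])
        if not features:
            break
        records.extend(f["attributes"] for f in features)
        offset += len(features)  # advance by what was actually received
    return records
```

Passing the page fetcher in as an argument keeps the loop testable without network access. Stopping on an empty page costs one extra request but avoids relying on the server's reported page size.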

For each event, we want to extract:

I don't know how repeat reports for the same event are handled in the database - we may need a slightly more complex schema for this, particularly if we ultimately scrape the weekly PDFs. Consult @sebaum about the most useful output structure.

emmamendelsohn commented 2 years ago

The non-PDF scraping is mostly completed here: who-reports. Remaining issues/questions:

I think @sebaum compiled examples of these issues?