finddx / FINDCov19TrackerData


FIND COVID-19 Test Data Collection


Announcement

As of 15 April 2023, we have stopped updating the COVID-19 testing data. The number of countries tracking and reporting data for COVID-19 tests has dropped significantly since the emergence of the Omicron variant. This dataset provided valuable insight during the pandemic to partners and global health stakeholders, and was the main source of testing data for the Diagnostics pillar of the ACT-Accelerator. The accompanying Shiny application will be maintained online as a resource for historical data.

General description

Several countries and entities, including the World Bank, publish aggregate estimates on the total number of tests performed. These reports are published across individual websites and press releases – often in multiple languages and updated with different periodicity.

The FIND team collects test data from information found online, combines it with case and death data from Johns Hopkins University, and displays the result in an interactive tracker dashboard.

This repository contains the intermediate and final data of the data collection process.

Available Data

Sources

Test data: Collated every day by the FIND team from information found online. A large fraction is automated via Python and R (see below). A minor fraction is gathered by manual visits to the respective country websites. Generally, the official government websites of each country are consulted.

Case data: Downloaded daily from the COVID-19 repository of Johns Hopkins University (JHU).

Aggregation

When aggregating over periods and/or groups, we apply the following principles in turn:

1. Aggregation over period: If data is missing for more than 25% of the most recent observations, the period is considered incomplete and no aggregated value is computed.

2. Aggregation over group: Group aggregations use all the countries for which data is available. If a ratio is computed (e.g., per capita measures, positivity rate), we only consider observations that have values for both the numerator and the denominator. E.g., to calculate tests per capita for a continent, a country is only used if it reports both test and population data.

When we aggregate both over period and group, we do period aggregation first and group aggregation second.

The codebook provides a detailed description of how these two steps look for each variable.
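The two principles above can be sketched in a few lines of Python. This is an illustration only, not the repository's implementation: the function names are hypothetical, the period aggregate is assumed to be a plain sum, and the 25% rule is assumed to apply to all observations in the period. The codebook remains the authoritative description for each variable.

```python
from typing import Dict, List, Optional

MAX_MISSING_SHARE = 0.25  # threshold from principle 1


def aggregate_period(values: List[Optional[float]]) -> Optional[float]:
    """Sum one period (assumed aggregate), or None if the period is incomplete."""
    if not values:
        return None
    missing = sum(1 for v in values if v is None)
    if missing / len(values) > MAX_MISSING_SHARE:
        return None  # more than 25% missing: no aggregated value
    return sum(v for v in values if v is not None)


def aggregate_group_ratio(
    numerators: Dict[str, Optional[float]],
    denominators: Dict[str, Optional[float]],
) -> Optional[float]:
    """Group ratio using only members that have both numerator and denominator."""
    usable = [
        key for key in numerators
        if numerators[key] is not None and denominators.get(key) is not None
    ]
    total_denominator = sum(denominators[key] for key in usable)
    if not usable or total_denominator == 0:
        return None
    return sum(numerators[key] for key in usable) / total_denominator


# Hypothetical example data: eight observations per country, None = missing.
tests = {
    "A": [100, 120, None, 110, 90, 95, 105, 100],  # 12.5% missing: complete
    "B": [50, None, None, None, 60, 55, 50, 45],   # 37.5% missing: incomplete
    "C": [10, 10, 10, 10, 10, 10, 10, 10],         # complete, but no population
}
population = {"A": 1000, "B": 500, "C": None}

# Period aggregation first, group aggregation second.
period_totals = {country: aggregate_period(obs) for country, obs in tests.items()}
tests_per_capita = aggregate_group_ratio(period_totals, population)
print(period_totals, tests_per_capita)
```

In this example only country A contributes to the group ratio: B is dropped because its period is incomplete, and C because it has no population value.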

Workflow Description

High-level Overview

Most of the test data is scraped automatically using a combination of Python and R-based solutions. Test counts are queried twice daily (early in the morning and late in the evening). Because countries change their way of reporting from time to time, manual action is needed for some countries.

  1. Most countries are scraped via Python using Selenium or JSON libraries, and the results are placed in automated/selenium/.
  2. Countries that report in PDF (or other non-HTML formats) are queried via R functions and placed in automated/fetch/.
  3. Lastly, country information gathered via manual website visits is added and combined into a single information source listing the number of tests from all different sources (located at automated/merged/).

The R package {FindCovTracker}, which powers most of the automated actions run via GitHub Actions, takes this data and writes processed/coronavirus_test.csv.

In a final step, the workflow combines the data with case and death data from Johns Hopkins University and writes processed/data_all.csv, which is used as the input for the Shiny app.

Details

This section explains the workflow in greater detail, including links to all R functions from the {FindCovTracker} R package and how conflicts/errors are handled in the individual stages.

1. Test data scraping via Selenium

2. Test data scraping via "R fetch functions"

3. Combination of Selenium, "R fetch functions", and manual updates

The third step in the CI workflow combines the results from Selenium, the fetch functions, and manual updates (when these are available in manual/processed/). The function get_test_data() writes a combined data source to automated/merged/. In addition, it writes the list of countries that errored (all-countries-error.csv).
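A combination step of this kind can be sketched as follows. This does not reproduce get_test_data(): the function name combine_sources, the field names, and in particular the precedence order (manual over fetch over Selenium) are assumptions made for the example, since the description above does not state how conflicts between sources are resolved.

```python
from typing import Dict, List, Optional, Tuple

Source = Dict[str, Optional[int]]  # country code -> cumulative tests, None on failure


def combine_sources(
    selenium: Source,
    fetch: Source,
    manual: Source,
    expected_countries: List[str],
) -> Tuple[Dict[str, dict], List[str]]:
    """Merge three sources into one record per country and list the failures."""
    # Assumed precedence: a manual update overrides any automated value.
    ordered = (("manual", manual), ("fetch", fetch), ("selenium", selenium))
    combined: Dict[str, dict] = {}
    errored: List[str] = []
    for country in expected_countries:
        for source_name, source in ordered:
            value = source.get(country)
            if value is not None:
                combined[country] = {"tests_cumulative": value, "source": source_name}
                break
        else:
            # No source produced a value: the country goes on the error list.
            errored.append(country)
    return combined, errored


# Hypothetical inputs: BB failed in Selenium and was fixed manually, DD failed everywhere.
selenium_results = {"AA": 1000, "BB": None}
fetch_results = {"CC": 300}
manual_results = {"BB": 250}

combined, errored = combine_sources(
    selenium_results, fetch_results, manual_results, ["AA", "BB", "CC", "DD"]
)
print(combined)
print(errored)
```

Keeping the source name next to each value makes it easy to see which countries relied on manual input on a given day.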

4. Analysis of Workflow Run

The last step performs some analysis on the previous workflow steps. In particular, combine_all_tests()

This step exists twice in the GHA workflow file:

The reasoning here is that if the .csv file containing the manual information for countries is uploaded, it should only be merged into the final file. The automated data scraping should not be triggered again since a new run could lead to new failures for some countries. These new failing countries would then be missing for the day since they were not processed manually beforehand.

5. Combine with other data

COVID-19 Case data is processed in the following way:

A final step combines case and test data and aggregates data into groups:

FINDCov19Tracker::create_shiny_data(): makes use of coronavirus_cases.csv and coronavirus_tests.csv. The function writes processed/data_all.csv, which is used as the input for the Shiny app.
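The idea of joining case and test data can be illustrated with a small Python sketch. It is not a port of create_shiny_data(): the column names (country, date, new_cases, new_tests), the inline sample data, and the helper names are all hypothetical, and the real CSV files may be structured differently.

```python
import csv
import io
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (country, date)

# Hypothetical sample data standing in for the case and test files.
CASES_CSV = """country,date,new_cases
AA,2021-01-01,10
AA,2021-01-02,12
BB,2021-01-01,5
"""

TESTS_CSV = """country,date,new_tests
AA,2021-01-01,200
BB,2021-01-01,100
BB,2021-01-02,80
"""


def read_keyed(text: str, value_column: str) -> Dict[Key, float]:
    """Read CSV text into a mapping from (country, date) to a numeric value."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["country"], row["date"]): float(row[value_column]) for row in reader}


def combine(cases: Dict[Key, float], tests: Dict[Key, float]) -> List[dict]:
    """Outer-join cases and tests; compute positivity only when both are present."""
    rows = []
    for key in sorted(set(cases) | set(tests)):
        new_cases = cases.get(key)
        new_tests = tests.get(key)
        if new_cases is not None and new_tests:
            positivity = new_cases / new_tests
        else:
            positivity = None  # missing numerator or denominator, or zero tests
        rows.append({
            "country": key[0],
            "date": key[1],
            "new_cases": new_cases,
            "new_tests": new_tests,
            "positivity": positivity,
        })
    return rows


rows = combine(read_keyed(CASES_CSV, "new_cases"), read_keyed(TESTS_CSV, "new_tests"))
for row in rows:
    print(row)
```

The positivity rate is left empty whenever either side is missing, which mirrors the ratio rule in the Aggregation section.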