As of 15 April 2023, we have stopped updating the COVID-19 testing data. The number of countries tracking and reporting data for COVID-19 tests declined significantly after the emergence of the Omicron variant. During the pandemic, this dataset provided valuable insight to partners and global health stakeholders and was the main source of testing data for the Diagnostics pillar of the ACT-Accelerator. The accompanying Shiny application will be maintained online as a resource for historical data.
Several countries and entities, including the World Bank, publish aggregate estimates of the total number of tests performed. These reports are published across individual websites and press releases, often in multiple languages and updated at different intervals. The FIND team collects test data from information found online, combines it with case and death data from Johns Hopkins University, and displays both in an interactive tracker dashboard.
This repository contains the intermediate and final data of the data collection process.
- `processed/coronavirus_test.csv`: test data collected by the FIND team.
- `processed/data_all.csv`: test data combined with case and death data from Johns Hopkins University, including group aggregations. An interactive tracker dashboard displays the data.
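Both outputs are plain CSV files and can be read directly, for example from a checkout of this repository (a minimal sketch):

```r
# Read the two processed outputs (paths relative to the repository root).
tests <- read.csv("processed/coronavirus_test.csv")
all_data <- read.csv("processed/data_all.csv")
str(all_data)
```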
Test data: collated every day by the FIND team from information found online. A large fraction is automated via Python and R (see below); a minor fraction is gathered by manual visits to the respective country websites. Generally, the official government website of each country is consulted.
Case data: downloaded daily from the COVID-19 Johns Hopkins University (JHU) repository.
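As an illustration, the JHU time series can be fetched straight from GitHub; the URL below points at the raw CSV in the JHU repository (a sketch, not the pipeline's actual download code):

```r
# Sketch: read the JHU global confirmed-case time series from GitHub.
url <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_time_series/",
  "time_series_covid19_confirmed_global.csv"
)
jhu_cases <- read.csv(url, check.names = FALSE)
head(jhu_cases[, 1:6])  # country metadata plus the first date columns
```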
When aggregating over periods and/or groups, we apply the following principles in turn:
1. Aggregation over period: if data is missing for more than 25% of the most recent observations, the period is considered incomplete and no aggregated value is computed.
2. Aggregation over group: group aggregations use all countries for which data is available. If a ratio is computed (e.g., per-capita measures, positivity rate), we only consider observations that have values for both the numerator and the denominator. For example, to calculate tests per capita for a continent, a country is only used if it reports both test and population data.
When we aggregate over both period and group, we do period aggregation first and group aggregation second, as sketched below. The codebook provides a detailed description of how these two steps look for each variable.
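The following sketch illustrates the two principles on a toy data set; the column names are hypothetical, and the real, per-variable rules live in the codebook:

```r
library(dplyr)

# Toy daily data; country "A" is missing 50% of its observations.
daily <- tibble::tribble(
  ~country, ~continent, ~date, ~new_tests, ~population,
  "A",      "X",        1,     100,        10e6,
  "A",      "X",        2,     NA,         10e6,
  "B",      "X",        1,     50,         5e6,
  "B",      "X",        2,     60,         5e6
)

# 1. Period aggregation: a country's period is incomplete (NA) if more
#    than 25% of its observations are missing.
period_agg <- daily %>%
  group_by(country, continent) %>%
  summarise(
    tests = if (mean(is.na(new_tests)) > 0.25) NA_real_
            else sum(new_tests),
    population = first(population),
    .groups = "drop"
  )

# 2. Group aggregation: for a ratio, only keep countries that report
#    both the numerator (tests) and the denominator (population).
period_agg %>%
  filter(!is.na(tests), !is.na(population)) %>%
  group_by(continent) %>%
  summarise(tests_per_capita = sum(tests) / sum(population))
```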
The main part of the test data is scraped automatically by combining Python- and R-based solutions. COVID-19 tests are queried twice daily (early in the morning and late in the evening). Because countries change their way of reporting from time to time, manual action is needed for some countries.
The scraped data is written to `automated/selenium/` (Selenium-based scraping) and `automated/fetch/` (fetch functions) and then merged into `automated/merged/`. The R package {FINDCov19Tracker}, which powers most of the automated actions run via GitHub Actions, takes this data and writes `processed/coronavirus_test.csv`. In a final step, the workflow combines the data with case and death data from Johns Hopkins University and writes `processed/data_all.csv`, which is used as the input for the Shiny app.
This section explains the workflow in greater detail, including links to all R functions from the {FINDCov19Tracker} R package and how conflicts/errors are handled in the individual stages.
1. Selenium scraping: `python3 selenium/run.py` is executed. The results are written to the `automated/selenium/` directory with a prefix of the respective date, and `new_tests` values are calculated from the difference to the previous day (see the sketch after this list). Countries that fail in `selenium/test.py` are reported as `NA` and are also listed in `all-countries-error.csv`.
2. Fetch functions: `fetch_test_data()` processes the countries specified in the respective upstream file with dedicated functions for the given file type (e.g., PDF).
3. Merge: the third step in the CI workflow combines the results from Selenium, the fetch functions, and manual updates when they are available in `manual/processed/`. The function `get_test_data()` writes a combined data source to `automated/merged/`. In addition, the list of countries that errored (`all-countries-error.csv`) is written.
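The `new_tests` calculation in step 1 amounts to differencing each country's cumulative totals day over day; a minimal base-R sketch with hypothetical column names:

```r
# Sketch: derive new_tests as the day-over-day difference of the
# cumulative test count. Column names are hypothetical.
tests <- data.frame(
  country = c("A", "A", "A"),
  date = as.Date("2023-01-01") + 0:2,
  tests_cumulative = c(1000, NA, 1100)  # NA: scraping failed that day
)
tests <- tests[order(tests$country, tests$date), ]
tests$new_tests <- ave(
  tests$tests_cumulative, tests$country,
  FUN = function(x) c(NA, diff(x))  # no previous value on the first day
)
# A failed scrape (NA cumulative count) propagates NA to new_tests,
# and the country is flagged in all-countries-error.csv.
```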
The last step performs some analysis on the previous workflow steps. In particular, `combine_all_tests()` writes the countries that errored during scraping to a date-prefixed file (`$DATE-need-processing.csv`) and writes `coronavirus_tests_new.csv`, which lists information from all dates and all countries that have been processed so far. This step exists twice in the GHA workflow file:
- `run-analysis` runs when automated scraping has happened before and therefore includes a `needs` condition.
- `run-analysis-manual` runs only if the commit message contains "manually processed countries". In this scenario, the scraping jobs are not triggered.

The reasoning here is that if the `.csv` file containing the manual information for countries is uploaded, it should only be merged into the final file. The automated data scraping should not be triggered again, since a new run could lead to new failures for some countries. These newly failing countries would then be missing for the day, since they were not processed manually beforehand.
COVID-19 case data is processed in the following way: `FINDCov19Tracker::process_jhu_data()` is the main function, which starts all Johns Hopkins University (JHU) data processing. It calls `FINDCov19Tracker::preprocess_jhu_data()` and `FINDCov19Tracker::check_jhu_data()` and writes `processed/coronavirus_cases.csv`.

A final step combines case and test data and aggregates the data into groups:
`FINDCov19Tracker::create_shiny_data()` makes use of `coronavirus_cases.csv` and `coronavirus_tests.csv`. The function writes `processed/data_all.csv`, which is used as the input for the Shiny app.
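Taken together, the case-data stage reduces to the following call sequence; the function names are those referenced above, but the exact arguments are omitted and assumed to be handled by the workflow:

```r
# Sketch of the case-data pipeline described above. Function names come
# from the text; calling them without arguments is an assumption.
library(FINDCov19Tracker)

process_jhu_data()   # runs preprocess_jhu_data() and check_jhu_data(),
                     # writes processed/coronavirus_cases.csv
create_shiny_data()  # combines coronavirus_cases.csv and
                     # coronavirus_tests.csv into processed/data_all.csv
```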