This repo contains several utilities for wrangling COVID-19 data from the Johns Hopkins University (JHU) COVID-19 repository.
NOTE: The utilities currently do not work because of the new file formats. They will be updated shortly to work with the revised formats.
A note on cloning this repo: the COVID19 directory is a git submodule, so after cloning, run
git submodule init
and
git submodule update
to clone the JHU repo as a submodule.

The files in this directory and how they're used:
covid-19_ingest.sh: script that converts the JHU COVID-19 daily-report data to a time-series database using TimescaleDB.
covid-refine: OpenRefine automation script that converts JHU COVID-19 time-series data into a normalized, enriched format and uploads it to TimescaleDB. (RECOMMENDED)
schema.sql: Data definition (DDL) to create the necessary tables and hypertables.
environment: Default environment values used in Docker containers.

Create a database, covid_19, and an application user, covid19_user:
psql
create database covid_19;
create user covid19_user WITH PASSWORD 'your-password-here';
alter database covid_19 OWNER TO covid19_user;
\quit
Run schema.sql as covid19_user (VACUUM and ANALYZE require owner privileges):
psql -U covid19_user -h <the.server.hostname> -f schema.sql covid_19
Install csvkit, using either apt-get (Debian/Ubuntu) or Homebrew:
sudo apt-get install csvkit
brew install csvkit
Using a text editor, replace the values of the environment variables PGHOST, PGUSER, and PGPASSWORD in covid-19_ingest.sh.
Run the script
bash covid-19_ingest.sh
(OPTIONAL) Add the shell script to your crontab to run it daily.
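As a minimal sketch of the optional cron step, the snippet below builds a crontab entry that runs the ingest script once a day. The helper name make_cron_line, the 06:00 schedule, and the example script and log paths are assumptions for illustration, not part of this repo; adjust them to your setup.

```shell
#!/usr/bin/env bash
# Build a crontab entry that runs the ingest script daily at 06:00.
# $1: path to covid-19_ingest.sh, $2: log file (both hypothetical paths).
make_cron_line() {
  # Cron fields: minute hour day-of-month month day-of-week, then the command.
  printf '0 6 * * * bash %s >> %s 2>&1\n' "$1" "$2"
}

# Example paths (assumed); change them to where the repo is cloned.
make_cron_line "$HOME/covid19-utils/covid-19_ingest.sh" "$HOME/covid-19_ingest.log"

# To install the entry, append it to the current crontab (uncomment to run):
# ( crontab -l 2>/dev/null; \
#   make_cron_line "$HOME/covid19-utils/covid-19_ingest.sh" "$HOME/covid-19_ingest.log" \
# ) | crontab -
```

The install line is left commented out so that running the snippet only prints the entry; review the printed line before adding it to your crontab.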
You can now slice and dice the data using the full power of PostgreSQL along with Timescale's time-series capabilities!
NOTE: Due to the changing file format of JHU's daily report data, covid-refine is recommended over covid-19_ingest.sh.
COVIDrefine has the added benefit of producing fully normalized, non-sparse, geo-enriched data.
See the detailed README.
If you just want to download the COVIDrefine data, the latest version can be found here.
To build and start the Docker containers:
git submodule init
docker-compose build
docker-compose up
The script uses the directory ~/.covid-19 in your home directory.
covid-19_ingest.sh checks lastcsvprocessed. Delete that file to process all daily-report files from the beginning, or change the date in the file to start processing files after that date.

create a Superset visualization
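The reset described in the note on lastcsvprocessed can be sketched as two small shell helpers. This assumes the marker file lives at ~/.covid-19/lastcsvprocessed, which the text implies but does not state outright; the function names reset_ingest_state and backup_ingest_state are hypothetical. The date format inside the file is not shown here, so inspect the existing contents before editing the date by hand.

```shell
#!/usr/bin/env bash
# Assumption: the marker file is ~/.covid-19/lastcsvprocessed.
STATE_DIR="${STATE_DIR:-$HOME/.covid-19}"
STATE_FILE="$STATE_DIR/lastcsvprocessed"

# Keep a copy of the marker before deleting or hand-editing it.
backup_ingest_state() {
  if [ -f "$STATE_FILE" ]; then
    cp -- "$STATE_FILE" "$STATE_FILE.bak"
  fi
}

# Remove the marker so the next run processes all daily-report files
# from the beginning.
reset_ingest_state() {
  rm -f -- "$STATE_FILE"
}
```

Calling backup_ingest_state followed by reset_ingest_state, then running bash covid-19_ingest.sh, would reprocess everything while leaving a .bak copy of the previous marker to restore from.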
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.