Longtime listener, first-time caller... I've been processing the raw Daily Reports (https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports) for a month now and wanted to share a couple of data quality reports (i.e., exception reports) that I've created. They are generated automatically in Python and identify several of the missing/duplicated data issues that I've seen posted here. Both reports are attached and linked in this post.

The first report (https://www.linkedin.com/posts/troy-hughes-27a998a8_covid-19-jhu-daily-reports-data-quality-activity-6671173160479547392-qUWD) evaluates the structure and content of the CSV files, including:

- identifies state records that appeared and subsequently disappeared
- demonstrates the incidence of "Unassigned" records and evaluates the ratio of "Unassigned" cases and deaths to current case rates to gauge which states have a bottleneck of records awaiting assignment to a county
- identifies all "Out of state" cases and deaths and evaluates these for each state and with respect to the daily case rates and death rates, again to identify potential bottlenecks in state and/or county assignment
- identifies counties that are no longer reporting data (i.e., do not appear in the most recent raw CSV file), and demonstrates the last date on which the county appeared, as well as its most recent total cases and total deaths

The second report (https://www.linkedin.com/posts/troy-hughes-27a998a8_jhu-daily-reports-covid-19-longitudinal-activity-6672149643671035904-fKAM) evaluates both state-level and county-level data (including cases and deaths) longitudinally (i.e., a between-rows comparison), including:

- identifies all records for which state data (either cases or deaths) were duplicated across two or more consecutive days
- identifies all days on which either a state's total case count or total death count decreased
- identifies all counties whose case and death counts were duplicated across two or more consecutive dates (so long as the county's 7-day moving average for case/death rates is above 10)
- identifies all counties whose case or death rates are negative (which is impossible), including the date on which the rate anomaly was observed, and the number of cases or deaths that were effectively removed from the county's cumulative count
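For anyone curious how the longitudinal checks might work, here is a minimal pandas sketch of two of them: flagging cumulative counts that are duplicated on consecutive days, and flagging days on which a cumulative count decreased. The DataFrame below is made-up sample data (not the actual JHU files), and the column names simply mirror the daily-report fields; the real reports may compute these differently.

```python
import pandas as pd

# Hypothetical cumulative case counts for two states across four days.
# The data are fabricated for illustration only.
df = pd.DataFrame({
    "Province_State": ["Texas"] * 4 + ["Ohio"] * 4,
    "Date": pd.to_datetime(
        ["2020-05-25", "2020-05-26", "2020-05-27", "2020-05-28"] * 2
    ),
    "Confirmed": [100, 100, 120, 110, 50, 55, 60, 70],
})
df = df.sort_values(["Province_State", "Date"])

# Daily change in the cumulative count, computed within each state.
df["Daily_Change"] = df.groupby("Province_State")["Confirmed"].diff()

# Check 1: cumulative count duplicated across consecutive days (change == 0).
duplicated = df[df["Daily_Change"] == 0]

# Check 2: cumulative count decreased (a negative daily rate, which is
# impossible for a cumulative series).
negative = df[df["Daily_Change"] < 0]

print(duplicated[["Province_State", "Date"]].to_string(index=False))
print(negative[["Province_State", "Date", "Daily_Change"]].to_string(index=False))
```

The `groupby(...).diff()` step is the key: it compares each row only to the previous row for the same state, so the first day of each state yields `NaN` rather than a spurious cross-state comparison.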
I'd appreciate any and all feedback, and as these reports are 100% automated, please let me know if anyone would like an updated version.

Attachments:
- JHU_COVID-19_Daily_Reports_US_Data_Quality_Report_20200528.pdf
- JHU_COVID-19_Daily_Reports_US_Longitudinal_Data_Quality_Report_20200528.pdf