CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

keep the shape of the data consistent #6

Closed tendersoft-mjb closed 4 years ago

tendersoft-mjb commented 4 years ago

A column in time_series_2019-ncov-Confirmed.csv used to be named 'First confirmed date in country (est.)' but is now 'First confirmed date in country'. This small change breaks all the downstream analytics.

Besides, the column name is misleading since it contains dates of first confirmed cases in either state/province or in country - depending on which is the smaller administrative unit.

There could be 2 columns:

  1. First confirmed date in province/country <-- with data from current 'First confirmed date in country' column
  2. First confirmed date in country <-- optional, preserves compatibility

The 1st one showing data from former/current column 'First confirmed date in country (est.)'/'First confirmed date in country'.

The 2nd one showing the actual first date for the country as a whole. The column is not strictly necessary, since people who need it can add it on their side, but it would preserve backward compatibility with existing analytical solutions.

Either way, please keep the names and data consistent, because changes like this cause errors and confusion in the analytic pipeline down the line.
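Until the header is stable, a consumer-side workaround is to map every known header variant to one canonical name before analysis. The following is only a minimal sketch: `COLUMN_ALIASES`, `read_rows`, and the canonical name `first_confirmed_date` are hypothetical names chosen for illustration, and the one-row sample CSV is a stand-in, not a copy of the real file.

```python
import csv
import io

# Hypothetical alias table: every header variant seen so far maps to one
# canonical name, so a rename upstream does not break downstream code.
COLUMN_ALIASES = {
    "First confirmed date in country (est.)": "first_confirmed_date",
    "First confirmed date in country": "first_confirmed_date",
}


def read_rows(csv_text):
    """Parse CSV text and rename known header variants to canonical names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {COLUMN_ALIASES.get(name, name): value for name, value in row.items()}
        for row in reader
    ]


# Stand-in samples with the old and the new header (illustrative only).
old_csv = (
    "Province/State,Country/Region,First confirmed date in country (est.)\n"
    "Anhui,Mainland China,22.01.2020\n"
)
new_csv = (
    "Province/State,Country/Region,First confirmed date in country\n"
    "Anhui,Mainland China,22.01.2020\n"
)

# Both header variants yield identical rows after normalization.
same_result = read_rows(old_csv) == read_rows(new_csv)
```

This only hides renames that are already listed in the alias table; a dropped column still has to be handled explicitly.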

CSSEGISandData commented 4 years ago

There have been a few complaints about this field. We have removed it from all of the time series data, due to its misleading and possibly inaccurate values. Sorry for the confusion.

Bost commented 4 years ago

Guys, you just inserted a new sheet "Announcement" into the Google spreadsheet. That effectively breaks any algorithm trying to parse the datetime information from the sheet names. PLEASE: if you want to convey new information via the spreadsheet, place it in some unused location, e.g. a previously empty column! Thank you.
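On the consuming side, a parser can tolerate such sheets by skipping any name that does not parse as a datetime. This is a rough sketch only: the format string `%m-%d-%Y` is an assumption made for illustration (the real sheet-name pattern is not stated in this thread), and `parse_sheet_names` is a hypothetical helper.

```python
from datetime import datetime

# ASSUMPTION: sheet names look like "02-08-2020". Replace this with the
# pattern the spreadsheet actually uses.
SHEET_NAME_FORMAT = "%m-%d-%Y"


def parse_sheet_names(names, fmt=SHEET_NAME_FORMAT):
    """Split sheet names into parsed datetimes and names that are not dates."""
    parsed = {}
    skipped = []
    for name in names:
        try:
            parsed[name] = datetime.strptime(name, fmt)
        except ValueError:
            # Not a date-like name (e.g. "Announcement"): skip it, do not crash.
            skipped.append(name)
    return parsed, skipped


parsed, skipped = parse_sheet_names(["02-07-2020", "Announcement", "02-08-2020"])
```

Logging the skipped names rather than discarding them silently makes it easier to notice when the naming scheme itself has changed.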

tendersoft-mjb commented 4 years ago

@CSSEGISandData IMHO removing the column completely is the worst possible solution, because

  1. the shape of the data is broken once more
  2. some types of analysis are now impossible

For example, with the date of patient 0 in each area we could track the growth rate of each local epidemic, as in this chart:

[Figure: 2019-nCoV_confirmed_cases_since_day0_per_Country20200208]

Note: China is on the right Y-axis, all others on the left Y-axis.

Sure, there are errors in the data: for some areas the date of the first reported case is later than the earliest date in the daily updates, hence the -1 on the X-axis. Still, it's very useful to know whether local epidemics are progressing slower or faster than the main one.

The 'First confirmed date in province/country' column is useful especially for Chinese data, where we do not have the daily updates for at least 18 days after patient 0 was diagnosed.
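The "days since day 0" axis used for this kind of comparison is just the difference between an observation date and the area's first confirmed date. A minimal sketch, assuming dates are written as DD.MM.YYYY; `days_since_first_case` is a hypothetical helper name:

```python
from datetime import datetime

DATE_FORMAT = "%d.%m.%Y"  # assumed DD.MM.YYYY notation


def parse_date(text):
    """Convert a DD.MM.YYYY string into a date object."""
    return datetime.strptime(text, DATE_FORMAT).date()


def days_since_first_case(first_confirmed, observed):
    """Return the day offset of an observation relative to day 0.

    The result is negative when the recorded first-case date is later
    than the observation, which is how a -1 can appear on the X-axis.
    """
    return (parse_date(observed) - parse_date(first_confirmed)).days


# A first-case date one day after the earliest daily update gives -1.
offset_before = days_since_first_case("22.01.2020", "21.01.2020")
offset_later = days_since_first_case("22.01.2020", "08.02.2020")
```

Plotting each area against this offset instead of the calendar date lines up the local epidemics at their own starting points.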

I encourage you to add this column back. However, to keep the shape of the data from changing again you could add another spreadsheet/CSV with just the following data:

| Province/State | Country/Region | First confirmed date in province/country | First confirmed date in country | Lat | Long |
|---|---|---|---|---|---|
| Anhui | Mainland China | 22.01.2020 | 03.01.2020 | 31,82571 | 117,2264 |
| Beijing | Mainland China | 21.01.2020 | 03.01.2020 | 40,18238 | 116,4142 |
| Zhejiang | Mainland China | 21.01.2020 | 03.01.2020 | 29,18251 | 120,0985 |
| | Thailand | 21.01.2020 | 21.01.2020 | 13,7563 | 100,5018 |
| | Japan | 21.01.2020 | 21.01.2020 | 35,6762 | 139,6503 |

This worksheet would be mostly static, changing only if and when a new area gets infected. All other worksheets/CSVs could be updated almost independently from this one.
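Consumers could then join such a lookup sheet to any time-series row by the (Province/State, Country/Region) pair. The sketch below is an illustration under assumptions: the in-memory `LOOKUP` dict stands in for the proposed worksheet, `enrich` and the output key names are hypothetical, and coordinates are assumed to use a decimal comma as in the example rows.

```python
from datetime import datetime


def parse_decimal_comma(text):
    """Convert a number written with a decimal comma, e.g. '31,82571'."""
    return float(text.replace(",", "."))


def parse_date(text):
    """Convert a DD.MM.YYYY string into a date object."""
    return datetime.strptime(text, "%d.%m.%Y").date()


# Stand-in for the proposed static worksheet, keyed by (province, country).
# Areas without a province use an empty string as the first key element.
LOOKUP = {
    ("Anhui", "Mainland China"): ("22.01.2020", "03.01.2020", "31,82571", "117,2264"),
    ("", "Thailand"): ("21.01.2020", "21.01.2020", "13,7563", "100,5018"),
}


def enrich(row, lookup=LOOKUP):
    """Return a copy of a time-series row with the static fields attached."""
    result = dict(row)
    meta = lookup.get((row["Province/State"], row["Country/Region"]))
    if meta is None:
        return result  # area not in the lookup sheet yet: leave row unchanged
    first_area, first_country, lat, long_ = meta
    result["first_confirmed_area"] = parse_date(first_area)
    result["first_confirmed_country"] = parse_date(first_country)
    result["Lat"] = parse_decimal_comma(lat)
    result["Long"] = parse_decimal_comma(long_)
    return result


enriched = enrich({"Province/State": "Anhui", "Country/Region": "Mainland China"})
```

Because the join key is the area name pair, the time-series files can change their date columns freely without touching the lookup sheet.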