keep the shape of the data consistent

tendersoft-mjb commented 4 years ago

A column in time_series_2019-ncov-Confirmed.csv used to be named 'First confirmed date in country (est.)' but is now 'First confirmed date in country'. This small change breaks all the downstream analytics.

Besides, the column name is misleading since it contains dates of first confirmed cases in either state/province or in country - depending on which is the smaller administrative unit.

There could be 2 columns:

First confirmed date in province/country <-- with data from current 'First confirmed date in country' column
First confirmed date in country <-- optional, preserves compatibility

The 1st one would show the data from the former/current column 'First confirmed date in country (est.)'/'First confirmed date in country'.

The 2nd one would show the actual first date for the country as a whole. This column is not strictly necessary, since people who need it can add it on their side, but it would preserve backward compatibility with existing analytical solutions.

Either way, please keep the names and data consistent, because changes like this cause errors and confusion in analytic pipelines down the line.
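Until the header is stable, a consumer can guard against this kind of rename by looking the column up through a list of known spellings. The following is only a minimal sketch: `FIRST_CONFIRMED_ALIASES` and `find_column` are hypothetical names, and the inline sample is a single row adapted from the table later in this thread, not the real file.

```python
import csv
import io

# Hypothetical alias list: the two header spellings mentioned in this issue.
FIRST_CONFIRMED_ALIASES = (
    "First confirmed date in country (est.)",
    "First confirmed date in country",
)

def find_column(header, aliases):
    """Return the first alias present in the header, or None if the column is gone."""
    for name in aliases:
        if name in header:
            return name
    return None

# Illustrative sample only; the real CSV has many more columns and rows.
sample = io.StringIO(
    "Province/State,Country/Region,First confirmed date in country,Lat,Long\n"
    "Anhui,Mainland China,22.01.2020,31.82571,117.2264\n"
)
reader = csv.DictReader(sample)
column = find_column(reader.fieldnames, FIRST_CONFIRMED_ALIASES)
rows = list(reader)

# Fall back to an empty list instead of raising KeyError when the column was removed.
first_dates = [row[column] for row in rows] if column else []
```

This only softens a rename; it cannot recover a column that has been dropped from the file.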

CSSEGISandData commented 4 years ago

There have been a few complaints about this field. We have removed it from all of the time series data, due to its misleading and possibly inaccurate values. Sorry for the confusion.

Bost commented 4 years ago

Guys, you just inserted a new sheet, "Announcement", into the Google spreadsheet. That effectively breaks any algorithm that parses the datetime information from the sheet names. PLEASE: if you want to convey new information via the spreadsheet, place it in some unused location, e.g. a previously empty column! Thank you.
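On the consumer side, a parser can skip sheets whose names are not timestamps rather than fail on the first one. This is a rough sketch under a loud assumption: the format string `SHEET_NAME_FORMAT` and the sample names are hypothetical, since the actual naming scheme of the spreadsheet is not stated in this thread.

```python
from datetime import datetime

# Hypothetical sheet-name format; the real spreadsheet may use a different one.
SHEET_NAME_FORMAT = "%Y-%m-%d_%H%M"

def parse_sheet_names(names, fmt=SHEET_NAME_FORMAT):
    """Split sheet names into parsed timestamps and skipped names."""
    timestamped, skipped = {}, []
    for name in names:
        try:
            timestamped[name] = datetime.strptime(name, fmt)
        except ValueError:
            # Not a timestamp (e.g. "Announcement"): record it and move on.
            skipped.append(name)
    return timestamped, skipped

# Made-up sheet names for illustration.
timestamped, skipped = parse_sheet_names(
    ["2020-02-08_2240", "Announcement", "2020-02-07_2013"]
)
```

Logging the `skipped` list makes it visible when a new non-data sheet appears, without stopping the pipeline.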

tendersoft-mjb commented 4 years ago

@CSSEGISandData IMHO removing the column completely is the worst possible solution, because

the shape of the data is broken once more
some types of analysis are now impossible

For example, with the date of patient 0 in each area we could track the growth rate of the local epidemic, as in the chart below (2019-nCoV_confirmed_cases_since_day0_per_Country20200208). Note: China is on the right Y-axis, all others are on the left Y-axis.

Sure, there are errors in the data: in some cases the date of the first reported case is later than the earliest date in the daily updates, hence the -1 on the X-axis. Still, it's very useful to know whether local epidemics are progressing slower or faster than the main one.
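The alignment behind such a chart can be sketched as re-keying each area's daily counts by the number of days since its first confirmed date. The function name `align_to_day_zero` and the counts below are made up for illustration; they are not taken from the dataset.

```python
from datetime import date

def align_to_day_zero(daily_counts, first_confirmed):
    """Re-key a {date: count} series by days elapsed since the first confirmed case."""
    return {
        (day - first_confirmed).days: count
        for day, count in sorted(daily_counts.items())
    }

# Made-up counts: the first daily update predates the recorded first case by one day.
first_confirmed = date(2020, 1, 22)
daily_counts = {
    date(2020, 1, 21): 1,
    date(2020, 1, 22): 4,
    date(2020, 1, 23): 9,
}
aligned = align_to_day_zero(daily_counts, first_confirmed)
```

Here the entry for 21 January lands on day -1, which is the kind of inconsistency that produces the -1 on the X-axis.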

The 'First confirmed date in province/country' column is useful especially for Chinese data, where we do not have the daily updates for at least 18 days after patient 0 was diagnosed.

I encourage you to add this column back. However, to keep the shape of the data from changing again you could add another spreadsheet/CSV with just the following data:

Province/State	Country/Region	First confirmed date in province/country	First confirmed date in country	Lat	Long
Anhui	Mainland China	22.01.2020	03.01.2020	31,82571	117,2264
Beijing	Mainland China	21.01.2020	03.01.2020	40,18238	116,4142
Zhejiang	Mainland China	21.01.2020	03.01.2020	29,18251	120,0985
	Thailand	21.01.2020	21.01.2020	13,7563	100,5018
	Japan	21.01.2020	21.01.2020	35,6762	139,6503

This worksheet would be mostly static, changing only if and when a new area gets infected. All other worksheets/CSVs could be updated almost independently from this one.
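A consumer could then load this static table once and look areas up by (Province/State, Country/Region). The sketch below assumes the table is delivered tab-separated exactly as shown above, with dd.mm.yyyy dates and decimal commas; `load_metadata` and the output field names are hypothetical, and only two of the rows above are reproduced.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the proposed table above, tab-separated.
METADATA_TSV = (
    "Province/State\tCountry/Region\tFirst confirmed date in province/country\t"
    "First confirmed date in country\tLat\tLong\n"
    "Anhui\tMainland China\t22.01.2020\t03.01.2020\t31,82571\t117,2264\n"
    "\tThailand\t21.01.2020\t21.01.2020\t13,7563\t100,5018\n"
)

def parse_date(text):
    """Parse a dd.mm.yyyy date as used in the proposed table."""
    return datetime.strptime(text, "%d.%m.%Y").date()

def parse_decimal(text):
    """Parse a number written with a decimal comma."""
    return float(text.replace(",", "."))

def load_metadata(text):
    """Index the static table by (province, country)."""
    index = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        key = (row["Province/State"], row["Country/Region"])
        index[key] = {
            "first_local": parse_date(row["First confirmed date in province/country"]),
            "first_country": parse_date(row["First confirmed date in country"]),
            "lat": parse_decimal(row["Lat"]),
            "long": parse_decimal(row["Long"]),
        }
    return index

metadata = load_metadata(METADATA_TSV)
```

Because the key is the same pair of columns that the time series files carry, the time series can change shape or update daily without touching this lookup.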

CSSEGISandData / COVID-19

keep the shape of the data consistent #6