CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Mismatch in number of rows #2351

Open sgunthe opened 4 years ago

sgunthe commented 4 years ago

While trying to read the raw data to run a model, I found a slight mismatch in the order/number of rows. For example, in time_series_covid19_confirmed_global.csv and time_series_covid19_deaths_global.csv "India" is at row 133, but in time_series_covid19_recovered_global.csv "India" is at row 127. This causes problems when I read the files in MATLAB. Can someone help me fix this? I appreciate the help in advance.

Thanks and regards,

Stay healthy and be safe.

Dialvec commented 4 years ago

Hi. I've been having the same issue. Among other differences, there is one tricky discrepancy between the recovered table and the other two.

If you check Canada's data, you will find that recovered_global counts Canada as a single territory, while the other two files split Canada into several rows (one per province).

With that in mind, if you want to work with the tables by row number, you will have to sum Canada's rows into a single row in the confirmed and deaths files. You will also have to reorder the rows so the countries match.
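A rough sketch of that summing step in Python (standard library only), in case it helps. I am assuming the column layout Province/State, Country/Region, Lat, Long followed by one column per date. The sample rows, the numbers, and the `sum_by_country` helper are made up for illustration and are not taken from the repository.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the assumed layout; the counts are NOT real data.
CONFIRMED = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
Alberta,Canada,53.9,-116.5,1,2
Ontario,Canada,51.2,-85.3,3,4
,India,21.0,78.0,5,6
"""

def sum_by_country(csv_text):
    """Return (dates, totals), where totals maps each country to the
    per-date sum over all of its province rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    country_col = header.index("Country/Region")
    first_date_col = header.index("Long") + 1  # assumes dates follow "Long"
    dates = header[first_date_col:]
    totals = defaultdict(lambda: [0] * len(dates))
    for row in reader:
        # Treat an empty cell as 0 so a missing value does not crash int().
        counts = [int(v or 0) for v in row[first_date_col:]]
        country = row[country_col]
        totals[country] = [a + b for a, b in zip(totals[country], counts)]
    return dates, dict(totals)

dates, by_country = sum_by_country(CONFIRMED)
print(by_country["Canada"])  # [4, 6]
print(by_country["India"])   # [5, 6]
```

After this step every file has one entry per country, so the row order in the original CSV no longer matters.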

P.S. Check the end of the tables. There is tricky info there.

P.S. 2: I suggest you use primary keys other than row number. You will drastically reduce the time spent on this kind of issue.
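To illustrate the key idea, here is a minimal sketch in Python (standard library only) that looks rows up by a (Country/Region, Province/State) pair instead of by position, and lists the keys that appear in only one file. The column names are assumed, and the sample rows and the `index_by_key` helper are hypothetical, with made-up numbers.

```python
import csv
import io

# Made-up samples in the assumed layout; row order differs on purpose.
CONFIRMED = """Province/State,Country/Region,Lat,Long,1/22/20
Alberta,Canada,53.9,-116.5,1
Ontario,Canada,51.2,-85.3,3
,India,21.0,78.0,5
"""
RECOVERED = """Province/State,Country/Region,Lat,Long,1/22/20
,India,21.0,78.0,2
,Canada,56.1,-106.3,0
"""

def index_by_key(csv_text):
    """Map (Country/Region, Province/State) to that row as a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(r["Country/Region"], r["Province/State"]): r for r in reader}

confirmed = index_by_key(CONFIRMED)
recovered = index_by_key(RECOVERED)

# The same key finds India in both tables, whatever its row number is.
key = ("India", "")
print(confirmed[key]["1/22/20"], recovered[key]["1/22/20"])  # 5 2

# Keys present in only one table point at structural mismatches.
only_in_confirmed = sorted(set(confirmed) - set(recovered))
only_in_recovered = sorted(set(recovered) - set(confirmed))
print(only_in_confirmed)  # [('Canada', 'Alberta'), ('Canada', 'Ontario')]
print(only_in_recovered)  # [('Canada', '')]
```

The set differences make cases like Canada visible right away, which is easier than comparing row numbers by eye.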