CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE
https://systems.jhu.edu/research/public-health/ncov/

Perhaps the population data file - UID_ISO_FIPS_LookUp_Table - has some incorrect information #2748

Closed armsp closed 4 years ago

armsp commented 4 years ago

I aggregated the population per Country_Region from the population file and compared it with Worldometer's population data:

import pandas as pd
population_uri = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv'
population_d = pd.read_csv(population_uri)
population_d.groupby('Country_Region').aggregate({'Country_Region': 'first', 'Population': 'sum'})
I see some glaring differences:

Country   JHU              Worldometer
US        997,376,806      331,002,651
Russia    292,579,178      145,934,462
India     2,751,276,819    1,380,004,385
Canada    75,711,404       37,742,154
China     2,809,352,660    1,439,323,776

Any idea why JHU's population figures are often several times the actual population? For quite a few of the other countries the data matches, but for some of the largest countries it's all over the place.

CSSEGISandData commented 4 years ago

@armsp I believe the discrepancy is due to a duplication of counts in the aggregation in locations for which there is subnational data. For example, for the US we have population counts at the county, state, and national level. So if you aggregate by the Country_Region value, you will wind up tripling the count for the US. Canada similarly has national and provincial population figures, which is likely why you see roughly a doubling when aggregating.
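The duplication described above can be reproduced on a small toy table. This is only a sketch: the country names ("Aland", "Bland") and all population numbers are made up for illustration, and only the column names are taken from the lookup table. "Aland" has rows at the national, state, and county level, so a naive sum counts its population three times.

```python
import pandas as pd

# Hypothetical toy data (NOT real figures): one country with national,
# state, and county rows, and one country with a national row only.
rows = [
    ("Aland", None,    None, "Aland",           100),  # national
    ("Aland", "North", None, "North, Aland",     60),  # state
    ("Aland", "South", None, "South, Aland",     40),  # state
    ("Aland", "North", "A",  "A, North, Aland",  35),  # county
    ("Aland", "North", "B",  "B, North, Aland",  25),  # county
    ("Aland", "South", "C",  "C, South, Aland",  40),  # county
    ("Bland", None,    None, "Bland",            50),  # national only
]
toy = pd.DataFrame(
    rows,
    columns=["Country_Region", "Province_State", "Admin2",
             "Combined_Key", "Population"],
)

# Summing every row per country adds the national, state, and county
# levels together, so Aland's population of 100 is counted three times.
naive = toy.groupby("Country_Region")["Population"].sum()
print(naive)
```

In this toy case the naive total for "Aland" is three times its national figure, which mirrors the roughly threefold US number in the comparison above.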

armsp commented 4 years ago

@CSSEGISandData You are absolutely right. I should have investigated further. I got the correct results with just the following:

population_d[population_d['Country_Region'] == population_d['Combined_Key']]
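As a rough illustration of why this filter works, here is a sketch on made-up data (the names "Aland" and "Bland" and all numbers are hypothetical). It assumes that a national row's Combined_Key is exactly the country name, while subnational rows carry a longer key such as "North, Aland". The uniqueness check at the end is an added sanity step, not part of the original one-liner.

```python
import pandas as pd

# Hypothetical toy data (NOT real figures) using the lookup table's column names.
toy = pd.DataFrame(
    {
        "Country_Region": ["Aland", "Aland", "Aland", "Bland"],
        "Combined_Key": ["Aland", "North, Aland", "South, Aland", "Bland"],
        "Population": [100, 60, 40, 50],
    }
)

# Keep only rows whose Combined_Key equals the country name,
# i.e. the national-level rows under the assumption stated above.
national = toy[toy["Country_Region"] == toy["Combined_Key"]]

# Sanity check: after filtering, each country should appear exactly once.
assert national["Country_Region"].is_unique

totals = national.set_index("Country_Region")["Population"]
print(totals)
```

If the uniqueness check fails on the real file, some country has more than one row matching the filter, and the totals would need a closer look.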