Perhaps the population data file - UID_ISO_FIPS_LookUp_Table - has some incorrect information

armsp commented 4 years ago

I aggregated the population per Country_Region from the population file and compared it with Worldometer's population data:

import pandas as pd
# Load the JHU CSSE lookup table, which carries a Population column
population_uri = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv'
population_d = pd.read_csv(population_uri)
# Sum Population over every row that shares the same Country_Region value
population_d.groupby('Country_Region')['Population'].sum()

I see some glaring differences:

| Country | JHU | Worldometer |
| --- | --- | --- |
| US | 997,376,806 | 331,002,651 |
| Russia | 292,579,178 | 145,934,462 |
| India | 2,751,276,819 | 1,380,004,385 |
| Canada | 75,711,404 | 37,742,154 |
| China | 2,809,352,660 | 1,439,323,776 |

Any idea why JHU's population data is often several times larger than the actual population? For quite a few of the other countries the data matches, but for some of the largest countries it's all over the place.

CSSEGISandData commented 4 years ago

@armsp I believe the discrepancy is due to a duplication of counts in the aggregation in locations for which there is subnational data. For example, for the US we have population counts at the county, state, and national level. So if you aggregate by the Country_Region value, you will wind up tripling the count for the US. Canada similarly has national and provincial population figures, which is likely why you see roughly a doubling when aggregating.
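The double counting described above can be reproduced on a small made-up table. This is only a sketch: the countries "Xland" and "Yland" and all figures are hypothetical, and the column names are assumed to mirror those of the lookup table. Xland has a national row, two state rows, and three county rows, each level adding up to 100, so a plain sum by Country_Region counts the same people three times.

```python
import pandas as pd

# Hypothetical toy rows; names and numbers are invented for illustration only.
toy = pd.DataFrame({
    "Country_Region": ["Xland", "Xland", "Xland", "Xland", "Xland", "Xland", "Yland"],
    "Province_State": [None, "North", "South", "North", "North", "South", None],
    "Admin2":         [None, None, None, "A", "B", "C", None],
    "Population":     [100, 60, 40, 35, 25, 40, 50],
})

# Naive aggregation: national + state + county rows are all summed together.
naive = toy.groupby("Country_Region")["Population"].sum()
print(naive)
```

Xland comes out as 300 instead of 100, while Yland, which has only a national row, stays at 50. That matches the pattern in the report: countries without subnational rows agree, and those with them are inflated.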

armsp commented 4 years ago

@CSSEGISandData You are absolutely right. I should have investigated further. I got the correct results with just the following filter, which keeps only the rows where Combined_Key is the country name itself:

population_d[population_d['Country_Region'] == population_d['Combined_Key']]
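For anyone who wants to check the filter without downloading the file, here is a minimal sketch on made-up data. The helper name `national_population`, the countries, and the numbers are all hypothetical; it assumes, as the filter above does, that a country-level row is one whose Combined_Key equals its Country_Region.

```python
import pandas as pd

def national_population(lookup: pd.DataFrame) -> pd.Series:
    """Return one population figure per country from country-level rows only."""
    # Assumption: country-level rows have Combined_Key equal to Country_Region.
    national = lookup[lookup["Country_Region"] == lookup["Combined_Key"]]
    return national.set_index("Country_Region")["Population"]

# Hypothetical toy rows; names and numbers are invented for illustration only.
toy = pd.DataFrame({
    "Country_Region": ["Xland", "Xland", "Xland", "Yland"],
    "Combined_Key":   ["Xland", "North, Xland", "South, Xland", "Yland"],
    "Population":     [100, 60, 40, 50],
})

result = national_population(toy)
print(result)
```

Only the two national rows survive the filter, so each country is counted once. If some country in the real file had no row where the two columns match, it would silently drop out of the result, so it may be worth comparing the set of countries before and after filtering.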

CSSEGISandData / COVID-19

Perhaps the population data file - UID_ISO_FIPS_LookUp_Table - has some incorrect information #2748