STAT545-UBC / Discussion

Public discussion

homework 10 #278

Closed bdacunha closed 8 years ago

bdacunha commented 8 years ago

Hello, I'm currently trying to start my homework 10, but I got stuck trying to put gapminder and geonames together:

countryInfo <- GNcountryInfo()
dataframe1 <- cbind.data.frame(area = countryInfo$areaInSqKm, country = countryInfo$countryName, continentcode = countryInfo$continent)
dataframetbl <- tbl_df(dataframe1)

I have this code, which takes the area, country, and continent from geonames. However, I get continent codes instead of names, and the country names differ from those in gapminder, so when I try to join the two data frames using semi_join or something similar, it doesn't work. Any recommendations?

Thanks,

Brenda

jennybc commented 8 years ago

@aammd will you keep a special eye out here this week? And answer this one?

ksamuk commented 8 years ago

Your idea seems to work in general:

library(gapminder)
library(geonames)
library(dplyr)

options(geonamesUsername="insert_username")

countryInfo <- GNcountryInfo()
dataframe1 <- data.frame(area = countryInfo$areaInSqKm, 
                         country = countryInfo$countryName, 
                         continentcode = countryInfo$continent)

# country is the only column the two data frames share, so join on it explicitly
gap_joined <- left_join(gapminder, dataframe1, by = "country")

gap_joined %>% head

Produces:

      country continent year lifeExp      pop gdpPercap area continentcode
1 Afghanistan      Asia 1952  28.801  8425333  779.4453 <NA>          <NA>
2 Afghanistan      Asia 1957  30.332  9240934  820.8530 <NA>          <NA>
3 Afghanistan      Asia 1962  31.997 10267083  853.1007 <NA>          <NA>
4 Afghanistan      Asia 1967  34.020 11537966  836.1971 <NA>          <NA>
5 Afghanistan      Asia 1972  36.088 13079460  739.9811 <NA>          <NA>
6 Afghanistan      Asia 1977  38.438 14880372  786.1134 <NA>          <NA>

For countries with consistent names, the join works:

gap_joined %>% filter(country == "Canada") %>% head
  country continent year lifeExp      pop gdpPercap      area continentcode
1  Canada  Americas 1952   68.75 14785584  11367.16 9984670.0            NA
2  Canada  Americas 1957   69.96 17010154  12489.95 9984670.0            NA
3  Canada  Americas 1962   71.30 18985849  13462.49 9984670.0            NA
4  Canada  Americas 1967   72.13 20819767  16076.59 9984670.0            NA
5  Canada  Americas 1972   72.88 22284500  18970.57 9984670.0            NA
6  Canada  Americas 1977   74.21 23796400  22090.88 9984670.0            NA

However, the differing country names may need a special workaround. Let's look for Italy. Surely Italy is in both data sets.

countryInfo$countryName %>% grep("Ital", ., value = TRUE)
[1] "Repubblica Italiana"

:scream_cat:
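
One way to see exactly which names fail to match is to compare the two name columns directly. This is only a sketch with made-up stand-in vectors, not the real gapminder and geonames data; `gap_names` and `geo_names` are hypothetical names used for illustration.

```r
# Toy stand-ins for the two country-name columns (illustrative values only).
gap_names <- c("Canada", "Italy", "Afghanistan")
geo_names <- c("Canada", "Repubblica Italiana", "Islamic Republic of Afghanistan")

# Names that appear in the gapminder-style vector but not in the geonames-style one.
unmatched <- setdiff(gap_names, geo_names)
print(unmatched)
```

With the real data, dplyr's anti_join() gives the same kind of answer as whole rows instead of a character vector.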

bdacunha commented 8 years ago

Thanks, Kieran. I will try to use my recently acquired regex knowledge to do this...

ksamuk commented 8 years ago

You're welcome, but I'm not sure regex will save you here! You might be better off trying to find some way to get the gapminder names to match one of the short format country codes that GNcountryInfo() provides.

Googling provides tantalizing possibilities: https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=country+name+to+country+code
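
The idea of matching on a short code rather than on the name can be sketched with toy data. Everything here is an assumption for illustration: the data frames `gap` and `geo` are hypothetical, the codes are typed in by hand, and the Italy area value is illustrative only.

```r
# Toy gapminder-like table with a hand-typed ISO alpha-2 code column.
gap <- data.frame(
  country = c("Canada", "Italy"),
  countryCode = c("CA", "IT"),
  stringsAsFactors = FALSE
)

# Toy geonames-like table: names differ from gapminder, codes do not.
geo <- data.frame(
  countryName = c("Canada", "Repubblica Italiana"),
  countryCode = c("CA", "IT"),
  areaInSqKm = c(9984670, 301230),  # illustrative values
  stringsAsFactors = FALSE
)

# Left join on the shared code; every gapminder row is kept.
joined <- merge(gap, geo, by = "countryCode", all.x = TRUE)
print(joined)
```

Because the key is the code, "Italy" and "Repubblica Italiana" end up on the same row even though the names never match.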

bdacunha commented 8 years ago

Thanks, I think I managed to do it. But now I have another question: how can I calculate the urban population to find the relationship between GDP per capita and the proportion of the population in urban areas? Is it the total population / urban areas? I ask because I don't seem to find urban population data on geonames. Also, should I consider only one year? or should I work with every year? I'm assuming that the urban areas will grow over the years, and gapminder has data since 1952, but I checked on world data bank and they only have data since 1980. Maybe I'm overthinking this?

aammd commented 8 years ago

Hello @bdacunha! This answer is about merging data on countries.

TL;DR: use the countrycode package to painlessly combine countries. This was mentioned in the homework!

As you discovered, all the countries of the world have multiple names in multiple languages, creating a regex nightmare.

As @ksamuk pointed out, this is a solved problem. In fact, it's been solved more than a dozen times, as different organizations have assigned their own unique codes to each country. The most common one is ISO 3166-1.

The output from GNcountryInfo() contains all of these codes. In fact, the dataframe that function gives you is sorted alphabetically by countryCode, which is the ISO alpha-2 code for the country. You will recognize these from internet sites identified by country (.ca, .fr, etc.).

Gapminder, on the other hand, does not have any codes (perhaps an enhancement for the future, @jennybc?). Let's get those codes in the most painless way.

Based on @ksamuk's approach:

library(countrycode)
library(gapminder)
library(geonames)
library(dplyr)

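# GNcountryInfo() needs a geonames username, e.g.
# options(geonamesUsername = "insert_username")
countryInfo <- GNcountryInfo()

# Tidy subset of the geonames columns (not used in the join below,
# which needs countryCode and countryName from the full countryInfo)
country_df <- countryInfo %>% 
  tbl_df %>%
  select(area = areaInSqKm, 
         country = countryName, 
         continentcode = continent) %>%
  mutate(country = as.character(country),
         continentcode = as.character(continentcode))

# WHY did countryInfo arrive as all character vectors?!
## why is life so painful?

# Add the ISO 3166-1 alpha-2 code to gapminder by matching on country name
gapminder_code <- gapminder %>%
  mutate(countryCode = countrycode(country, "country.name", "iso2c"))

# Join on the shared code instead of the inconsistent names
gap_joined_2 <- gapminder_code %>%
  left_join(countryInfo,
            by = "countryCode")

# Compare the geonames name with the gapminder name for each country
gap_joined_2 %>%
  select(countryName, country) %>%
  distinct %>%
  knitr::kable()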
| countryName | country |
|:---|:---|
| Islamic Republic of Afghanistan | Afghanistan |
| Republic of Albania | Albania |
| People’s Democratic Republic of Algeria | Algeria |
| Republic of Angola | Angola |
| Argentine Republic | Argentina |
| Commonwealth of Australia | Australia |
| Republic of Austria | Austria |
| Kingdom of Bahrain | Bahrain |
| Bangladesh | Bangladesh |
| Kingdom of Belgium | Belgium |
| Republic of Benin | Benin |
| Plurinational State of Bolivia | Bolivia |
| Bosnia and Herzegovina | Bosnia and Herzegovina |
| Republic of Botswana | Botswana |
| Federative Republic of Brazil | Brazil |
| Republic of Bulgaria | Bulgaria |
| Burkina Faso | Burkina Faso |
| Republic of Burundi | Burundi |
| Kingdom of Cambodia | Cambodia |
| Republic of Cameroon | Cameroon |
| Canada | Canada |
| Central African Republic | Central African Republic |
| Republic of Chad | Chad |
| Republic of Chile | Chile |
| People’s Republic of China | China |
| Republic of Colombia | Colombia |
| Union of the Comoros | Comoros |
| Democratic Republic of the Congo | Congo, Dem. Rep. |
| Republic of the Congo | Congo, Rep. |
| Republic of Costa Rica | Costa Rica |
| Côte de l’Ivoire | Cote d'Ivoire |
| Republic of Croatia | Croatia |
| Republic of Cuba | Cuba |
| Czech Republic | Czech Republic |
| Kingdom of Denmark | Denmark |
| Republic of Djibouti | Djibouti |
| Dominican Republic | Dominican Republic |
| Republic of Ecuador | Ecuador |
| Arab Republic of Egypt | Egypt |
| Republic of El Salvador | El Salvador |
| Guinee Espagnol | Equatorial Guinea |
| State of Eritrea | Eritrea |
| Federal Democratic Republic of Ethiopia | Ethiopia |
| Republic of Finland | Finland |
| France | France |
| Gabonese Republic | Gabon |
| Gambia | Gambia |
| Federal Republic of Germany | Germany |
| Republic of Ghana | Ghana |
| Hellenic Republic | Greece |
| Republic of Guatemala | Guatemala |
| Republic of Guinea | Guinea |
| Republic of Guinea-Bissau | Guinea-Bissau |
| Republic of Haiti | Haiti |
| Republic of Honduras | Honduras |
| Hong Kong Special Administrative Region | Hong Kong, China |
| Hungary | Hungary |
| Republic of Iceland | Iceland |
| Republic of India | India |
| Republic of Indonesia | Indonesia |
| Islamic Republic of Iran | Iran |
| Republic of Iraq | Iraq |
| Ireland | Ireland |
| State of Israel | Israel |
| Repubblica Italiana | Italy |
| Jamaica | Jamaica |
| Japan | Japan |
| Al Mamlakah al Urduniyah al Hashimiyah | Jordan |
| Republic of Kenya | Kenya |
| Democratic People’s Republic of Korea | Korea, Dem. Rep. |
| Republic of Korea | Korea, Rep. |
| State of Kuwait | Kuwait |
| Lebanon | Lebanon |
| Kingdom of Lesotho | Lesotho |
| Republic of Liberia | Liberia |
| Libya | Libya |
| Republic of Madagascar | Madagascar |
| Republic of Malawi | Malawi |
| Malaysia | Malaysia |
| Republic of Mali | Mali |
| Islamic Republic of Mauritania | Mauritania |
| Republic of Mauritius | Mauritius |
| Mexico | Mexico |
| Mongolia | Mongolia |
| Montenegro | Montenegro |
| Kingdom of Morocco | Morocco |
| Republic of Mozambique | Mozambique |
| Union of Burma | Myanmar |
| Republic of Namibia | Namibia |
| Federal Democratic Republic of Nepal | Nepal |
| Kingdom of the Netherlands | Netherlands |
| New Zealand | New Zealand |
| Republic of Nicaragua | Nicaragua |
| Republic of Niger | Niger |
| Federal Republic of Nigeria | Nigeria |
| Kingdom of Norway | Norway |
| Sultanate of Oman | Oman |
| Islamic Republic of Pakistan | Pakistan |
| Republic of Panama | Panama |
| Republic of Paraguay | Paraguay |
| Republic of Peru | Peru |
| Republic of the Philippines | Philippines |
| Republic of Poland | Poland |
| Portuguese Republic | Portugal |
| Puerto Rico | Puerto Rico |
| Reunion | Reunion |
| România | Romania |
| Republic of Rwanda | Rwanda |
| Sao Tome and Principe | Sao Tome and Principe |
| Kingdom of Saudi Arabia | Saudi Arabia |
| Republic of Senegal | Senegal |
| Serbia | Serbia |
| Republic of Sierra Leone | Sierra Leone |
| Republic of Singapore | Singapore |
| Slovak Republic | Slovak Republic |
| Republic of Slovenia | Slovenia |
| Somalia | Somalia |
| Republic of South Africa | South Africa |
| Kingdom of Spain | Spain |
| Democratic Socialist Republic of Sri Lanka | Sri Lanka |
| Republic of the Sudan | Sudan |
| Kingdom of Swaziland | Swaziland |
| Kingdom of Sweden | Sweden |
| Switzerland | Switzerland |
| Syrian Arab Republic | Syria |
| Taiwan | Taiwan |
| United Republic of Tanzania | Tanzania |
| Kingdom of Thailand | Thailand |
| Togolese Republic | Togo |
| Republic of Trinidad and Tobago | Trinidad and Tobago |
| Republic of Tunisia | Tunisia |
| Republic of Turkey | Turkey |
| Republic of Uganda | Uganda |
| United Kingdom of Great Britain and Northern Ireland | United Kingdom |
| United States | United States |
| Oriental Republic of Uruguay | Uruguay |
| Bolivarian Republic of Venezuela | Venezuela |
| Socialist Republic of Vietnam | Vietnam |
| Palestine | West Bank and Gaza |
| Republic of Yemen | Yemen, Rep. |
| Republic of Zambia | Zambia |
| Republic of Zimbabwe | Zimbabwe |

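
The original question also asked about getting continent names instead of codes. A small lookup table is one way to do that. This is a sketch on toy rows, assuming geonames uses the two-letter continent codes listed below; `continent_lookup` and `continentname` are hypothetical names. Note that the code for North America is the string "NA", which is likely what the Canada rows in the earlier output show, rather than a missing value.

```r
# Lookup table of two-letter continent codes (assumed to match geonames).
continent_lookup <- data.frame(
  continentcode = c("AF", "AN", "AS", "EU", "NA", "OC", "SA"),
  continentname = c("Africa", "Antarctica", "Asia", "Europe",
                    "North America", "Oceania", "South America"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the geonames columns (illustrative rows only).
country_df <- data.frame(
  country = c("Canada", "Repubblica Italiana"),
  continentcode = c("NA", "EU"),
  stringsAsFactors = FALSE
)

# match() compares strings, so the code "NA" is looked up like any other
# code and is not confused with a missing value.
idx <- match(country_df$continentcode, continent_lookup$continentcode)
country_df$continentname <- continent_lookup$continentname[idx]
print(country_df)
```

These names will still not line up with gapminder's continent column, which uses "Americas" rather than separate North and South America.
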
aammd commented 8 years ago

Other questions:

> because I don't seem to find urban population data on geonames.

Check out geonames::GNcities(). Or did you try that and find that it didn't work? Please let us know if so!

> Also, should I consider only one year? or should I work with every year?

Good question. I think it's your call. You may only be able to make this correlation for one year. When a musical remix artist combines tracks, they usually don't use whole songs. Likewise a data remix artist does not always use ALL the data!
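
On how to compute the urban proportion itself: one plausible reading is urban population divided by total population, per country. Here is a minimal sketch under that assumption, with invented numbers and hypothetical data frames (`cities`, `totals`). If the city list comes from a query that returns only some cities, the sum undercounts the true urban population.

```r
# Toy city-level populations (invented numbers).
cities <- data.frame(
  countryCode = c("CA", "CA", "IT"),
  name = c("City A", "City B", "City C"),
  population = c(300, 200, 150),
  stringsAsFactors = FALSE
)

# Toy country totals for a single year (invented numbers).
totals <- data.frame(
  countryCode = c("CA", "IT"),
  pop = c(1000, 600),
  stringsAsFactors = FALSE
)

# Sum city populations per country, then divide by the country total.
urban <- aggregate(population ~ countryCode, data = cities, FUN = sum)
names(urban)[2] <- "urban_pop"
out <- merge(totals, urban, by = "countryCode")
out$urban_share <- out$urban_pop / out$pop
print(out)
```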

bdacunha commented 8 years ago

Hi Andrew,

Thanks for the thorough answer. For the density question, I made it work using countrycode with "iso3c" to add the country codes to gapminder.

For the other question, I was looking more at GNfindNearbyPlaceName because I saw on geonames that all the urban population data is attached to the populated places.

I tried using GNcities but wasn't sure how to proceed, because I thought GNcities works on a bounding box (specified by north, south, east, and west). So do I have to create that box for every country in gapminder? Or should I just pick five or so countries to do this?

aammd commented 8 years ago

Well, both work, and you should do whatever you prefer! That said, note that the output of GNcountryInfo() contains columns called North, South, East, and West, which give the coordinates of the extreme points of each country. So there is no need to create boxes manually.
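
A rough sketch of turning those columns into one bounding box per country, using a toy data frame instead of real GNcountryInfo() output. The coordinates are rounded, illustrative values, and the lowercase column names north, south, east, and west are an assumption; check names(countryInfo) on the real data.

```r
# Hypothetical excerpt of the country table (illustrative coordinates).
countries <- data.frame(
  countryName = c("Canada", "Italy"),
  north = c(83.1, 47.1),
  south = c(41.7, 36.6),
  east = c(-52.6, 18.5),
  west = c(-141.0, 6.6),
  stringsAsFactors = FALSE
)

# Build one named list (north, east, south, west) per country.
boxes <- lapply(seq_len(nrow(countries)), function(i) {
  as.list(countries[i, c("north", "east", "south", "west")])
})
names(boxes) <- countries$countryName

# With a geonames username configured, a box could then be passed on,
# assuming GNcities() takes arguments with these names:
# do.call(geonames::GNcities, boxes[["Canada"]])
print(boxes[["Canada"]])
```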

bdacunha commented 8 years ago

Thanks Andrew, I think I got it now!