Closed bdacunha closed 8 years ago
@aammd will you keep a special eye out here this week? And answer this one?
Your idea seems to work in general:
library(gapminder)
library(geonames)
library(dplyr)
options(geonamesUsername="insert_username")
countryInfo <- GNcountryInfo()
dataframe1 <- data.frame(area = countryInfo$areaInSqKm,
                         country = countryInfo$countryName,
                         continentcode = countryInfo$continent)
gap_joined <- left_join(gapminder, dataframe1)
gap_joined %>% head
Produces:
country continent year lifeExp pop gdpPercap area continentcode
1 Afghanistan Asia 1952 28.801 8425333 779.4453 <NA> <NA>
2 Afghanistan Asia 1957 30.332 9240934 820.8530 <NA> <NA>
3 Afghanistan Asia 1962 31.997 10267083 853.1007 <NA> <NA>
4 Afghanistan Asia 1967 34.020 11537966 836.1971 <NA> <NA>
5 Afghanistan Asia 1972 36.088 13079460 739.9811 <NA> <NA>
6 Afghanistan Asia 1977 38.438 14880372 786.1134 <NA> <NA>
For countries with consistent names, the join works:
gap_joined %>% filter(country == "Canada") %>% head
country continent year lifeExp pop gdpPercap area continentcode
1 Canada Americas 1952 68.75 14785584 11367.16 9984670.0 NA
2 Canada Americas 1957 69.96 17010154 12489.95 9984670.0 NA
3 Canada Americas 1962 71.30 18985849 13462.49 9984670.0 NA
4 Canada Americas 1967 72.13 20819767 16076.59 9984670.0 NA
5 Canada Americas 1972 72.88 22284500 18970.57 9984670.0 NA
6 Canada Americas 1977 74.21 23796400 22090.88 9984670.0 NA
However, the differing country names may need a special workaround. Let's look for Italy. Surely Italy is in both data sets.
countryInfo$countryName %>% grep("Ital", ., value = TRUE)
[1] "Repubblica Italiana"
:scream_cat:
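To see every name that fails to match, not just Italy, an anti_join can list the gapminder countries with no partner in the GeoNames table. This is a minimal sketch on toy data: gap_names and geo_names are hypothetical stand-ins, using a few names that appear in this thread, rather than the real data sets.

```r
library(dplyr)

# Toy stand-ins for the two name columns (hypothetical, for illustration only)
gap_names <- data.frame(country = c("Canada", "Italy", "Afghanistan"),
                        stringsAsFactors = FALSE)
geo_names <- data.frame(country = c("Canada", "Repubblica Italiana",
                                    "Islamic Republic of Afghanistan"),
                        stringsAsFactors = FALSE)

# Rows of gap_names that have no exact name match in geo_names
unmatched <- anti_join(gap_names, geo_names, by = "country")
unmatched$country
```

With the real data, the same call on gapminder and dataframe1 should show how many countries a name-based join would miss.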
Thanks Kieran, I will try to use my recently developed regex knowledge to do this...
You're welcome, but I'm not sure regex will save you here! You might be better off trying to find some way to get the gapminder names to match one of the short format country codes that GNcountryInfo() provides.
Googling provides tantalizing possibilities: https://www.google.ca/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=country+name+to+country+code
Thanks, I think I managed to do it... But now I have another question: how can I calculate the urban population to find the relationship between GDP per capita and the proportion of the population in urban areas? Is it the total population divided by urban areas? I can't seem to find urban population data on geonames. Also, should I consider only one year, or should I work with every year? I'm assuming that the urban areas will grow over the years, and gapminder has info since 1952, but I checked on world data bank and they only have data since 1980. Maybe I'm overthinking this?
Hello @bdacunha! This answer is about merging data on countries.
TL;DR: use the countrycode package to painlessly combine countries. This was mentioned in the homework!
As you discovered, all the countries of the world have multiple names in multiple languages, creating a regex nightmare.
As @ksamuk pointed out, this is a solved problem. In fact it's been solved more than a dozen times, as different organizations have given different unique codes to each country. The most common one is ISO 3166-1.
The output from GNcountryInfo() contains all of these codes. In fact, the dataframe that function gives you is sorted alphabetically by countryCode, which is the ISO alpha-2 code for the country. You will recognize these from internet sites identified by country (.ca, .fr, etc.).
Gapminder, on the other hand, does not have any codes (perhaps an enhancement for the future, @jennybc?). Let's get those codes in the most painless way.
Based on @ksamuk's approach:
library(countrycode)
library(gapminder)
library(geonames)
library(dplyr)
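options(geonamesUsername = "insert_username")  # GeoNames needs a registered username
countryInfo <- GNcountryInfo()

# Trimmed-down table from the earlier approach (not used in the join below)
country_df <- countryInfo %>%
  as_tibble() %>%
  select(area = areaInSqKm,
         country = countryName,
         continentcode = continent) %>%
  mutate(country = as.character(country),
         continentcode = as.character(continentcode))
# WHY did countryInfo arrive as all character vectors?!
## why is life so painful?

# Add ISO alpha-2 codes to gapminder so both tables share a key
gapminder_code <- gapminder %>%
  mutate(countryCode = countrycode(as.character(country), "country.name", "iso2c"))

# Join on the shared code instead of the inconsistent country names
gap_joined_2 <- gapminder_code %>%
  left_join(countryInfo, by = "countryCode")

# Compare the GeoNames name with the gapminder name for each country
gap_joined_2 %>%
  select(countryName, country) %>%
  distinct() %>%
  knitr::kable()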
countryName | country |
---|---|
Islamic Republic of Afghanistan | Afghanistan |
Republic of Albania | Albania |
People’s Democratic Republic of Algeria | Algeria |
Republic of Angola | Angola |
Argentine Republic | Argentina |
Commonwealth of Australia | Australia |
Republic of Austria | Austria |
Kingdom of Bahrain | Bahrain |
Bangladesh | Bangladesh |
Kingdom of Belgium | Belgium |
Republic of Benin | Benin |
Plurinational State of Bolivia | Bolivia |
Bosnia and Herzegovina | Bosnia and Herzegovina |
Republic of Botswana | Botswana |
Federative Republic of Brazil | Brazil |
Republic of Bulgaria | Bulgaria |
Burkina Faso | Burkina Faso |
Republic of Burundi | Burundi |
Kingdom of Cambodia | Cambodia |
Republic of Cameroon | Cameroon |
Canada | Canada |
Central African Republic | Central African Republic |
Republic of Chad | Chad |
Republic of Chile | Chile |
People’s Republic of China | China |
Republic of Colombia | Colombia |
Union of the Comoros | Comoros |
Democratic Republic of the Congo | Congo, Dem. Rep. |
Republic of the Congo | Congo, Rep. |
Republic of Costa Rica | Costa Rica |
Côte de l’Ivoire | Cote d'Ivoire |
Republic of Croatia | Croatia |
Republic of Cuba | Cuba |
Czech Republic | Czech Republic |
Kingdom of Denmark | Denmark |
Republic of Djibouti | Djibouti |
Dominican Republic | Dominican Republic |
Republic of Ecuador | Ecuador |
Arab Republic of Egypt | Egypt |
Republic of El Salvador | El Salvador |
Guinee Espagnol | Equatorial Guinea |
State of Eritrea | Eritrea |
Federal Democratic Republic of Ethiopia | Ethiopia |
Republic of Finland | Finland |
France | France |
Gabonese Republic | Gabon |
Gambia | Gambia |
Federal Republic of Germany | Germany |
Republic of Ghana | Ghana |
Hellenic Republic | Greece |
Republic of Guatemala | Guatemala |
Republic of Guinea | Guinea |
Republic of Guinea-Bissau | Guinea-Bissau |
Republic of Haiti | Haiti |
Republic of Honduras | Honduras |
Hong Kong Special Administrative Region | Hong Kong, China |
Hungary | Hungary |
Republic of Iceland | Iceland |
Republic of India | India |
Republic of Indonesia | Indonesia |
Islamic Republic of Iran | Iran |
Republic of Iraq | Iraq |
Ireland | Ireland |
State of Israel | Israel |
Repubblica Italiana | Italy |
Jamaica | Jamaica |
Japan | Japan |
Al Mamlakah al Urduniyah al Hashimiyah | Jordan |
Republic of Kenya | Kenya |
Democratic People’s Republic of Korea | Korea, Dem. Rep. |
Republic of Korea | Korea, Rep. |
State of Kuwait | Kuwait |
Lebanon | Lebanon |
Kingdom of Lesotho | Lesotho |
Republic of Liberia | Liberia |
Libya | Libya |
Republic of Madagascar | Madagascar |
Republic of Malawi | Malawi |
Malaysia | Malaysia |
Republic of Mali | Mali |
Islamic Republic of Mauritania | Mauritania |
Republic of Mauritius | Mauritius |
Mexico | Mexico |
Mongolia | Mongolia |
Montenegro | Montenegro |
Kingdom of Morocco | Morocco |
Republic of Mozambique | Mozambique |
Union of Burma | Myanmar |
Republic of Namibia | Namibia |
Federal Democratic Republic of Nepal | Nepal |
Kingdom of the Netherlands | Netherlands |
New Zealand | New Zealand |
Republic of Nicaragua | Nicaragua |
Republic of Niger | Niger |
Federal Republic of Nigeria | Nigeria |
Kingdom of Norway | Norway |
Sultanate of Oman | Oman |
Islamic Republic of Pakistan | Pakistan |
Republic of Panama | Panama |
Republic of Paraguay | Paraguay |
Republic of Peru | Peru |
Republic of the Philippines | Philippines |
Republic of Poland | Poland |
Portuguese Republic | Portugal |
Puerto Rico | Puerto Rico |
Reunion | Reunion |
România | Romania |
Republic of Rwanda | Rwanda |
Sao Tome and Principe | Sao Tome and Principe |
Kingdom of Saudi Arabia | Saudi Arabia |
Republic of Senegal | Senegal |
Serbia | Serbia |
Republic of Sierra Leone | Sierra Leone |
Republic of Singapore | Singapore |
Slovak Republic | Slovak Republic |
Republic of Slovenia | Slovenia |
Somalia | Somalia |
Republic of South Africa | South Africa |
Kingdom of Spain | Spain |
Democratic Socialist Republic of Sri Lanka | Sri Lanka |
Republic of the Sudan | Sudan |
Kingdom of Swaziland | Swaziland |
Kingdom of Sweden | Sweden |
Switzerland | Switzerland |
Syrian Arab Republic | Syria |
Taiwan | Taiwan |
United Republic of Tanzania | Tanzania |
Kingdom of Thailand | Thailand |
Togolese Republic | Togo |
Republic of Trinidad and Tobago | Trinidad and Tobago |
Republic of Tunisia | Tunisia |
Republic of Turkey | Turkey |
Republic of Uganda | Uganda |
United Kingdom of Great Britain and Northern Ireland | United Kingdom |
United States | United States |
Oriental Republic of Uruguay | Uruguay |
Bolivarian Republic of Venezuela | Venezuela |
Socialist Republic of Vietnam | Vietnam |
Palestine | West Bank and Gaza |
Republic of Yemen | Yemen, Rep. |
Republic of Zambia | Zambia |
Republic of Zimbabwe | Zimbabwe |
Other questions:
because I don't seem to find urban population data on geonames.
Check out geonames::GNcities(). Or did you try that and find that it didn't work? Please let us know if so!
Also, should I consider only one year? or should I work with every year?
Good question. I think it's your call. You may only be able to make this correlation for one year. When a musical remix artist combines tracks, they usually don't use whole songs. Likewise a data remix artist does not always use ALL the data!
Hi Andrew,
Thanks for the thorough answer. For the density question, I made it work using countrycode with "iso3c" to add the country codes to gapminder.
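For anyone following along, the "iso3c" variant might look roughly like the sketch below. The new column name iso3 is my own choice, and the final check simply lists any gapminder names that countrycode could not convert (they come back as NA).

```r
library(countrycode)
library(gapminder)
library(dplyr)

# Add ISO alpha-3 codes to gapminder; `iso3` is an illustrative column name
gapminder_iso3 <- gapminder %>%
  mutate(iso3 = countrycode(as.character(country), "country.name", "iso3c"))

# Names that could not be converted show up here with iso3 == NA
gapminder_iso3 %>%
  filter(is.na(iso3)) %>%
  distinct(country)
```

The join then needs the matching alpha-3 column on the GeoNames side. I am assuming GNcountryInfo() returns it under the name isoAlpha3, so check names(countryInfo) before relying on that.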
For the other question, I was looking more at GNfindNearbyPlaceName because I saw on geonames that all the urban population data was on the populated places.
I tried using GNcities but wasn't sure how to proceed, because I thought GNcities needs a box (specifying north, south, east and west). So do I have to create that box for every country in gapminder? Or should I just pick 5 countries or so to do this?
Well, both work and you should do whatever you prefer!
That said, note that the output of GNcountryInfo() contains columns called North, South, East and West, which give the coordinates of the extreme points of each country. So no need to create boxes manually.
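A rough sketch of how those columns could feed GNcities() follows. The helper cities_in_country is hypothetical, and I am assuming the columns are spelled north, south, east and west in the data frame, so check names(countryInfo) first. The values are converted with as.numeric because countryInfo arrives as character vectors.

```r
# Hypothetical helper: query cities inside one country's bounding box.
# `info_row` is a single row of GNcountryInfo() output; the lowercase column
# names north/south/east/west are an assumption, not confirmed in this thread.
cities_in_country <- function(info_row, max_rows = 10) {
  geonames::GNcities(
    north = as.numeric(info_row$north),
    south = as.numeric(info_row$south),
    east  = as.numeric(info_row$east),
    west  = as.numeric(info_row$west),
    maxRows = max_rows
  )
}

# Usage (needs a GeoNames username and network access):
# options(geonamesUsername = "insert_username")
# countryInfo <- geonames::GNcountryInfo()
# canada <- countryInfo[countryInfo$countryCode == "CA", ]
# cities_in_country(canada)
```

Keep in mind that a bounding box is a rectangle, so the result may include cities from neighbouring countries that fall inside it.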
Thanks Andrew, I think I got it now!
Hello, I'm currently trying to start my homework 10, but I got stuck trying to put gapminder and geonames together:
I have this code that takes the area, country and continent from geonames, but I get the continent codes instead of names, and for the countries I get different names than in gapminder. So when I try to join the two dataframes using semi_join, or something like that, it won't work. Any recommendations for this?
Thanks,
Brenda