STAT545-UBC / Discussion

Public discussion

do() won't work #288

Closed bdacunha closed 8 years ago

bdacunha commented 8 years ago

Hi,

I (with Kieran's help) have this function:

get_urban_population <- function(country_code){
  countrynames <- GNcountryInfo()
  countrybox <- countrynames %>% 
    filter(countryCode == country_code)

  country_cities <- GNcities(north = countrybox$north, 
                         south = countrybox$south,
                         east = countrybox$east,
                         west = countrybox$west,
                         maxRows = 500) 

  country_cities <- country_cities %>%
    filter(countrycode == country_code)

  urban_pop <- country_cities  %>%
    select(population) %>%
    unlist %>% 
    as.numeric %>% 
    sum
  return(urban_pop)
}

But when I try to apply it to gapminder, it won't work... (my gap2002 has the country code added in iso2c format, and it's filtered for year 2002 only)

gap2002 <- gapminder %>% 
  filter(year == "2002")
head(gap2002)
urban_pop <- gap2002 %>%
  select(countrycode) %>% 
  group_by(countrycode) %>% 
  do(get_urban_population(.))

I get the following error:

Error in if (!repeated && grepl("%[[:xdigit:]]{2}", URL, useBytes = TRUE)) return(URL) : missing value where TRUE/FALSE needed

Can someone help me with this?? thanks!!!!

Brenda

ksamuk commented 8 years ago

group_by breaks the data frame (tbl_df) into groups of data frames. So, in this case, you are passing entire data frames (with a single row each, I guess) to the get_urban_population function, when it expects just a country code.

So I think you might merely need to replace the last line with do(get_urban_population(.$countrycode)). Then you'll need to join the result back into the gap2002 data frame, so you might need some tweaks to smooth that transition.
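
A minimal sketch of that point, using a toy data frame and a stand-in function (`fake_lookup` is hypothetical and makes no GeoNames call). It only reports whether it was handed a data frame, which shows that `.` inside do() is the whole group while `.$countrycode` is the single value the function expects:

```r
library(dplyr)

# Toy stand-in for gap2002; not real gapminder rows.
toy <- data.frame(
  country = c("Gabon", "Ireland"),
  countrycode = c("GA", "IE"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for get_urban_population(): it only reports
# whether its argument is a data frame.
fake_lookup <- function(x) {
  data.frame(is_df = is.data.frame(x))
}

# `.` is the one-row data frame for each group ...
whole_group <- toy %>%
  group_by(countrycode) %>%
  do(fake_lookup(.))

# ... whereas `.$countrycode` is the character value for that group.
one_column <- toy %>%
  group_by(countrycode) %>%
  do(fake_lookup(.$countrycode))
```

Here `whole_group$is_df` is TRUE for every group and `one_column$is_df` is FALSE, which is why the original call ended up handing a data frame to the GeoNames query.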

bdacunha commented 8 years ago

It takes forever to run on the whole gapminder data... and then it shows an error... I tried creating a small set of countries and running the function on that, but it still won't work...

my_country_list <- c("Kuwait", "Libya", "Gabon", "Saudi Arabia", "Ireland")
gap2002 <- gapminder %>% subset(country %in% my_country_list) %>% 
  filter(year == "2002")
gap2002
urban_pop <- gap2002 %>%
  select(country, continent, gdpPercap, countrycode) %>% 
  group_by(countrycode) %>% 
  do(get_urban_population(.$countrycode))

I get this Error: Results are not data frames at positions: 1, 2, 3, 4, 5

I tried to change the return of the function to a data frame: return(data.frame(urban_pop)) and it works for one country, but when I try it for every country or for my set of countries it doesn't show any error and doesn't show any output either... it just sits there blank

ksamuk commented 8 years ago

OK, a few problems. I didn't notice this in your code before, but the output of do needs to be assigned a (column) name, just like summarise. Also, if you don't want a weird list column with a single value, you need to wrap the whole expression inside do in data.frame.
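
A small sketch of the three cases, assuming a stub `fake_population()` (hypothetical; it returns made-up numbers rather than calling GeoNames): an unnamed do() must return a data frame, a named do() stores each result in a list column, and putting data.frame() around the call gives an ordinary column.

```r
library(dplyr)

# Toy stand-in for gap2002; not real gapminder rows.
toy <- data.frame(
  country = c("Gabon", "Ireland"),
  countrycode = c("GA", "IE"),
  stringsAsFactors = FALSE
)

# Hypothetical stub: returns a bare number, like the original function did.
fake_population <- function(code) {
  nchar(code) * 1000
}

grouped <- toy %>% group_by(country)

# 1. Unnamed do() returning a bare number: dplyr stops with an error.
unnamed_bare <- tryCatch(
  grouped %>% do(fake_population(.$countrycode)),
  error = function(e) "error"
)

# 2. Named do(): runs, but urban_pop is a list column.
named <- grouped %>%
  do(urban_pop = fake_population(.$countrycode))

# 3. Unnamed do() around data.frame(): urban_pop is a plain numeric column.
wrapped <- grouped %>%
  do(data.frame(urban_pop = fake_population(.$countrycode)))
```

Case 1 is the "Results are not data frames" situation from the earlier post; case 3 is the form that prints as a regular two-column table.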

Secondly, it will speed things up to just query the country you want each time, vs. getting everything from geonames and filtering it every time. So, have a look at the (reproducible) modifications below:

library(dplyr)
library(geonames)
library(gapminder)

options(geonamesUsername="insert_user_name")

get_urban_population <- function(country_code){
  countrybox <- GNcountryInfo(country = country_code)

  country_cities <- GNcities(north = countrybox$north, 
                             south = countrybox$south,
                             east = countrybox$east,
                             west = countrybox$west,
                             maxRows = 500) 

  country_cities <- country_cities %>%
    filter(countrycode == country_code)

  urban_pop <- country_cities  %>%
    select(population) %>%
    unlist %>% 
    as.numeric %>% 
    sum
  return(urban_pop)
}

my_country_list <- c("Kuwait", "Libya", "Gabon", "Saudi Arabia", "Ireland")

gap2002 <- gapminder %>%
  subset(country %in% my_country_list) %>%
  filter(year == "2002")

# ISO 3166-1 alpha-2 codes in row order; Libya is "LY" ("LB" is Lebanon)
gap2002$countrycode <- c("GA", "IE", "KW", "LY", "SA")

urban_pop <- gap2002 %>%
  select(country, continent, gdpPercap, countrycode) %>% 
  group_by(country) %>%
  do(data.frame(urban_pop = get_urban_population(.$countrycode)))

urban_pop
Source: local data frame [5 x 2]
Groups: country [5]

       country urban_pop
        (fctr)     (dbl)
1        Gabon    932660
2      Ireland   2177007
3       Kuwait    986889
4        Libya   2859178
5 Saudi Arabia  12357892

bdacunha commented 8 years ago

Thank you soo much!! It works fine now!!

jennybc commented 8 years ago

I'm having an OCD moment but if you're filtering already, you can eliminate the subset statement:

gapminder %>% 
  filter(country %in% my_country_list, year == "2002")

I'm surprised year is character? Also, a join or match would be a safer way to bring in those two-letter country codes. Less likely to create a puzzle when/if you scale up.
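
A minimal sketch of the join idea, on a toy stand-in for gapminder so it runs without extra packages. The names `gap_toy` and `code_lookup` are assumptions for illustration, and the lookup table is hand-typed. gapminder stores year as an integer, so the filter compares against a number. Because the codes are matched by country name, the order of the lookup rows no longer matters.

```r
library(dplyr)

# Toy stand-in for gapminder: year is an integer column, as in the real data.
gap_toy <- data.frame(
  country = c("Gabon", "Gabon", "Ireland", "Libya", "Libya"),
  year = c(1997L, 2002L, 2002L, 2002L, 2007L),
  stringsAsFactors = FALSE
)

# Lookup of ISO 3166-1 alpha-2 codes, deliberately in a different order
# from the data to show that matching is by name, not by position.
code_lookup <- data.frame(
  country = c("Kuwait", "Libya", "Gabon", "Saudi Arabia", "Ireland"),
  countrycode = c("KW", "LY", "GA", "SA", "IE"),
  stringsAsFactors = FALSE
)

gap2002_toy <- gap_toy %>%
  filter(year == 2002) %>%
  left_join(code_lookup, by = "country")
```

With the real gapminder data, country is a factor, so it may need as.character() before the join to avoid a type-coercion warning.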

ksamuk commented 8 years ago

Right, I just added the country codes manually so the example would be reproducible. I assume Brenda has some other (unstated) method for this.