PNNL-PREMIS / SULI2019

Repository for summer 2019 SULI interns

R Workshop: Part 3 Exercises #22

Closed stephpenn1 closed 5 years ago

stephpenn1 commented 5 years ago

Please post your answers in one comment below. And do take advantage of the formatting tools available when writing comments (https://help.github.com/en/articles/basic-writing-and-formatting-syntax) for readability. Have fun and slack us on the #suli-rstats channel if you need help!

Packages Needed:

dplyr, tidyr, ggplot2, and gapminder (a data package of life expectancy, GDP per capita, and population for 142 countries)

Question 1:

What is the range (hint, hint) of the years in this dataset?

Question 2:

gapminder %>% 
    print(n = 20) %>%  
    summary()

How would you do this in base R (without pipelines)? Think about how functions are structured: function(argument).

Question 3:

Write a pipeline that prints the average life expectancy for each continent in the last year of the dataset. Note that to do this correctly, you'll need to weight by country populations. Paste the resulting tibble/data frame. Also note that you can put ``` on either side of text to format it as code in a comment.

see! oooo hello I'm code

Question 4:

Write a pipeline that picks out China and returns only the year and lifeExp columns and plot the life expectancy over time. Check out the select() function in dplyr.

Extra Credit Pipeline:

Write a pipeline that computes the year of max population for each country.

Plotting Challenge:

Using the full gapminder dataset, reproduce this plot. Note: the x-axis looks like it's been scaled.

Hint: First look at what data is being shown. Has it been filtered? What variables are plotted and how?

[Image: the plot to reproduce]

haileymoore commented 5 years ago

Question 1:

The range of years in gapminder is 1952-2007. To find this, use the code: print(range(gapminder$year))
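
As a small sketch of what range() returns: it gives the minimum and maximum as a two-element vector, and diff() of that vector gives the span. The `years` vector below is a hypothetical stand-in for gapminder$year so the snippet runs without the package.

```r
# Hypothetical stand-in for gapminder$year: every fifth year from 1952 to 2007
years <- seq(1952, 2007, by = 5)

# range() returns c(min, max)
yr_range <- range(years)
print(yr_range)        # 1952 2007

# diff() on the range gives the number of years spanned
yr_span <- diff(yr_range)
print(yr_span)         # 55
```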

Question 2:

gapminder %>% 
  print(n = 20) %>%  
  summary()

This statement can be recreated in base R using the following line: summary(print(gapminder, n=20))
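
A short sketch of why the nesting order comes out this way, assuming dplyr and gapminder are installed (`piped` and `nested` are illustrative names): the pipe passes its left-hand side as the first argument of the next call, so the first pipeline step ends up as the innermost function call.

```r
library(dplyr)      # provides the %>% pipe
library(gapminder)

# x %>% f(y) is evaluated as f(x, y), so pipeline steps nest inside-out:
# print() is the first step, so it becomes the innermost call.
piped  <- gapminder %>% print(n = 20) %>% summary()
nested <- summary(print(gapminder, n = 20))

# print() returns its input invisibly, so summary() receives the full
# data set in both forms and the two results can be compared directly.
identical(piped, nested)
```

Note that n = 20 belongs to print(), not summary(): it controls how many rows are displayed, while the summary is still computed over every row.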

Question 3:

To find the average life expectancy, we must first filter to only the last year of the data set, 2007. We can then group by continent, and then use the summarise function to find the average life expectancy using the base R function weighted.mean:

weighted_life_exp <- gapminder %>% 
  filter(year==2007) %>%
  group_by(continent) %>% 
  summarise(avg_life_Exp=weighted.mean(lifeExp,pop))

Question 4:

We first filter the data set to include only China. We then use select() to extract only the year and lifeExp columns. The result is passed to ggplot() to create a plot of life expectancy over time.

china_life_exp <- gapminder %>%
  filter(country=='China') %>% 
  select(lifeExp, year)

ch_life_exp_time <- ggplot(china_life_exp, aes(year,lifeExp)) + 
  geom_point() +
  labs(x='Year', y='Life Expectancy (yrs)', title='Life Expectancy in China') +
  theme_bw()

[Image: resulting plot]

Extra Credit Pipeline:

Summarise to find the maximum population of each country:

max_pop <- gapminder %>%
  group_by(country) %>%
  summarise(maxpop=max(pop))

Plotting Challenge:

The plot is filtered to the year 1967, with gdpPercap on the x axis and lifeExp on the y axis. The x-axis is on a logarithmic scale. The points are colored by continent and sized by population.

plot_recreate <- gapminder %>%
  filter(year==1967) %>%
  ggplot(aes(gdpPercap,lifeExp,color=continent,size=pop)) +
  geom_point() +
  scale_x_log10() +
  labs(x='GDP per capita', y='life expectancy', title='Year 1967', subtitle='Gapminder Dataset') +
  theme_bw()

[Image: resulting plot]

bpbond commented 5 years ago

@stephpenn1 is the grader on this one but 👏 @hmoore28. One comment, though: the extra credit pipeline asks for

Write a pipeline that computes the year of max population for each country.
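
A minimal sketch of one way to get the year rather than the population, assuming dplyr and gapminder are loaded (`max_pop_year` is an illustrative name, not from the thread): keep the row where each country's population is at its maximum, then return the year from that row.

```r
library(dplyr)
library(gapminder)

# Within each country, keep the row(s) where pop equals that country's
# maximum, then report the year alongside it. If a country had tied
# maxima, this would return more than one row for it.
max_pop_year <- gapminder %>%
  group_by(country) %>%
  filter(pop == max(pop)) %>%
  ungroup() %>%
  select(country, year, pop)

print(max_pop_year)
```

An alternative with the same intent is summarise(year = year[which.max(pop)]), which always returns exactly one row per country.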

marideeweber commented 5 years ago

Question 1

print(range(gapminder$year))

2007 - 1952 = 55 years

Question 2

print(summary(gapminder, n = 20))

Question 3

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>% 
  summarise(avg_life_Exp = weighted.mean(lifeExp, pop)) -> avg_life_exp
print(avg_life_exp)

# A tibble: 5 x 2
  continent avg_life_Exp
  <fct>            <dbl>
1 Africa            54.6
2 Americas          75.4
3 Asia              69.4
4 Europe            77.9
5 Oceania           81.1

Question 4

gapminder %>%
  filter(country == "China") %>% 
  select(year, lifeExp) -> china_life
print(china_life)

[Image: Weber China]

Plotting Challenge

[Image: Weber 1967]

lilliehaddock commented 5 years ago

Question 1

The range of years in this dataset is 1952-2007.

print(range(gapminder$year))

Question 2

The base R version of the 20 row gapminder summary is:

print(summary(gapminder, n = 20))

Question 3

The population-weighted average life expectancy for each continent is found by filtering to the last year in the dataset (2007), grouping by continent, and then applying the weighted.mean function.

life_expectancy <- gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  summarise(weighted_life_exp = weighted.mean(lifeExp, pop))

Question 4

Pipeline that picks out China and returns only the year and lifeExp columns:

china_life_exp_over_time <- gapminder %>% 
  filter(country == "China") %>% 
  select(year, lifeExp)

Plot of life expectancy over time:

plot_china_life_exp <- china_life_exp_over_time %>% 
  ggplot(aes(year, lifeExp)) +
  geom_point() +
  labs(title = "China's Life Expectancy over Time", y = "Life Expectancy")
print(plot_china_life_exp)

[Image: china life expectancy]

Extra credit pipeline

help!

Plotting Challenge

gdp_vs_life_exp <- gapminder %>% 
  filter(year == 1967) %>% 
  ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) +
  geom_point() +
  theme_bw() +
  scale_x_log10() +
  labs(title = "Year 1967", subtitle = "Gapminder Dataset", x = "GDP per capita", y = "life expectancy")
print(gdp_vs_life_exp)

[Image: year 1967]