daramireh / rfordatasciencebook


Part II #4

Open daramireh opened 2 years ago

daramireh commented 2 years ago

library(tidyverse)
vignette("tibble")
library(tibble)

Coercing a data frame to a tibble.

as_tibble(iris)

Creating a new tibble from individual vectors

tibble( x = 1:5, y = 1, z = x^2 + y )

Using tribble() to create a new tibble

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)

Differences between a data frame and a tibble

Printing: a tibble prints only the first 10 rows, and only the columns that fit on the screen.

tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)

Controlling the number of rows and columns shown when printing a data frame

nycflights13::flights %>% print(n = 10, width = Inf)

Exercises

2. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?

Data frame

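df <- data.frame(abc = 1, xyz = "a")
df$x                    # partial matching: silently returns the xyz column
df[, "xyz"]             # a single column comes back as a vector, not a data frame
df[, c("abc", "xyz")]   # two columns come back as a data frame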

Tibble

tb <- tibble(abc = 1, xyz = "a")

tb[1]          # one-column tibble
tb$abc         # the column as a vector
tb[, "xyz"]    # still a tibble
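To make the contrast concrete, here is a minimal sketch that runs the same two operations on both objects. It assumes only the tibble package is attached; the result names (df_partial, tb_partial, df_single, tb_single) are made up for illustration.

```r
library(tibble)

df <- data.frame(abc = 1, xyz = "a")
tb <- tibble(abc = 1, xyz = "a")

# `$` on a data frame partially matches column names; on a tibble it does not.
df_partial <- df$x                      # matches the xyz column
tb_partial <- suppressWarnings(tb$x)    # NULL (a tibble warns about the unknown column)

# `[` with a single column drops to a vector for a data frame, but not for a tibble.
df_single <- df[, "xyz"]
tb_single <- tb[, "xyz"]

is.data.frame(df_single)   # FALSE
is_tibble(tb_single)       # TRUE
```

The data frame defaults are the frustrating part: a typo in a column name can silently pick up another column, and the return type of `[` depends on how many columns were selected.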

4. Practice referring to nonsyntactic names in the following data frame by:

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

a. Extracting the variable called 1.

annoying[["1"]]  # or annoying$`1`; annoying["1"] returns a one-column tibble instead

b. Plotting a scatterplot of 1 versus 2.

ggplot(data = annoying, aes(x = `1`, y = `2`)) + geom_point()

c. Creating a new column called 3, which is 2 divided by 1.

annoying <- annoying %>% mutate(`3` = `2` / `1`)

d. Renaming the columns to one, two, and three:

annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)

5. What does tibble::enframe() do?

It converts a named vector (or list) into a two-column tibble of names and values.
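A small sketch of enframe() in use, assuming a made-up named vector called scores. deframe() is its counterpart in tibble and goes the other way.

```r
library(tibble)

# Hypothetical named vector, used only for illustration.
scores <- c(a = 1, b = 2, c = 3)

# One row per element: names go to the "name" column, values to "value".
scores_tbl <- enframe(scores)

# deframe() turns the two-column tibble back into a named vector.
scores_vec <- deframe(scores_tbl)
```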

Chapter 9: Tidy data

table1 is an example of tidy data

table1

Transforming data with dplyr

Compute rate per 10,000

table1 %>% mutate(rate = cases / population * 10000)

Compute cases per year

table1 %>% count(year, wt = cases)

Plotting

ggplot(table1, aes(year, cases)) + geom_line(aes(group = country), color = "grey50") + geom_point(aes(color = country))

Gathering data

table4a has a problem: the column names 1999 and 2000 are values of the year variable, not variables. gather() fixes this. It needs three things:

The set of columns that represent values, not variables. In this example, those are the columns 1999 and 2000.

The name of the variable whose values form the column names. I call that the key, and here it is year.

The name of the variable whose values are spread over the cells. I call that value, and here it's the number of cases.

table4a %>% gather("1999", "2000", key = "year", value = "cases")

We can use gather() to tidy table4b in a similar fashion

table4b %>% gather(`1999`, `2000`, key = "year", value = "population")

Combine the tidied table4a and table4b into one tibble

tidy4a <- table4a %>% gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>% gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)

Spreading: it's the opposite of gathering

You use it when an observation is scattered across multiple rows.

To use spread(), first analyze how the data set is represented: identify the column that contains the variable names (the key) and the column that contains the values (the value).

In table2, the key is type and the value is count.

spread(table2, key = type, value = count)

As you might have guessed from the common key and value arguments, spread() and gather() are complements. gather() makes wide tables narrower and longer; spread() makes long tables shorter and wider.
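The complementary behavior can be sketched on a tiny made-up table. The tibble wide below and its values are hypothetical, not taken from the book's data.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one column per year.
wide <- tibble(
  country = c("A", "B"),
  `1999` = c(10, 20),
  `2000` = c(30, 40)
)

# gather(): 2 rows x 3 columns becomes 4 rows x 3 columns (narrower per year, longer).
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")

# spread(): back to one column per year (shorter and wider).
back <- spread(long, key = year, value = cases)

dim(long)
dim(back)
```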

Exercises

1. Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(   1,    2,    1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

stocks

stocks %>% spread(year, return) %>% gather("year", "return", `2015`:`2016`)

Both spread() and gather() have a convert argument. What does it do?

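The two tibbles are not identical, for two reasons. The column order changes: spread() puts half first, and gather() then adds year and return after it. More importantly, column type information is lost: after the round trip, year is a character vector rather than a number, because its values were stored as column names in between. convert = TRUE runs type.convert() on the new columns to guess a more suitable type.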
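The type change can be checked directly. This is a sketch using the stocks tibble from the exercise; the intermediate names (wide, round_trip, converted) are made up here.

```r
library(tidyr)
library(tibble)

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half   = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

wide <- spread(stocks, year, return)

# Round trip without and with convert = TRUE.
round_trip <- gather(wide, "year", "return", `2015`:`2016`)
converted  <- gather(wide, "year", "return", `2015`:`2016`, convert = TRUE)

class(stocks$year)      # "numeric"
class(round_trip$year)  # "character"
class(converted$year)   # "integer"
names(round_trip)       # "half" "year" "return"
```

Even with convert = TRUE the result is not identical to the original: year comes back as integer rather than double, and the column order still differs.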

2. Why does this code fail?

table4a %>% gather(1999, 2000, key = "year", value = "cases")

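The code fails because 1999 and 2000 are nonsyntactic column names. Without backticks (or quotes), gather() treats them as column positions, and table4a has no 1999th or 2000th column. Wrapping the names in backticks fixes it.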
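A quick way to see both behaviors side by side, sketched with tryCatch() so the failing call does not stop the script. The names failed and fixed are made up for illustration.

```r
library(tidyr)

# Without backticks the numbers are read as column positions, which do not exist.
failed <- tryCatch(
  gather(table4a, 1999, 2000, key = "year", value = "cases"),
  error = function(e) "error"
)

# With backticks they are read as column names.
fixed <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")

names(fixed)
```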

3. Why does spreading this tibble fail? How could you add a new column to fix the problem?

people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

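Spreading fails because "Phillip Woods" has two "age" rows, so name and key do not identify a unique value. One workaround is to drop the duplicate row (value 50) before spreading:

people %>% filter(value != 50) %>% spread(key = key, value = value)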

people2 <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

spread(people2, key = key, value = value)
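The exercise also asks how a new column could fix the problem without dropping data. One possible sketch is to number the repeated observations within each name and key, so that every row is uniquely identified. The column name obs and the object names people_id and people_wide are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

people <- tribble(
  ~name,             ~key,     ~value,
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

# Number repeated (name, key) pairs so each row has a unique identifier.
people_id <- people %>%
  group_by(name, key) %>%
  mutate(obs = row_number()) %>%   # obs is a hypothetical column name
  ungroup()

# name + obs now identify each output row, so spread() succeeds.
people_wide <- spread(people_id, key = key, value = value)
people_wide
```

The second "Phillip Woods" row keeps the age of 50 and gets NA for height, since no second height was recorded.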