daramireh / rfordatasciencebook

Part II continue chapter 9 #5

Open daramireh opened 2 years ago

daramireh commented 2 years ago

separate() on table3

The rate column contains both the cases and population variables, so it needs to be split into two columns.

table3 %>% separate(rate, into = c("cases", "population"))

By default, separate() leaves the new columns as character vectors.

To have separate() convert them to a better type (here, integer), use convert = TRUE:

table3 %>% separate( rate, into = c("cases", "population"), convert = TRUE )

The default is convert = FALSE.
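A quick way to see what convert changes is to compare the column classes. This is a minimal sketch that assumes the table3 dataset shipped with tidyr; the names as_chr and as_int are only illustrative.

```r
library(tidyr)

# Default behaviour: the new columns stay character
as_chr <- table3 %>%
  separate(rate, into = c("cases", "population"))

# convert = TRUE asks separate() to guess a better type for each new column
as_int <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)

class(as_chr$cases)  # "character"
class(as_int$cases)  # "integer"
```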

Passing an integer to sep splits at that character position; sep = 2 splits after the second character:

table3 %>% separate(year, into = c("century", "year"), sep = 2)
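sep accepts either a string or an integer. The sketch below shows both forms on a small made-up tibble (toy, by_char and by_pos are illustrative names, not part of the notes).

```r
library(tibble)
library(tidyr)

# Hypothetical toy data in the same shape as table3
toy <- tibble(
  rate = c("745/37737", "2666/20595360"),
  year = c("1999", "2000")
)

# A character sep is used as a regular expression to split on
by_char <- toy %>%
  separate(rate, into = c("cases", "population"), sep = "/")

# A positive integer sep splits after that many characters from the left
by_pos <- toy %>%
  separate(year, into = c("century", "year"), sep = 2)

by_char$cases    # "745" "2666"
by_pos$century   # "19" "20"
by_pos$year      # "99" "00"
```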

unite()

unite() is the opposite of separate()

table5 %>% unite(new, century, year)

By default unite() joins the values with an underscore; use sep = "" to join them with no separator:

table5 %>% unite(new, century, year, sep = "")
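To make the difference between the two calls visible, here is a small sketch that assumes the table5 dataset shipped with tidyr; with_underscore and no_sep are illustrative names.

```r
library(tidyr)

# Default separator is "_"
with_underscore <- table5 %>% unite(new, century, year)

# An empty string joins the two pieces directly
no_sep <- table5 %>% unite(new, century, year, sep = "")

with_underscore$new[1]  # "19_99"
no_sep$new[1]           # "1999"
```

Note that the united column is still character; it would need something like as.integer() to be used as a number.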

Exercise

1 What do the extra and fill arguments do in separate()?

Experiment with the various options for the following two toy datasets:

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% separate(x, c("one", "two", "three"))

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% separate(x, c("one", "two", "three"))

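The extra and fill arguments control what separate() does when a row splits into a different number of pieces than there are names in into.

extra handles rows with too many pieces: "warn" (the default) drops the extra pieces with a warning, "drop" drops them silently, and "merge" splits at most length(into) times, so the leftover text stays in the last column.

fill handles rows with too few pieces: "warn" (the default) fills with NA on the right with a warning, "right" does the same silently, and "left" puts the NA on the left.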
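The options can be tried directly on the two toy datasets from the exercise. This is a sketch; the object names (too_many, too_few, and the results) are illustrative.

```r
library(tibble)
library(tidyr)

cols <- c("one", "two", "three")

# Second row has four pieces for three columns
too_many <- tibble(x = c("a,b,c", "d,e,f,g", "h,i,j"))
dropped <- too_many %>% separate(x, cols, extra = "drop")
merged  <- too_many %>% separate(x, cols, extra = "merge")

dropped$three  # "c" "f" "j"    (the "g" is discarded)
merged$three   # "c" "f,g" "j"  (the remainder is kept in the last column)

# Second row has only two pieces for three columns
too_few <- tibble(x = c("a,b,c", "d,e", "f,g,i"))
fill_right <- too_few %>% separate(x, cols, fill = "right")
fill_left  <- too_few %>% separate(x, cols, fill = "left")

fill_right$three  # "c" NA "i"
fill_left$one     # "a" NA "f"
```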

Missing values

There are two kinds of missing values:

Explicitly, i.e., flagged with NA.

Implicitly, i.e., simply not present in the data.

stocks <- tibble( year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016), qtr = c( 1, 2, 3, 4, 2, 3, 4), return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66) )

stocks %>% spread(year, return) #implicit to explicit

na.rm = TRUE in gather() turns explicit missing values into implicit ones:

stocks %>% spread(year, return) %>% gather(year, return, `2015`:`2016`, na.rm = TRUE)
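spread() and gather() still work, but in tidyr 1.0.0 and later they are superseded by pivot_wider() and pivot_longer(). The following is a sketch of the same round trip with the newer functions, assuming a recent tidyr; values_drop_na plays the role of na.rm, and round_trip is an illustrative name.

```r
library(tibble)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

round_trip <- stocks %>%
  # Widening makes the missing 2016 Q1 value explicit (an NA cell)
  pivot_wider(names_from = year, values_from = return) %>%
  # Lengthening with values_drop_na = TRUE drops every NA row again
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )

nrow(round_trip)  # 6: both missing values are now implicit
```

The backticks around 2015 and 2016 are needed because these are column names, not column positions.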

complete() turns implicit missing values into explicit ones.

complete() takes a set of columns, and finds all unique combinations.

It then ensures the original dataset contains all those values,

filling in explicit NAs where necessary.

stocks %>% complete(year, qtr)
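Counting rows before and after shows what complete() adds. A minimal sketch using the stocks data from above (completed is an illustrative name):

```r
library(tibble)
library(tidyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# 2 years x 4 quarters = 8 combinations; 2016 Q1 is added as a new row
completed <- stocks %>% complete(year, qtr)

nrow(stocks)                  # 7
nrow(completed)               # 8
sum(is.na(completed$return))  # 2: the original NA plus the new 2016 Q1 row
```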

fill() takes a set of columns where you want missing values

to be replaced by the most recent nonmissing value (last observation carried forward):

treatment %>% fill(person)
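The notes do not define treatment, so here is a sketch with a made-up table of the same shape (the people and numbers are hypothetical placeholders) to show what fill() does:

```r
library(tibble)
library(tidyr)

# Hypothetical data: the person is only recorded on their first row
treatment <- tribble(
  ~person,    ~treatment, ~response,
  "Person A", 1,          7,
  NA,         2,          10,
  NA,         3,          9,
  "Person B", 1,          4
)

# Each NA in person is replaced by the closest non-missing value above it
filled <- treatment %>% fill(person)

filled$person  # "Person A" "Person A" "Person A" "Person B"
```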

Case study: WHO dataset

Gather the columns whose meaning is not yet known (new_sp_m014 to newrel_f65) into key and cases:

who1 <- who %>% gather( new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE )

Count the key values to get a hint of their structure:

who1 %>% count(key)

Make the key values consistent: newrel is missing its underscore and should be new_rel.

who2 <- who1 %>% mutate(key = stringr::str_replace(key, "newrel", "new_rel"))

Separate key into new, type and sexage by splitting at each underscore:

who3 <- who2 %>% separate(key, c("new", "type", "sexage"), sep = "_")

who3 %>% count(new)

Drop the redundant columns: new is constant, and iso2 and iso3 duplicate country.

who4 <- who3 %>% select(-new, -iso2, -iso3)

Separate sex and age by splitting after the first character:

who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
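To see why each step is needed, the same parsing can be run on a handful of key strings instead of the full dataset. This is a sketch: keys and parsed are illustrative names, and the three strings are column names taken from who.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

keys <- tibble(key = c("new_sp_m014", "new_ep_f2534", "newrel_f65"))

parsed <- keys %>%
  # Without this fix "newrel_f65" would split into only two pieces
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  # First character is the sex, the rest is the age range
  separate(sexage, c("sex", "age"), sep = 1)

parsed$type  # "sp" "ep" "rel"
parsed$sex   # "m" "f" "f"
parsed$age   # "014" "2534" "65"
```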

All the code in one pipeline (here the key column is named code, type is named var, and cases is named value):

who %>% gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% mutate( code = stringr::str_replace(code, "newrel", "new_rel") ) %>% separate(code, c("new", "var", "sexage")) %>% select(-new, -iso2, -iso3) %>% separate(sexage, c("sex", "age"), sep = 1)