appliedepi / epiRhandbook_eng

The repository for the English version of the Epidemiologist R Handbook

3_External_review #262

Open jarvisc1 opened 3 weeks ago

arranhamlet commented 2 weeks ago

Created a section on pmap(), this section was previously listed as “under construction”. Included 3 examples, going from generic to more complex.

arranhamlet commented 2 weeks ago

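The function pmap() from the purrr package lets us apply a function over multiple vectors at once, and it comes in the same typed variants as map() (for example pmap_dbl() and pmap_chr()). The "p" in pmap() stands for parallel: it works down a dataset or list sequentially, taking the values at the same position from each vector and carrying out your operation on them. Note that it does not refer to parallel computing.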

In pmap() you specify a single dataset or list that contains all of the vectors or lists that you want to supply to your function. This lets you quickly carry out calculations across multiple columns of a data frame, or across lists of information.
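To illustrate the single-list input described above, here is a minimal sketch. The list `inputs` and its element names are hypothetical and chosen only for illustration.

```r
library(purrr)

# Hypothetical inputs: three vectors of equal length, bundled in one list
inputs <- list(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  z = c(100, 200, 300)
)

# pmap_dbl() calls the function once per position:
# (1, 10, 100), then (2, 20, 200), then (3, 30, 300)
totals <- pmap_dbl(inputs, function(x, y, z) x + y + z)

totals
```

Because the list elements are named, each one is matched to the function argument of the same name.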

For example, here is a simple dataset with three columns of numbers.


data_generic <- data.frame(
     A = c(1, 10, 100),
     B = c(3, 6, 9),
     C = c(25, 75, 50)
)

data_generic

Here we use the function sum() from base R to calculate the sum of each row.


data_generic %>%    #Our dataset
     pmap_dbl(sum)      #The function we want to use

You can see that pmap_dbl() has gone through each row of the dataset and summed the values. For a simple sum like this there are other options, such as rowSums() from base R, which is vectorised and will usually be faster on large datasets. The advantage of the pmap_*() functions is flexibility: they let you apply custom functions and specify more complicated inputs.
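As a rough sketch of that comparison, the code below recreates data_generic, checks the pmap_dbl() result against rowSums(), and then applies a custom function row by row. The function weighted_total() and its weights are hypothetical, invented here only to show a case that rowSums() does not cover directly.

```r
library(purrr)

data_generic <- data.frame(
  A = c(1, 10, 100),
  B = c(3, 6, 9),
  C = c(25, 75, 50)
)

# Row sums via pmap_dbl(): each row's values are passed to sum() as arguments
row_totals <- pmap_dbl(data_generic, sum)

# The same result from base R; unname() drops the row names rowSums() adds
base_totals <- unname(rowSums(data_generic))

# Hypothetical custom function: argument names match the column names
weighted_total <- function(A, B, C) {
  A + 2 * B + 0.5 * C
}

weighted <- pmap_dbl(data_generic, weighted_total)
```

Here row_totals and base_totals hold the same values, while weighted holds the result of the custom calculation for each row.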

For example, here we are going to create a new column to count how many symptoms those in our linelist dataset have.


linelist_symptom_count <- linelist %>%                        #Our dataset
     mutate(number_symptoms = linelist %>%                    #Create a new column to count symptoms
                 select(fever:vomit) %>%                      #Select the columns that indicate the presence of symptoms
                 pmap_int(~sum(c(...) == "yes", na.rm = TRUE)))   #For each row, count how many symptom values are "yes"

#Display the results
linelist_symptom_count %>%
     select(fever:vomit, number_symptoms) %>%
     slice(1:10)

As another example, here we have written our own custom function using str_glue() (see the section on Characters and strings) to summarise each patient's gender, age, date of onset, outcome, and date of outcome:


#Function
summarise_function <- function(case_id, gender, age, date_onset, date_outcome, outcome, ...){
     str_glue("Case {case_id} who had the gender {gender} and the age {age}, had symptom onset on {date_onset}, and had the outcome of {outcome} on {date_outcome}.")
}

#Run the custom pmap function
linelist_summary <- linelist %>%
     pmap_chr(summarise_function)

#Display only the first 2 for ease of viewing
linelist_summary[1:2]

Note that here we did not even have to specify which columns to use, because the argument names in summarise_function() are the same as the column names in the dataset. pmap_*() functions automatically match column or list names to the function's arguments, and the ... in the function definition absorbs any columns the function does not use.
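The name matching can be checked on a small example. This is only a sketch: toy_cases and describe_case() are hypothetical, and paste0() from base R is used in place of str_glue() to keep the example dependent on purrr alone.

```r
library(purrr)

# Hypothetical toy data: columns are deliberately in a different order
# from the function arguments, and "hospital" is not used by the function
toy_cases <- data.frame(
  outcome  = c("Recover", "Death"),
  hospital = c("Port", "Central"),
  case_id  = c("a1", "b2"),
  age      = c(34, 61),
  stringsAsFactors = FALSE
)

# Arguments are matched to columns by name; ... absorbs unused columns
describe_case <- function(case_id, age, outcome, ...) {
  paste0("Case ", case_id, ", age ", age, ", outcome ", outcome)
}

toy_summary <- pmap_chr(toy_cases, describe_case)

toy_summary
```

If the ... were left out of describe_case(), the call would fail with an unused argument error, because pmap_chr() passes every column, including hospital, to the function.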