STAT545-UBC / Discussion

Designing helper functions in order to quickly do summary stats #483

Open aramcb opened 7 years ago

aramcb commented 7 years ago

Let's say I'm using gapminder (I've made a copy called "my_gap"). I want to calculate summary means for population, lifeExp, and gdpPercap, ONLY for countries whose population is less than 9000. A solution I have in mind is to use summarise_each and pass in a helper function that computes this mean from the variable and data frame I give it,

i.e., with(df, mean(population[tap_num <= 9000]))

my_gap %>% group_by(country) %>% summarise_each(funs(helper_fxn), population, lifeExp, gdpPercap)

where the helper function is as follows:

avg_less_than9000 <- function(df, variable) with(df, mean(variable[tap_num <= 9000]))

The issue here is that when I run the helper, I get the message: object "variable" not found.

I realize I could just filter first for countries with populations less than 9000, but what if I want to create a big tibble which also includes summary statistics for countries whose population is more than 9000?

jennybc commented 7 years ago

There's a lot going on here.

"ONLY for countries whose population is less than 9000". Population changes over time, but you are taking means over time. What is the precise rule for deciding which countries you want to operate on? Consider a country whose population is less than 9000 in some years of the data and greater than 9000 in others. Keep or not?

Programming with dplyr, which is how I'd describe what you're doing with this helper function, is tricky and is also very much in flux right now. I would recommend putting the "smarts" of your filtered mean into some data operations instead of in a function you provide to summarise_each(). That is not long-term advice, but pragmatic immediate advice.

Form whatever country level summary you need, filter it, and note down which countries make the cut.

Then put a normal filter() statement into a pipeline with summarise_each().
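A minimal sketch of those two steps, under several assumptions that are not from the thread: the toy tibble below stands in for my_gap (hypothetical values, gapminder-style column names, so population is `pop`), the rule for "makes the cut" is taken to be "population never reached 9000 in any year", and `summarise(across(...))` from dplyr 1.0 or later is used in place of summarise_each(), which has been deprecated in later dplyr releases.

```r
library(dplyr)

# Toy stand-in for my_gap: hypothetical values, gapminder-style columns
my_gap <- tibble(
  country   = rep(c("A", "B", "C"), each = 2),
  year      = rep(c(2002, 2007), times = 3),
  pop       = c(5000, 7000, 8000, 12000, 20000, 25000),
  lifeExp   = c(60, 62, 70, 72, 75, 77),
  gdpPercap = c(1000, 1200, 3000, 3500, 9000, 9500)
)

# Step 1: form a country-level summary, filter it, and note which
# countries make the cut (assumed rule: max population below 9000)
small_countries <- my_gap %>%
  group_by(country) %>%
  summarise(max_pop = max(pop)) %>%
  filter(max_pop < 9000) %>%
  pull(country)

# Step 2: an ordinary filter() in the pipeline, then per-country means
small_means <- my_gap %>%
  filter(country %in% small_countries) %>%
  group_by(country) %>%
  summarise(across(c(pop, lifeExp, gdpPercap), mean))
```

With this toy data only country "A" stays below 9000 in every year, so small_means has a single row. A different rule (for example, mean population below 9000) would only change the summary computed in step 1.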

> I realize I could just filter first for countries with populations less than 9000, but what if I want to create a big tibble which also includes summary statistics for countries whose population is more than 9000?

I don't really see how this hypothetical table would actually work. What rows and columns do you envision?

aramcb commented 7 years ago

Hi @jennybc

Thank you for the help!

The issue I have is that I would like summary statistics for multiple variables (e.g., average lifeExp and gdpPercap) for the same country at different population levels. I then want to graph these summary statistics to show that they vary as a function of population.

The solution I have come up with is to filter by population for each country (i.e., one filter for population < 9000, one filter for population > 9000) and then join these tibbles together so I can graph the statistics.
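One way the approach described above might look, as a sketch only. The toy tibble is a hypothetical stand-in for my_gap with gapminder-style column names, the `pop_group` column and its labels are invented for illustration, "join" is read here as stacking the two results with `bind_rows()`, and the second filter uses `>=` so that a population of exactly 9000 is not dropped.

```r
library(dplyr)

# Toy stand-in for my_gap: hypothetical values, gapminder-style columns
my_gap <- tibble(
  country   = rep(c("A", "B", "C"), each = 2),
  year      = rep(c(2002, 2007), times = 3),
  pop       = c(5000, 7000, 8000, 12000, 20000, 25000),
  lifeExp   = c(60, 62, 70, 72, 75, 77),
  gdpPercap = c(1000, 1200, 3000, 3500, 9000, 9500)
)

# Per-country means over the years in which population was below 9000
below <- my_gap %>%
  filter(pop < 9000) %>%
  group_by(country) %>%
  summarise(across(c(lifeExp, gdpPercap), mean)) %>%
  mutate(pop_group = "under 9000")

# Per-country means over the years in which population was 9000 or more
above <- my_gap %>%
  filter(pop >= 9000) %>%
  group_by(country) %>%
  summarise(across(c(lifeExp, gdpPercap), mean)) %>%
  mutate(pop_group = "9000 or more")

# Stack the two summaries; pop_group records which filter produced each row
by_pop_group <- bind_rows(below, above)
```

In this toy data country "B" crosses the threshold, so it contributes one row to each group, which is the within-country comparison the graph needs. The `pop_group` column can then be mapped to colour or facets when plotting.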