STAT545-UBC / Discussion

Public discussion

Issue with using scale_y_log10() and its axis labels #343

Open ghost opened 8 years ago

ghost commented 8 years ago

Hi,

I was playing around with the data and found a potential issue that I thought should be flagged, in case anyone is looking to use this for their own projects. If you put the y axis on a log scale (base 10, or any other base, I assume) with scale_y_log10(), you shouldn't use the axes labels as a guide for the value of a data point.

options(scipen = 999) #Avoiding scientific notation (Reference: http://stackoverflow.com/questions/5352099/how-to-disable-scientific-notation-in-r)
library(gapminder)
library(tidyverse)
p <- ggplot(gapminder, aes(x = year, y = gdpPercap))
#from here: https://github.com/jennybc/ggplot2-tutorial/blob/master/gapminder-ggplot2-stripplot.md
# `fun` replaces `fun.y`, which has been deprecated since ggplot2 3.3.0
p + geom_point() + scale_y_log10() +
  stat_summary(fun = mean, colour = "red", geom = "point", size = 5)

(TableMeans <- gapminder %>%
  group_by(year) %>%
  summarise(Avg = mean(gdpPercap)))

In the 2007 column of points at the far right, the red dot appears to be under 10,000, even though the table shows that the mean is actually 11,680.

I looked around and found a Stack Overflow post that recommended using coord_trans(y = "log10") instead of scale_y_log10().

p + geom_point() + coord_trans(y = "log10") +
  stat_summary(fun = mean, colour = "red", geom = "point", size = 5) #Axis tick labels make more sense.

However, I also noticed the chart gets messed up if you try to use this with coord_flip():

p + geom_point() + coord_trans(y = "log10") +
  stat_summary(fun = mean, colour = "red", geom = "point", size = 5) + coord_flip()

jennybc commented 8 years ago

This is good for everyone to think about...

So what you should expect to see as the big red "mean" dots here is really the geometric mean, not the arithmetic mean.

The data gets logged and then averaged, leading to the visible red dots.

And I think this is what you want. If you've decided that a variable should be treated on the log scale, then you would want to take the average there, rather than taking the average on the raw scale and then logging it.
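
As a rough sketch of the point above, the value that the red dots line up with can be computed by hand: average log10(gdpPercap) within each year, then back-transform. The names means_by_year, arith_mean and geo_mean are made up for this illustration, and the sketch assumes the gapminder and dplyr packages are installed.

```r
# Sketch: compare the raw-scale mean with the mean taken on the log10 scale.
library(gapminder)
library(dplyr)

means_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(
    arith_mean = mean(gdpPercap),            # average on the raw scale
    geo_mean   = 10^mean(log10(gdpPercap))   # average on the log10 scale, back-transformed
  )

means_by_year
```

The geo_mean column should sit below arith_mean in every year, which is why the red dots land lower than the table of arithmetic means suggests.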

Consider this example of 10 ratios: (1/2, 2, 1/3, 3, 1/4, 4, 1/5, 5, 1/6, 6), naturally grouped in pairs. I think most people would agree that the "typical" value of 1/2 and 2 is 1 (assuming you believe them to be ratios), and so on for the other pairs. That means you should work with the data on the log scale and take the geometric mean, not the arithmetic mean. See the example below for a visual explanation. If you put arithmetic means on this plot, you get the blue dots. The geometric mean, which I computed "by hand", and what ggplot2 does by default are the green and red dots tracking across at 1.

library(tidyverse)
geomean <- function(x) {
  exp(mean(log(x)))
}
df <- tibble(
  x = rep(2:6, each = 2),
  y = x ^ rep(c(1, -1), 5)
)
df_avg <- df %>% 
  group_by(x) %>% 
  summarise(arith = mean(y), geom = geomean(y))
# ggplot(df, aes(x, y)) + geom_point()
# ggplot(df, aes(x, y)) + geom_point() + scale_y_log10()
ggplot(df, aes(x, y)) + geom_point() + scale_y_log10() +
  stat_summary(fun = mean, colour = "red", geom = "point", size = 5) +
  geom_point(aes(x = x, y = arith), colour = "blue", size = 5, data = df_avg) +
  geom_point(aes(x = x, y = geom), colour = "green", size = 2, data = df_avg)

jennybc commented 8 years ago

"you shouldn't use the axes labels as a guide for the value of a data point."

The axis labels are a reliable guide for the value of a data point. If you have a tick mark at c and a point falls below, the data value is below c. And vice versa.

Now, I agree it is very hard for people to interpolate between tick marks and guidelines on logged axes, because our brains really want to interpolate linearly, which is incorrect.

But I promise, the points are being drawn in the correct place.
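
To make the interpolation point concrete, here is a small sketch with two hypothetical neighbouring tick marks, 1,000 and 10,000 (picked for illustration, not read off the plot). A point drawn exactly halfway between them on a log10 axis is not at their linear midpoint.

```r
# Two hypothetical neighbouring tick marks on a log10 axis
ticks <- c(1000, 10000)

linear_guess <- mean(ticks)             # what the eye tends to assume: 5500
log_midpoint <- 10^mean(log10(ticks))   # value actually drawn halfway: about 3162

c(linear_guess = linear_guess, log_midpoint = log_midpoint)
```

So a dot that looks "about halfway" between those two guidelines represents a value much closer to the lower one.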