UM-R-for-EnvSci-Registered-Student / General-Discussion

Public repo for general discussion about the course and assignments

Using stat_summary // geom_col and geom_errors #5

Open AurelieNoel opened 3 years ago

AurelieNoel commented 3 years ago

author: "Aurelie Noel" date: "19/10/2020"

I have a question about item #2 that we needed to create for assignment 5. I tried plotting the mean and the standard deviation using the function stat_summary() from the package {ggplot2}, just to compare with the exercise as asked (using the functions geom_col() and geom_errorbar()) (see below).

[Figure: sulphate_plot]

Below is the code I used with `stat_summary()`.

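library(tidyverse)  # ggplot2 and the %>% pipe
library(janitor)    # clean_names()

to_plot <- clean_names(ditch) %>%
  ggplot() +
  # bars: height is the mean sulphate per site
  stat_summary(aes(x = site, y = sulphate, colour = site, fill = site), fun = mean, geom = "bar") +
  # error bars: ymin and ymax as returned by mean_se
  stat_summary(aes(x = site, y = sulphate), fun.data = mean_se, geom = "errorbar") +
  labs(y = "Sulphate concentration")
to_plot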

[Figure: unnamed-chunk-2-1]

My understanding is that the argument fun.data takes an aggregation function, here mean_se (the default), which returns a data frame with ymin, y and ymax that is used to draw the error bars. So I thought the "problem" was that I had not used the same values to define the error bars (for the first plot I used mean-sd as ymin and mean+sd as ymax). To be sure, I compiled a tibble with the minimum and maximum sulphate values for each site, and they do not match the error bars.

summary_sulphate_2 <- clean_names(ditch)%>%
  select(site, sulphate)%>%
  group_by(site)%>% 
  summarise(max_value = max(sulphate, na.rm = TRUE), 
            min_value = min(sulphate, na.rm = TRUE))
summary_sulphate_2
## # A tibble: 5 x 3
##   site   max_value min_value
##   <chr>      <dbl>     <dbl>
## 1 Site 1       712        63
## 2 Site 2       886        86
## 3 Site 3       252         5
## 4 Site 4       198         5
## 5 Site 5       539        55

I think there is something I don't understand, but I don't understand what I don't understand. Thank you, Aurelie

peperg commented 3 years ago

Aurelie,

I am so sorry! Not sure how I missed this one! Are you still looking for an answer?

peperg commented 3 years ago

If I understand your question correctly, you are wondering why the figure that you got from doing the exercise the way we did it in class (using geom_col() and then calculating the SD and adding the error bars as mean+sd and mean-sd) is different from the one you got using stat_summary(). Is that correct?

The reason is that the one we used in class uses the standard deviation (SD), while the one you are making with stat_summary() and mean_se uses the standard error (SE), which is calculated as the SD divided by the square root of the sample size, so it gives a narrower range.
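To make the difference concrete, here is a small sketch in base R. The sulphate values are made up for illustration (they are not the course data), and the variable names are just placeholders.

```r
# Hypothetical sulphate readings for one site (made-up numbers, not the ditch data)
sulphate <- c(60, 75, 90, 105, 120)

n    <- length(sulphate)
avg  <- mean(sulphate)      # mean of the readings
sdev <- sd(sulphate)        # sample standard deviation (SD)
se   <- sdev / sqrt(n)      # standard error of the mean (SE = SD / sqrt(n))

# Error bar limits under each convention
sd_limits <- c(lower = avg - sdev, upper = avg + sdev)  # mean +/- SD (class approach)
se_limits <- c(lower = avg - se,   upper = avg + se)    # mean +/- SE (what mean_se returns)

sd_limits
se_limits
```

With more than one observation the SE is always smaller than the SD, so the SE bars sit inside the SD bars.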

Otherwise, the two approaches would yield the same result if you based the error bars on the sd() function in the stat_summary() approach.

As to why your max and min didn't match, that's because the ones you calculated are the max and min of the actual data, while the ymin and ymax in the dataset generated by stat_summary() are the mean-se and mean+se.
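A quick way to see this is to call mean_se() directly on a vector and compare it with the range of the same vector. This is only a sketch with made-up numbers, not the ditch data.

```r
library(ggplot2)

# Hypothetical sulphate readings (made-up numbers)
sulphate <- c(60, 75, 90, 105, 120)

# What stat_summary(fun.data = mean_se) computes for each group:
# a one-row data frame with columns y, ymin and ymax
se_summary <- mean_se(sulphate)
se_summary

# What summarise(min(), max()) computes: the extremes of the raw data
data_range <- range(sulphate)
data_range
```

Here ymin and ymax fall inside the raw minimum and maximum, because they describe uncertainty around the mean rather than the spread of the individual observations.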

AurelieNoel commented 3 years ago

Thanks a lot! You perfectly understood my very unclear question, and it's actually totally my mistake for mixing up standard deviation and standard error. And while I now understand what I misunderstood, I'm still not sure how to fix it when you say to use the sd function instead. I'm not exactly sure "where". Whenever you have some time, do you mind actually correcting the script "for me"? Sorry about that... I am fairly confident I do not completely understand how stat_summary works, so I think I was trying to get a bit too adventurous.
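One possible way to do this, as a sketch rather than the official course solution: fun.data expects a function that returns a data frame with y, ymin and ymax, so sd() cannot be passed to it directly. The helper mean_sd() below is a hypothetical function written for this example (modelled on mean_se()), and the toy data frame stands in for clean_names(ditch) with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for clean_names(ditch): made-up values, not the course data
toy <- data.frame(
  site     = rep(c("Site 1", "Site 2"), each = 4),
  sulphate = c(63, 150, 300, 712, 86, 200, 400, 886)
)

# Hypothetical helper, modelled on mean_se(): returns the mean and mean +/- 1 SD
mean_sd <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarising
  data.frame(y = mean(x), ymin = mean(x) - sd(x), ymax = mean(x) + sd(x))
}

sd_plot <- ggplot(toy, aes(x = site, y = sulphate)) +
  # bars: mean sulphate per site
  stat_summary(aes(fill = site), fun = mean, geom = "bar") +
  # error bars: mean +/- SD instead of mean +/- SE
  stat_summary(fun.data = mean_sd, geom = "errorbar") +
  labs(y = "Sulphate concentration")
sd_plot
```

The only change from the original script is swapping mean_se for a function that uses sd(), so the error bars should then match the geom_col() and geom_errorbar() version built from mean-sd and mean+sd.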