Watts-College / cpp-527-fall-2021

A course shell for CPP 527 Foundations of Data Science II
https://watts-college.github.io/cpp-527-fall-2021/

Lab 3 clap score #16

Open Johaning opened 2 years ago

Johaning commented 2 years ago

I understand the mathematical process of finding the log of the number of claps, and can see how it changes from a skewed distribution to a more normal curve. My question is why it's important to do this. For questions 1a and 1b, what is the benefit of reporting the logged clap score rather than the actual number of claps? It seems more intuitive to me to report the actual number, since I can picture the difference between 800 and 200 claps but can't conceptualize the logged values the same way.

lecy commented 2 years ago

The mean is sensitive to outliers so it can be misleading.

For example, consider a style that typically performs poorly but has one article with a large number of claps:

x1 <- c(1,1,1,1,50)
x2 <- c(5,5,5,5,5)
mean( x1 )
[1] 10.8
mean( x2 )
[1] 5

The mean makes style 1 look better, even though the typical article using style 2 gets more claps.

Using a logged value reduces the impact of outliers:

log.x1 <- log(x1)
log.x2 <- log(x2)
mean( log.x1 )
[1] 0.7824046
mean( log.x2 )
[1] 1.609438

You can always exponentiate the mean of the logged values to get back to the original clap scale, which is easier to interpret (the result is the geometric mean):

exp( mean( log.x1 ) )
[1] 2.186724
exp( mean( log.x2 ) )
[1] 5
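
The log, average, exponentiate steps above can be wrapped in one function. This is only a minimal sketch: `geo_mean` is a hypothetical helper name (not something from the lab), and it assumes every clap count is strictly positive, since log(0) is -Inf in R.

```r
# Hypothetical helper: geometric mean computed as exp of the mean of the logs.
# Assumes every value in x is strictly positive.
geo_mean <- function(x) {
  exp( mean( log(x) ) )
}

x1 <- c(1,1,1,1,50)
x2 <- c(5,5,5,5,5)

gm1 <- geo_mean( x1 )   # about 2.19, same as exp( mean( log.x1 ) )
gm2 <- geo_mean( x2 )   # 5

# Equivalent definition: the n-th root of the product of the values
root1 <- prod( x1 )^( 1 / length(x1) )
```

Both versions give the same number for x1, so the unlogged score can be read as a typical clap count that the single 50-clap article pulls up only slightly.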

Alternatively, the median is not as sensitive to outliers so it could be used as well.
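
As a quick check of the median on the same two toy vectors (a sketch using only base R; the variable names `med1` and `med2` are just for illustration):

```r
x1 <- c(1,1,1,1,50)
x2 <- c(5,5,5,5,5)

# The median ignores how extreme the largest value is
med1 <- median( x1 )   # 1
med2 <- median( x2 )   # 5
```

Like the logged comparison, the median ranks style 2 above style 1, and it is already on the original clap scale.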

Johaning commented 2 years ago

Thank you, that explanation helped!