brad-cannell / r4epi

Repository for the R for Epidemiology book
http://www.r4epi.com/

Add vocabulary to Using R for Epidemiology chapter #81

Open mbcann01 opened 1 year ago

mbcann01 commented 1 year ago

We may want to introduce some basic vocabulary very early in the book. This is not meant to be a complete appendix; it's just a short review chapter that gets us up to speed. Here is a running list of potential words to start with:

mbcann01 commented 1 year ago

Measurement: We typically evaluate and quantify our intuitions about health and disease with numbers. Contrast this with anecdotes. Surely, the one person who happened to survive cancer on a diet consisting exclusively of honey could have had an article written about them. That doesn’t mean honey cured that person’s cancer, and even if it did, it doesn’t mean that it would be likely to cure cancer in you.

These terms are a source of confusion for many people, including me. I think the source of the confusion lies in the difference between how we use them in everyday speech and how we should use them technically, when talking about defined mathematical quantities.

The purpose of measuring things is to analyze them. In his book, The Book of Why, Dr. Judea Pearl discusses the three rungs on the ladder of causation. His first rung is association. In epidemiology, I would argue that we actually have a lower, yet still useful, rung: description.

Variable types: discrete/categorical and continuous.
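For example, here is a minimal R sketch (the data frame, variable names, and values are made up purely for illustration):

```r
# A small, made-up data frame to illustrate variable types
study <- data.frame(
  id        = 1:5,                                       # identifier
  age_years = c(34.2, 51.8, 47.0, 29.5, 63.1),           # continuous
  n_visits  = c(2L, 0L, 5L, 1L, 3L),                     # discrete (count)
  smoker    = factor(c("yes", "no", "no", "yes", "no"))  # categorical (factor)
)

str(study)  # shows the type (class) of each column
```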

Defining an event. This can sometimes be very tricky.

Defining a time-frame.

Count.

Ratio.

Proportion: A proportion refers to the fraction of the total that possesses a certain attribute. It is something we can observe. For example, there are 10 people in our sample and 3 of them are living with diabetes. Therefore, the proportion of people in our sample who are living with diabetes is 3 out of 10, or 3/10. Equivalently, we could divide 3 by 10 and say that the proportion is 0.3. We could also multiply 0.3 by 100 and say that 30% of the people in our sample are living with diabetes (“percent” just means per 100; in this case, per 100 people). These are all just mathematical transformations of the same quantity (i.e., 3 people in a group of 10 people who are living with diabetes).
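A quick R sketch of the same arithmetic (the indicator variable below is made up for illustration):

```r
# Made-up indicator variable: 1 = living with diabetes, 0 = not
diabetes <- c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0)

sum(diabetes)         # count: 3 people living with diabetes
length(diabetes)      # total: 10 people in the sample
mean(diabetes)        # proportion: 3 / 10 = 0.3
mean(diabetes) * 100  # percent: 30
```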

Probability: A probability may look similar to a proportion (e.g., 0.3), but it is more of a theoretical concept than an observed quantity. It is also closely related to the idea of uncertainty. Previously, when we talked about proportions, we said that we “observed” (i.e., had measured data about) 10 real people. Three of those real people were living with diabetes. Therefore, the proportion (3/10, or 0.3) summarized something we actually observed. A probability, by contrast, is a more hypothetical quantity, and exactly what it means depends on which school of statistics you ask.

In Frequentist (or Classical) statistics, we suppose that our sample of data is the result of one of an infinite number of exactly repeated experiments. The sample we see in this context is assumed to be the outcome of some probabilistic process. Any conclusions we draw from this approach are based on the supposition that events occur with probabilities, which represent the long-run frequencies with which those events occur in an infinite series of experimental repetitions. For example, if we flip a coin, we take the proportion of heads observed in an infinite number of throws as defining the probability of obtaining heads. Frequentists suppose that this probability actually exists, and is fixed for each set of coin throws that we carry out. The sample of coin flips we obtain for a fixed and finite number of throws is generated as if it were part of a longer (that is, infinite) series of repeated coin flips (see the left-hand panel of Figure 2.1). In Frequentist statistics the data are assumed to be random and result from sampling from a fixed and defined population distribution. For a Frequentist the noise that obscures the true signal of the real population process is attributable to sampling variation – the fact that each sample we pick is slightly different and not exactly representative of the population. We may flip our coin 10 times, obtaining 7 heads even if the long-run proportion of heads is 0.5. To a Frequentist, this is because we have picked a slightly odd sample from the population of infinitely many repeated throws. If we flip the coin another 10 times, we will likely get a different result because we then pick a different sample.

Lambert, Ben. A Student’s Guide to Bayesian Statistics (pp. 17-18). SAGE Publications. Kindle Edition.
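To make the Frequentist picture concrete, here is a minimal simulation sketch in R (the seed and number of repetitions are arbitrary choices for illustration): each sample of 10 coin flips yields a slightly different observed proportion of heads, even though the long-run probability is fixed at 0.5.

```r
set.seed(13)  # arbitrary seed so the sketch is reproducible

# One sample: 10 flips of a fair coin (1 = heads, 0 = tails)
flips <- rbinom(n = 10, size = 1, prob = 0.5)
mean(flips)  # observed proportion of heads in this particular sample

# Many repeated samples: each gives a slightly different proportion,
# but they cluster around the long-run probability of 0.5
many_props <- replicate(10000, mean(rbinom(n = 10, size = 1, prob = 0.5)))
summary(many_props)
hist(many_props, main = "Sampling variation in the proportion of heads")
```

The Bayesian view, described in the same book, is quite different: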

Bayesians do not imagine repetitions of an experiment in order to define and specify a probability. A probability is merely taken as a measure of certainty in a particular belief. For Bayesians the probability of throwing a ‘heads’ measures and quantifies our underlying belief that before we flip the coin it will land this way. [In this view, probabilities are not underlying laws of] cause and effect. They are merely abstractions which we use to help express our uncertainty. In this frame of reference, it is unnecessary for events to be repeatable in order to define a probability. We are thus equally able to say, ‘The probability of a heads is 0.5’ or ‘The probability of the Democrats winning the 2020 US presidential election is 0.75’. Probability is merely seen as a scale from 0, where we are certain an event will not happen, to 1, where we are certain it will (see the right-hand panel of Figure 2.1). A statement such as ‘The probability of the Democrats winning the 2020 US presidential election is 0.75’ is hard to explain using the Frequentist definition of a probability. There is only ever one possible sample – the history that we witness – and what would we actually mean by the ‘population of all possible US elections which happen in the year 2020’?

For Bayesians, probabilities are seen as an expression of subjective beliefs, meaning that they can be updated in light of new data. The formula invented by the Reverend Thomas Bayes provides the only logical manner in which to carry out this updating process. Bayes’ rule is central to Bayesian inference whereby we use probabilities to express our uncertainty in parameter values after we observe data. Bayesians assume that, since we are witness to the data, it is fixed, and therefore does not vary. We do not need to imagine that there are an infinite number of possible samples, or that our data are the undetermined outcome of some random process of sampling.

We never perfectly know the value of an unknown parameter (for example, the probability that a coin lands heads up). This epistemic uncertainty (namely, that relating to our lack of knowledge) means that in Bayesian inference the parameter is viewed as a quantity that is probabilistic in nature. We can interpret this in one of two ways. On the one hand, we can view the unknown parameter as truly being fixed in some absolute sense, but our beliefs are uncertain, and thus we express this uncertainty using probability. In this perspective, we view the sample as a noisy representation of the signal and hence obtain different results for each set of coin throws. On the other hand, we can suppose that there is not some definitive true, immutable probability of obtaining a heads, and so for each sample we take, we unwittingly get a slightly different parameter. Here we get different results from each round of coin flipping because each time we subject our system to a slightly different probability of its landing heads up. This could be because we altered our throwing technique or started with the coin in a different position. Although these two descriptions are different philosophically, they are not different mathematically, meaning we can apply the same analysis to both.

Lambert, Ben. A Student’s Guide to Bayesian Statistics (pp. 18-19). SAGE Publications. Kindle Edition.
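And here is a minimal sketch of that Bayesian updating idea in R (a Beta prior for the probability of heads; the prior and the data are invented for illustration and are not from the book):

```r
# Belief about the probability of heads, expressed as a Beta distribution.
# Beta(1, 1) is a flat prior: every value between 0 and 1 is equally plausible.
prior_alpha <- 1
prior_beta  <- 1

# Invented data: 10 flips, 7 heads, 3 tails
heads <- 7
tails <- 3

# Bayes' rule with a Beta prior and binomial data gives another Beta distribution
# (the conjugate "Beta-binomial" update)
post_alpha <- prior_alpha + heads
post_beta  <- prior_beta + tails

# Posterior mean: our updated belief about the probability of heads
post_alpha / (post_alpha + post_beta)  # (1 + 7) / (1 + 1 + 10) = 8/12, about 0.67

# Compare the prior and posterior densities
curve(dbeta(x, prior_alpha, prior_beta), from = 0, to = 1, ylim = c(0, 3.2),
      xlab = "Probability of heads", ylab = "Density")
curve(dbeta(x, post_alpha, post_beta), add = TRUE, lty = 2)
legend("topleft", legend = c("Prior", "Posterior"), lty = c(1, 2))
```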

Whoa! What does all of that mean for us? Well, when we talked about a proportion, there were 10 real people that we actually observed, and the proportion simply summarized those observations. A probability is not tied to a particular set of observations in the same way.

Difference between proportion and probability: A probability is a hypothetical property. Proportions summarize observations.

Conditional probability.
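As one possible illustration (the two-by-two table below is invented, not from any real study), a conditional probability restricts attention to a subgroup: the probability of disease given exposure is the proportion with disease among the exposed only.

```r
# Invented 2 x 2 table of exposure status by disease status
tab <- matrix(c(20, 80,    # exposed:   20 with disease, 80 without
                10, 90),   # unexposed: 10 with disease, 90 without
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              disease  = c("yes", "no")))
tab

# P(disease), ignoring exposure: (20 + 10) / 200 = 0.15
sum(tab[, "yes"]) / sum(tab)

# P(disease | exposed): condition on the exposed row only: 20 / 100 = 0.20
tab["exposed", "yes"] / sum(tab["exposed", ])
```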

Risk.

Likelihood.

Data is not as objective as we would like to believe; it is not inherently objective. Assumptions and subjective decisions are made at every step on the way from data to conclusion: collection (what do we collect, who do we collect it from, how often do we collect it), cleaning (missing data, calculated variables), analysis (statistical assumptions), and interpretation. I’m not saying this is a “bad” thing. It’s not. It’s just the way science currently works. But we should be as honest and transparent about it as possible.