alan-turing-institute / rds-course

Materials for Turing's Research Data Science course
https://alan-turing-institute.github.io/rds-course/

Module 4 #71

Closed · crangelsmith closed 2 years ago

crangelsmith commented 2 years ago

Finalising version of M4

review-notebook-app[bot] commented 2 years ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

pafoster commented 2 years ago

Overview:

"descriptions of the processes that generate your data" -> Is it true that all modelling requires data?

"This power means that modelling has applications in pretty much any data science problem" -> This statement seems quite generic - I think it would be good to include some examples, or to clarify where the boundaries of data science are drawn, if necessary.

"How does maths learn from data?" -> Perhaps "How does maths allow us to learn from data?" (reading the preceding text, I get the impression that the emphasis is on (human) researchers gaining insight from data, versus the machine itself.

"what and why of statistical modelling" -> The word 'statistical' appears for the first time here. Perhaps clarify above that 'modelling' is taken to be synonymous with 'statistical modelling'.

"Richard McElreath's wonderful Statistical Rethinking" -> Would be good to qualify - perhaps 'wonderfully accessible / readable' etc.?

Summary:

"In general, models are fit to data by finding the most likely set of parameters for that dataset." -> Is this true in general? I am thinking about models which incorporate some form of regularisation, including Bayesian approaches.

"we want to learn about the phenomena" -> It might help to standardise the terminology throughout the material. I noticed that the Overview text mentions the term 'process'. Is 'process' taken to be synonymous with 'phenomenon'?

"Most models learn parameters of probability distributions" -> Is this true for example for Support Vector Machine classifiers (which some people might loosely refer to as 'models')? (Standardising the terminology might help, for example by replacing 'model' with 'statistical model'). Also, might help to clarify by example in the Overview section the types of model that the material considers.

The What and Why of Statistical Modelling

"real-world" -> real world "No model is perfect because the real-world is endlessly complex" -> Agreed, but I can think of at least some scenarios where a simple model gives near-perfect results. Perhaps mention that there is often a trade-off in real-world scenarios between complexity and interpretability -- practical feasibility is another aspect; we might not have enough data or computational resources. "They are infinitely flexible" -> Perhaps say "In theory, they are infinitely flexible"

"What is data" -> The next sentence says "Data are". For consistency, use either the singular or the plural.

"intrinsic variability" -> It might be provide some examples of intrinsic randomness vs. non-intrinsic randomness. Some processes might ultimately be deterministic, but it is nonetheless useful to model them as random processes.

"Most of the modelling concepts covered in this module appear in both frameworks, but it is important to understand where they differ. We will highlight when this is the case." -> Is this actually followed through? I was expecting a Bayesian treatment of the ice cream example, for example, involving subjective probabilities.

P(X|Y) -> What is the interpretation of upper case terms vs. lower case terms?
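(For reference, the usual convention, which the material could state explicitly, is that upper case letters denote random variables and lower case letters denote particular values they take, so

$$P(X = x \mid Y = y)$$

is commonly abbreviated to $P(x \mid y)$.)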

"Modelling is often fitted" -> Models are often fitted

"the distribution is [0.7,0.3]" -> It would be good to formalise this a bit more. P(x) = 0.7 if x = chocolate, 0.3 otherwise. Or something similar.

"𝑃(𝑥) is a parameter" This is confusing; up until now I interpreted P(x) as a probability. It might be useful to emphasise that we can parametrise distributions. I think there is a substantial difference (conceptually at a high level and mathematically) between the parameter vector [0.7, 0.3] and the probability distribution which it represents. As the text correctly alludes, for the Bernoulli distribution, the parameter is scalar-valued.

"data = bernoulli.rvs(p=.7,size=15)" -> Add a random seed for reproducibility?

"With a big enough sample the amount of people choosing chocolate will always rest at our chosen parameter." -> Perhaps "by increasing the sample size we can approximate the parameter arbitrarily closely, or something" or similar

"Crucially this holds even if the data is not normally distributed (as in our case)." -> Perhaps mention that it also requires that samples are independent.

"we will need to estimate them" -> This seems to imply that statistical learning allows us to select among models as well as their parameters. Is that intended? From the text, I get the impression that stat. learning is about inferring the parameters for a given model.

Fitting (Regression Models)

"they are collection" -> "they are collections"

"We fit a model by minimising a cost function" -> Is this always the case? (See Bayesian approaches).

"Let us have a dataset consisting of one random variable" -> The text introduces the notion of a random variable here for the first time. Possible to explain in preceding text?

"Let us have a dataset consisting of one random variable" -> If it is a single random variable, then why/what do we sum over in the following equation? Shouldn't it be N independent and identically distributed random variables? Do we actually need to introduce random variables here?

"The mean is a poor model" -> I don't believe it is a poor model, if the goal is simply to predict the next random variate.

"computationally infeasible" -> "can lead to issues with numerical precision"

"called convergence" -> Not all optimisers involve the notion of convergence.

"𝐗=𝐱" -> This vector notation appears here for the first time.

"From linear regression to logistic regression" -> For completeness, the material should introduce the cost function and how we optimise parameters.

Evaluating models

"self reported" -> "self-reported"

"The goodness-of-fit metrics alone are not enough for a full evaluation of the model. A fit can learn perfectly the data but can be overfitted, meaning that they have learned the pecularities (noise) of the dataset and will not make good predictions on future unseen samples (we touched on overfitting in the section Fitting a Model)." -> There ways to express goodness of fit which account for model complexity and which therefore attempt to address overfitting.

nbarlowATI commented 2 years ago

Comments on the first couple of sections - more to come next week.

Overview

I really like the introductory paragraph - gives a great, concise explanation of why we should be interested.

What and why of statistical modelling

Fitting Models

nbarlowATI commented 2 years ago

Some comments on the remaining sections, mostly typos:

4.3 Building a model

Evaluating models

lannelin commented 2 years ago

Module looks really good, nice work!

Have made a few notes in some spare time over last few days. Hope they're helpful. Tried not to replicate the above reviews. Some points are pretty subjective so won't be offended at all if you ignore them!

The What and Why of Statistical Modelling

Fitting (Regression) Models

4.3 Building a simple model

4.4 Evaluating models

General

callummole commented 2 years ago

Still some action to take on the reviews but merging into develop for the course start (15/11/21)