alan-turing-institute / rds-course

Materials for Turing's Research Data Science course
https://alan-turing-institute.github.io/rds-course/

Module 4 #71

Closed · crangelsmith closed 2 years ago

crangelsmith commented 2 years ago

Finalising version of M4

review-notebook-app[bot] commented 2 years ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

pafoster commented 2 years ago

Overview:

"descriptions of the processes that generate your data" -> Is it true that all modelling requires data?

"This power means that modelling has applications in pretty much any data science problem" -> This statement seems quite generic - I think it would be good to include some examples, or to clarify where the boundaries of data science are drawn, if necessary.

"How does maths learn from data?" -> Perhaps "How does maths allow us to learn from data?" (reading the preceding text, I get the impression that the emphasis is on (human) researchers gaining insight from data, versus the machine itself.

"what and why of statistical modelling" -> The word 'statistical' appears for the first time here. Perhaps clarify above that 'modelling' is taken to be synonymous with 'statistical modelling'.

"Richard McElreath's wonderful Statistical Rethinking" -> Would be good to qualify - perhaps 'wonderfully accessible / readable' etc.?

Summary:

"In general, models are fit to data by finding the most likely set of parameters for that dataset." -> Is this true in general? I am thinking about models which incorporate some form of regularisation, including Bayesian approaches.

"we want to learn about the phenomena" -> It might help to standardise the terminology throughout the material. I noticed that the Overview text mentions the term 'process'. Is 'process' taken to be synonymous with 'phenomenon'?

"Most models learn parameters of probability distributions" -> Is this true for example for Support Vector Machine classifiers (which some people might loosely refer to as 'models')? (Standardising the terminology might help, for example by replacing 'model' with 'statistical model'). Also, might help to clarify by example in the Overview section the types of model that the material considers.

The What and Why of Statistical Modelling

"real-world" -> real world "No model is perfect because the real-world is endlessly complex" -> Agreed, but I can think of at least some scenarios where a simple model gives near-perfect results. Perhaps mention that there is often a trade-off in real-world scenarios between complexity and interpretability -- practical feasibility is another aspect; we might not have enough data or computational resources. "They are infinitely flexible" -> Perhaps say "In theory, they are infinitely flexible"

"What is data" -> The next sentence says "Data are". For consistency, use either the singular or the plural.

"intrinsic variability" -> It might be provide some examples of intrinsic randomness vs. non-intrinsic randomness. Some processes might ultimately be deterministic, but it is nonetheless useful to model them as random processes.

"Most of the modelling concepts covered in this module appear in both frameworks, but it is important to understand where they differ. We will highlight when this is the case." -> Is this actually followed through? I was expecting a Bayesian treatment of the ice cream example, for example, involving subjective probabilities.

P(X|Y) -> What is the interpretation of upper case terms vs. lower case terms?
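(For reference, the usual convention, which the material could state explicitly, is that upper case letters denote random variables and lower case letters denote particular values they take, so

$$P(X = x \mid Y = y)$$

is commonly abbreviated to $P(x \mid y)$.)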

"Modelling is often fitted" -> Models are often fitted

"the distribution is [0.7,0.3]" -> It would be good to formalise this a bit more. P(x) = 0.7 if x = chocolate, 0.3 otherwise. Or something similar.

"𝑃(𝑥) is a parameter" This is confusing; up until now I interpreted P(x) as a probability. It might be useful to emphasise that we can parametrise distributions. I think there is a substantial difference (conceptually at a high level and mathematically) between the parameter vector [0.7, 0.3] and the probability distribution which it represents. As the text correctly alludes, for the Bernoulli distribution, the parameter is scalar-valued.

"data = bernoulli.rvs(p=.7,size=15)" -> Add a random seed for reproducibility?

"With a big enough sample the amount of people choosing chocolate will always rest at our chosen parameter." -> Perhaps "by increasing the sample size we can approximate the parameter arbitrarily closely, or something" or similar

"Crucially this holds even if the data is not normally distributed (as in our case)." -> Perhaps mention that it also requires that samples are independent.

"we will need to estimate them" -> This seems to imply that statistical learning allows us to select among models as well as their parameters. Is that intended? From the text, I get the impression that stat. learning is about inferring the parameters for a given model.

Fitting (Regression Models)

"they are collection" -> "they are collections"

"We fit a model by minimising a cost function" -> Is this always the case? (See Bayesian approaches).

"Let us have a dataset consisting of one random variable" -> The text introduces the notion of a random variable here for the first time. Possible to explain in preceding text?

"Let us have a dataset consisting of one random variable" -> If it is a single random variable, then why/what do we sum over in the following equation? Shouldn't it be N independent and identically distributed random variables? Do we actually need to introduce random variables here?

"The mean is a poor model" -> I don't believe it is a poor model, if the goal is simply to predict the next random variate.

"computationally infeasible" -> "can lead to issues with numerical precision"

"called convergence" -> Not all optimisers involve the notion of convergence.

"𝐗=𝐱" -> This vector notation appears here for the first time.

"From linear regression to logistic regression" -> For completeness, the material should introduce the cost function and how we optimise parameters.

Evaluating models

"self reported" -> "self-reported"

"The goodness-of-fit metrics alone are not enough for a full evaluation of the model. A fit can learn perfectly the data but can be overfitted, meaning that they have learned the pecularities (noise) of the dataset and will not make good predictions on future unseen samples (we touched on overfitting in the section Fitting a Model)." -> There ways to express goodness of fit which account for model complexity and which therefore attempt to address overfitting.

nbarlowATI commented 2 years ago

Comments on the first couple of sections - more to come next week.

Overview

I really like the introductory paragraph - gives a great, concise explanation of why we should be interested.

What and why of statistical modelling

Fitting Models

nbarlowATI commented 2 years ago

Some comments on the remaining sections, mostly typos:

4.3 Building a model

Evaluating models

lannelin commented 2 years ago

Module looks really good, nice work!

Have made a few notes in some spare time over last few days. Hope they're helpful. Tried not to replicate the above reviews. Some points are pretty subjective so won't be offended at all if you ignore them!

The What and Why of Statistical Modelling

Fitting (Regression) Models

4.3 Building a simple model

4.4 Evaluating models

General

callummole commented 2 years ago

Still some action to take on the reviews but merging into develop for the course start (15/11/21)