Outline of Module 4
Draft of Module 4 syllabus. Each sub-section has a dedicated issue. This is almost certainly too long, but we will trim when developing the sub-sections.
General notes for Module 4 can be found here.
1. Modelling: what is it, and why do it? (#33)
Attempt to define what modelling is and why researchers do it.
Modelling is the practice of capturing patterns in data mathematically, in the presence of uncertainty.
Describing the data-generating process mathematically.
Talk about the different sources of uncertainty that get in the way of recovering the true data-generating process.
The goal of modelling is to be able to predict future observations arising from some data-generating phenomenon. If we are able to do that well, we can say that we know something about that topic.
Inference is about learning something from data using models. We can use models to test hypotheses.
2. Building a simple model (#34)
Take a look at the research question as motivated by M1-3, and build the first simple model.
What do we already know about our data? (Recap of previous modules.) So what variables will be useful? (Discussion on this; probably a big list.)
Together, these variables are a massive interaction machine. We want to start simple and build up gradually. Feature extraction.
What are we predicting?
Self-reported health is an ordered categorical measure.
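Possible illustrative snippet for the taught material: encoding an ordered categorical outcome in pandas (the level labels here are illustrative assumptions, not the survey's actual scale):

```python
import pandas as pd

# Hypothetical self-reported health responses; labels are illustrative.
responses = pd.Series(["Good", "Fair", "Very good", "Poor", "Good"])

# An ordered categorical preserves the ranking of the levels,
# which matters for ordinal models and for sensible plotting.
health = pd.Categorical(
    responses,
    categories=["Poor", "Fair", "Good", "Very good", "Excellent"],
    ordered=True,
)
print(health.codes)  # integer codes respecting the ordering: [2, 1, 3, 0, 2]
```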
Introduce regression: predicting an outcome as a combination of other variables.
Most models are regression models of one form or another.
Briefly cover the mathematics and necessary assumptions when fitting a simple regression model.
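A minimal sketch of what fitting such a model could look like (simulated toy data; statsmodels is an assumed library choice, not necessarily what the module will use):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulate a toy data-generating process: outcome = 2 + 0.5 * x + noise.
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

# Ordinary least squares: predict y as a linear combination of predictors.
X = sm.add_constant(x)   # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.params)      # estimated intercept and slope
print(model.summary())
```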
3. Interpreting a model (#35)
Models only know about the world you build for them.
Interpretation varies depending on the model. For regression models you get coefficients on your inputs.
A coefficient is the contribution of that variable to the outcome, assuming you already know all the other variables.
This is important because the selection of other variables affects your coefficients: you can't interpret a coefficient in splendid isolation.
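A small simulated demonstration could make this concrete: the coefficient on x changes depending on whether a correlated variable z is in the model (a sketch, not module code):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000

# z drives both x and y, so the meaning of x's coefficient
# depends on whether z is in the model.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = x + 2.0 * z + rng.normal(size=n)

with_z = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
without_z = sm.OLS(y, sm.add_constant(x)).fit()

# With z included, x's coefficient is ~1.0; without z it inflates
# towards ~2.0 because x partly stands in for z.
print(with_z.params[1], without_z.params[1])
```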
Parameter uncertainty + errors.
Brief interlude on probability: Bayesian vs frequentist.
Visualisation helps you understand what the model is telling you. Ask yourself: what does the model think is going on in your data?
Explicitly test hypotheses. (Can we do inference here? Or do we need uncertainty/residuals from the next section?)
Prediction/simulation.
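A sketch of how hypothesis tests and prediction/simulation might look on the toy model above (hedged: details depend on the library and data we actually use):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Per-coefficient t-tests: one concrete form of hypothesis testing.
print(model.pvalues)

# Point predictions for new inputs...
x_new = np.linspace(-2, 2, 5)
preds = model.predict(sm.add_constant(x_new))

# ...and simulated outcomes that add back the residual noise:
# "what does the model think new data would look like?"
simulated = preds + rng.normal(scale=np.sqrt(model.scale), size=preds.shape)
print(preds, simulated)
```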
Aside: when interpretation goes wrong (case studies).
4. Validating a model (#36)
Ask yourself: how useful is my model at explaining patterns in the data? Is there variability/uncertainty in the data that my model does not capture well?
Trends vs uncertainty. In regression, the uncaptured uncertainty shows up as the residuals.
Reporting/quantifying uncertainty.
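A possible sketch for pulling residuals and coefficient uncertainty out of a fitted model (matplotlib assumed for the plot; same toy data as before):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(size=200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals: the part of the data the trend does not capture.
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="grey")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Quantifying uncertainty: confidence intervals on the coefficients.
print(model.conf_int())
```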
Brief interlude on probability: Bayesian vs frequentist.
Different sources of uncertainty: measurement uncertainty, fitting uncertainty. (In multilevel models you also have modelled uncertainty, e.g. random effects.) (We would have foreshadowed this in the intro.)
We want to learn something general. Fitting is easy, prediction is hard. This important notion underlies most model evaluation.
Overfitting: model complexity vs out-of-sample prediction (variance; regularisation).
Underfitting: not enough useful information (bias).
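One concrete way to show this is a polynomial-degree experiment; below is a sketch assuming scikit-learn, with purely illustrative data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

# Low degree underfits (bias); high degree overfits (variance):
# training error keeps falling while held-out error turns back up.
for degree in [1, 3, 15]:
    fit = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, fit.predict(x_tr)),
          mean_squared_error(y_te, fit.predict(x_te)))
```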
Carry over M3's visuals and underlying data.
But how do we assess out-of-sample error concretely?
Cross-validation.
Simulations (do they qualitatively match your data?).
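For cross-validation specifically, scikit-learn's cross_val_score is one option (a sketch on toy data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
x = rng.normal(size=(200, 1))
y = 2.0 + 0.5 * x.ravel() + rng.normal(size=200)

# 5-fold cross-validation: each fold is held out once and predicted
# by a model trained on the other folds, approximating out-of-sample error.
scores = cross_val_score(LinearRegression(), x, y,
                         cv=5, scoring="neg_mean_squared_error")
print(-scores.mean())  # average held-out mean squared error
```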
5. Improving a model (#37)
How do we adapt our model to explain more of the variability in the data?
Do we give it more information? (e.g. another variable, more data to train on...)
Or do we change the structure of the model? Increase the complexity...
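A small sketch of comparing a model with and without an extra variable (AIC here is just one possible criterion; the variables are placeholders):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

# More information: add a second variable and see if the fit improves
# enough to justify the extra parameter (lower AIC is better).
m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(m1.aic, m2.aic)
```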
You can always improve a model but there are real-world considerations: time, expense, expertise.
Improvement isn't a one-dimensional thing: higher precision, higher out-of-sample accuracy, better clarity of communication? Parsimony is often desirable (especially in theoretical models).
Practicalities: benchmarking, book-keeping & version control.
Models are always wrong. Model evaluation is about understanding why your model is wrong and whether the level of incorrectness is acceptable.
Real-world significance of models. Think about how your data is structured; remember these are real people.
We should not treat everyone the same. What if we applied our model to a different country?
Multilevel models?
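If we do go there, a minimal random-intercepts sketch with statsmodels might look like this (the country grouping and all names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
countries = np.repeat(["A", "B", "C", "D"], 100)
country_effect = {"A": -0.5, "B": 0.0, "C": 0.3, "D": 0.8}
x = rng.normal(size=400)
y = (1.0 + 0.5 * x
     + np.array([country_effect[c] for c in countries])
     + rng.normal(size=400))
df = pd.DataFrame({"y": y, "x": x, "country": countries})

# Random intercept per country: people in the same country share
# a baseline shift, so the model stops treating everyone the same.
model = smf.mixedlm("y ~ x", df, groups=df["country"]).fit()
print(model.summary())
```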
Hands-on session.
Three streams. Be super clear about what we want them to do, and provide a template for how to collaborate.
Example session: Build your own model
A good description of the model (inputs, outputs):
includes rationale
Reporting the results in any way you see fit.
includes interpretation
validation
Evaluating your model
what have you learnt?
what are the limitations?
how could it be improved further?
You can interleave visuals and text in any way you see fit.
Teaching Duration
There is quite a lot of material to cover. We may find that we need more of the time for the taught material (4-5 hrs), leaving less time for the hands-on session.
Time to write this module.
I think realistically each module is at least a day's work. An achievable target may be 2 weeks at 2 FTE.