Meta-issue: Module 1 (hands-on)

Outline of Module 1 (hands-on material):

Research question

Understand the associations between SES/material circumstances and health using the EQLS dataset (a survey micro-dataset). The research question can start out broad; in this module we aim to narrow it down and define it better, then develop it across the other modules (#15).

Dataset

Found here

Hands-on tasks:

Discuss and document the following in groups:
- As researchers, we break down the question into a simplified series of questions, for example:
- What is SES/material circumstances?
- What is health? What is the outcome measure of our model?
- How do we establish an association? What are the different ways we can think about the problem and what theories can be used? Maybe provide some pointers?
- What does the dataset contain and how can we use it to answer the question?
- How do we translate the question to a data science task?
- What is a good MVP? What is our performance metric?
- What is the purpose of doing it and how will it be used?
- How can we challenge the question and dataset? What is missing/controversial/biased, and what are the EDI concerns? Multiple EDI issues are embedded in the problem for attendees to point out and discuss.
Set up a GitHub repo and use it to document the conversation and outcome of the activity. Learn how to use GitHub as the basis of collaborative work.

Resources

Tools

GitHub
HackMD
some post-it collaborative tool?
Slack

Useful books/references:

Connection to other modules

Stages to answer the question through the course:

Exploratory visuals to get familiar with the dataset. This will be done in M3, with the chosen examples neatly foreshadowing the relationships we examine in M4.
A predictive model is built to understand whether an initial set of variables can predict the outcome. This is a typical but imperfect way to understand whether there are associations.
More variables are added and/or a different model is used.
The model may predict well in one country but fail in others, so we discuss methods to address that (possibly multilevel modeling).
A good predictive model does not necessarily answer the question of which variables are associated. We then discuss other steps to improve our answer.
We don't pretend to answer the question, since we don't have time for it. Rather, the last part of M4 will discuss future directions and point the students towards how they could build upon what we have done.
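
The stage where the model predicts in one country but may fail in others can be checked by breaking the performance metric down per country. The following is a minimal sketch, not the course's implementation. It assumes hypothetical records with a `country` field, a true binary outcome `y_true` and a model prediction `y_pred`; the toy records are made up for illustration and are not EQLS data.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key="country"):
    """Return {group: accuracy} for records that carry 'y_true' and 'y_pred'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[group_key]
        total[group] += 1
        correct[group] += int(record["y_true"] == record["y_pred"])
    return {group: correct[group] / total[group] for group in total}


# Made-up records (hypothetical countries "A" and "B"), NOT EQLS data.
toy = [
    {"country": "A", "y_true": 1, "y_pred": 1},
    {"country": "A", "y_true": 0, "y_pred": 0},
    {"country": "A", "y_true": 1, "y_pred": 1},
    {"country": "A", "y_true": 0, "y_pred": 0},
    {"country": "B", "y_true": 1, "y_pred": 0},
    {"country": "B", "y_true": 0, "y_pred": 0},
    {"country": "B", "y_true": 1, "y_pred": 0},
    {"country": "B", "y_true": 0, "y_pred": 1},
]

# A single pooled number can hide large differences between countries.
overall = sum(r["y_true"] == r["y_pred"] for r in toy) / len(toy)
per_country = accuracy_by_group(toy)

print("overall:", overall)
print("per country:", per_country)
```

A pooled score that looks acceptable while one country scores much lower is the kind of result that motivates the discussion of methods such as multilevel modeling.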

Duration of the session

4 hours, including two 10-minute breaks and one 30-minute break

Intro: 5 minutes
Phase 1: 20 mins setup (in groups), 35 mins collaborative activity (exploration of materials and discussion, in groups)
Phase 2: 40 mins collaborative activity (scoping, in groups), 20 mins presentation (all together)
Phase 3: 40 mins collaborative activity (EDI discussion, in groups), 20 mins presentation (all together)

Time to write this module

Steps to do:

Go through the received instructions (they will be a short proposal from a PI on a project idea, together with a dataset)
Set up a GitHub repo for each group (manage access rights, prepare a project board)
Conduct the initial scoping (we should consider whether one of us could be there acting as the PI); capture all answers to the scoping questions in dedicated issues; check the license of the dataset
As part of the scoping, open a dedicated branch, load the dataset and explore it, decide how and where to store it, and review the PR for merging the branch
When scoping is done, have an open group discussion around the main points
Each group then discusses some starting ethical questions and may add others
Final open discussion with everyone
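
For the "load the dataset and explore it" step above, a first pass could count missing cells and distinct values per column. This is a minimal sketch using only the Python standard library. The column names, the toy CSV content and the missing-value markers (`""` and `"NA"`) are assumptions for illustration; the real EQLS file layout may differ.

```python
import csv
import io


def summarise_columns(rows, missing_markers=("", "NA")):
    """Per column: count of missing cells and count of distinct non-missing values."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            stats = summary.setdefault(column, {"missing": 0, "values": set()})
            # csv.DictReader yields None for cells absent from a short row.
            if value is None or value.strip() in missing_markers:
                stats["missing"] += 1
            else:
                stats["values"].add(value.strip())
    return {
        column: {"missing": stats["missing"], "distinct": len(stats["values"])}
        for column, stats in summary.items()
    }


# Toy stand-in for the dataset file; column names are hypothetical.
toy_csv = io.StringIO(
    "country,age,self_rated_health\n"
    "A,34,good\n"
    "A,,fair\n"
    "B,51,NA\n"
    "B,47,good\n"
)

rows = list(csv.DictReader(toy_csv))
report = summarise_columns(rows)
for column, stats in report.items():
    print(column, stats)
```

To run it on a real file, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="")`, where `path` depends on where the group decides to store the data.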

Initial drafted ethical questions:

Is the dataset biased in any way?
Are there dangerous ways in which it could be used (both data and method)?
Variable definitions? <-- binary classification of mental health, issues with this?
What don't we know about the people involved?
Losing information about variables
Are you informed about what is missing? <-- gender for instance? nationality? adults alone? why do we exclude certain people?
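
To make the question about binary classification and losing information concrete, here is a small illustrative sketch. The 0 to 100 score, the cut-off values and the individual scores are all hypothetical and chosen only for the example; they do not describe how any EQLS variable is defined.

```python
def binarise(score, threshold):
    """Return 1 if the score falls below the threshold (flagged), else 0."""
    return int(score < threshold)


# Hypothetical wellbeing scores on a 0-100 scale (made up for illustration).
scores = [12, 48, 49, 50, 51, 88]

# With a cut-off of 50, scores 49 and 50 differ by one point but get different
# labels, while 12 and 49 differ by 37 points and share the same label.
labels = [binarise(s, 50) for s in scores]

# The number of people flagged also depends heavily on where the cut-off sits.
flagged = {t: sum(binarise(s, t) for s in scores) for t in (45, 50, 55)}

print("labels at threshold 50:", labels)
print("flagged per threshold:", flagged)
```

Groups could use a comparison like this to discuss who ends up on which side of a cut-off and who decided where the cut-off sits.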

Ideal timeline:
Go through instructions, set up GitHub repo in groups (access rights, project board) <-- 20/30 mins
Initial scoping in groups (discuss questions, look at the dataset, have it in a branch, PR, check license) <-- 1h
Break <-- 10 mins
Conversation with the whole group <-- 30 mins
Discuss ethical questions in groups <-- 30 mins
Final open discussion <-- 30 mins

alan-turing-institute / rds-course