Find main dataset - Githubissues

Candidates:

COMPAS dataset
Population micro-data, e.g. living standards surveys, census - example
Other ProPublica data
Parliamentary speeches
QUIPP list of datasets
Turing data stories datasets
UK Biobank

Initial discussion prioritised the COMPAS dataset with living standards/census as a second choice. We still have not examined QUIPP, other ProPublica data, Biobank, Born in Bradford and Turing data stories.

We are in touch with EAG to request initial feedback on the COMPAS dataset, specifically around ethics, legality and potential reputational damage. We have also contacted ProPublica to clarify licensing for the dataset.

Martin's feedback was that we should make sure the publicly available material explains the context of what we are trying to do, and that he would like to have Kirstie's opinion. He also said that some consented, public health or genetics data might be a good alternative, e.g. UK Biobank.

A couple of other ideas for datasets:

WHO Global School-Based Student Health Survey
- Standardised global survey with 140+ fields, completely open for several countries (e.g. here). Datasets contain some demographic and characteristics data (age, sex, height, weight), questions about eating habits, personal hygiene, violent behaviour, bullying, smoking, alcohol, sexual behaviour, parents' involvement, drug use, school attendance, hunger, mental health. Also available are the Global Adult/Youth Tobacco Surveys.
- E.g. Indonesian data have ~11,000 rows x 140+ columns
- Data is open but redistribution is not allowed.
- Possible research questions:
- What factors predict violent behaviour, drug use or alcohol use (e.g. parent behaviour, going to school, eating habits)?
Chilean study
- Data/code from this recent paper that connects socio-economic status and COVID incidence/mortality in Chile (paper, code+data)
- Probably not ideal for the ML modules as it has area-level data (not individual level) so few rows in total. But could be used in the taught parts for various purposes (demonstration of good visualisations, how data science can be used to expose inequalities, etc)

Discussion with Turing Commons (see #7): it was proposed that we could create an artificial dataset with specified biases, correlations, etc., based on an understanding of an area (e.g. healthcare, criminal justice), in collaboration with domain experts. This would allow us to introduce the types of characteristics we want, but it would not be a real-world dataset and might take some time to build.

If we decide to do this, a possible tool to use for healthcare data is Synthea.
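
To make the artificial-dataset idea a bit more concrete, here is a minimal sketch (not Synthea, and not a proposal for the actual course data) of how a known bias could be injected into generated records. The group labels, the `risk` and `flagged` fields and every number are made up for illustration; a real version would be designed with domain experts.

```python
import random


def make_biased_dataset(n_rows=5000, bias=0.15, seed=0):
    """Generate toy records where group "B" is flagged more often than
    group "A" at the same underlying risk. All values are illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        group = rng.choice(["A", "B"])
        risk = rng.random()  # underlying risk, uniform on [0, 1)
        # Flag probability grows with risk, plus an injected penalty for group B
        p_flag = 0.2 + 0.5 * risk + (bias if group == "B" else 0.0)
        flagged = rng.random() < min(p_flag, 1.0)
        rows.append({"group": group, "risk": risk, "flagged": flagged})
    return rows


def flag_rate(rows, group):
    """Share of records in `group` that were flagged."""
    selected = [r for r in rows if r["group"] == group]
    return sum(r["flagged"] for r in selected) / len(selected)


data = make_biased_dataset()
gap = flag_rate(data, "B") - flag_rate(data, "A")
print(f"flag-rate gap (B - A): {gap:.3f}")
```

Because the bias is specified up front, students could check whether their analysis recovers a gap close to the `bias` parameter.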

Some open/safeguarded datasets are available from the UK Data Service (some of them are designed for teaching and include guides, documentation and in some cases possible questions for students to answer using the data).

Safeguarded datasets would require all students to accept the terms of this agreement and then download the dataset manually. The instructor would also need to create a project on the UK Data Service website and explain how the dataset will be used. We cannot redistribute the data (e.g. put them in a repository). Open datasets are under the Open Government Licence (see here) and would allow us to redistribute, edit, etc. the data.

List of most interesting datasets for the RDS course (look at the documentation tabs in each link for details):

Quarterly Labour Force Survey (safeguarded version and open version)
- Individual level data, contains variables relating to socio-demographics, employment, housing tenure, education and health (e.g. Age, Age of youngest child, Year of arrival in UK, Gross weekly pay, Gross hourly pay, Year started with current employer, Total usual hours, Age when completed education, bad health).
- ~55,000 rows x 53 columns (safeguarded) and ~22,000 rows x 13 columns (open, which includes far fewer variables about employment, pay, etc.).
- Possible research questions:
- Can we predict pay/quality of employment from socio-demographics and other variables (e.g. bad health, some unpaid work in parallel)?
- Which people are most likely to be unemployed/not looking for work?
British Crime Survey (safeguarded version and open version)
- Individual level survey data, contains demographic and socio-economic details, attitudes to the Criminal Justice System, fear of crime, victimization and antisocial behaviour (e.g. Age, gross income in the household, worry about being victim of a crime, effectiveness of the Criminal Justice System (CJS), Fairness of the CJS, Confidence in Police, Antisocial behaviour in the area, experiences, attitudes).
- ~35,300 rows x 127 columns (safeguarded) and ~11,600 rows x 35 columns (open, with far fewer replies to survey questions; it mostly includes replies to questions about worries).
- Possible research questions:
- Can we predict confidence in police from age, sex, ethnicity, socio-economics, anti-social behaviour in neighbourhood, worry about crime, etc? (see here)
- Can we predict attitudes towards the CJS from various socio-economic factors and experiences/opinions?
- Which people are most worried about crime?
National Survey of Sexual Attitudes and Lifestyles (safeguarded and open)
- Individual level survey data containing family background, sources of sex education, methods and sources of contraception, sexual attraction and sex history, fertility, alcohol, smoking and other drug use, relationships including relationship status and happiness, attitudes towards sexual lifestyles and behaviours such as adultery, same sex relationships and sex in the media and many demographic variables.
- ~15,000 rows x 130 columns (safeguarded) and ~3,800 rows x 20 columns (open).
- Possible research questions (e.g. check this):
- Can we predict attitudes to sex from various socio-economic, demographic, employment, religious, etc factors?
- How do we learn about sex? Does this vary across the generations and depending on family or socioeconomic background?
- Factors relating to age at which individuals have their first child
- Factors affecting depression score
Health Survey for England (safeguarded)
- The Health Survey for England (HSE) series is designed to monitor trends in the nation's health. This year's data focus on cardiovascular disease and contain hierarchies (households, area, etc), BMI variables, key household and individual socio-economic characteristics, various variables about health and habits.
- ~36,300 rows x 35 columns
- Possible research questions:
- Risk factors associated with high systolic blood pressure (e.g. see here)
- Factors that predict obesity (example)
British Cohort Studies Teaching Dataset for Higher Education (safeguarded)
- Contains longitudinal data from a number of individuals monitored as children and as adults. Includes socio-economic information: parental education, family social class in childhood, and cohort members’ own education, employment and occupation experiences. Also contains mental health measurements and behavioural and cognitive indicators.
- ~17,500 rows x 45 columns
- Possible research questions:
- Which factors from childhood impact mental health, cognitive development?
Workplace Employment Relations Survey (safeguarded)
- Employee-level and employer-level data that includes variables such as:
- For employees: Working hours, job influence, job satisfaction, working arrangements, training and skills, information and consultation, employee representation, pay, workforce composition
- For employers: Management of personnel and employment relations, recruitment and training, workplace flexibility and the organisation of work, consultation and information, employee representation, payment systems and pay determination, grievance, disciplinary and dispute procedures, equal opportunities, work-life balance, workplace performance. Examples: How satisfied are you with your job security? How many competitors do you have for your (main) product or service?
- ~22,451 rows x 125 columns (employees) and ~2,300 rows x 250 columns (employers)
- Possible research questions:
- Factors connected to job satisfaction (both personal and employer-related)
Understanding Society, Wave 3, 2011-2012 (safeguarded)
- A multi-topic household survey; the purpose of Understanding Society is to understand social and economic change in Britain at the household and individual levels. It contains variables on citizenship and national identity, family, family networks and relationship with partner, local neighbourhood, harassment, social networks, groups and organisations, politics, news and media use, health and disability, life satisfaction, personality (Big 5) and cognitive ability, employment, income and benefits, discrimination (at work), and household-level variables.
- ~46,000 rows x 700 columns
- Possible research questions:
- Which factors predict the general wellbeing of the UK population, cognitive ability, politics?
European Quality of Life Time Series, 2007 and 2011 (open version)
- The EQLS is a unique, pan-European survey that examines both the objective circumstances of European citizens' lives and how they feel about those circumstances and their lives in general. It looks at a range of issues, such as employment, income, education, housing, family, health and work-life balance. It also looks at subjective topics, such as people's levels of happiness, how satisfied they are with their lives, and how they perceive the quality of their societies.
- 79,270 rows x 195 variables (includes both waves in 2007 and 2011)
- Possible research questions (check this report):
- Which material circumstances and daily life experiences are connected to life satisfaction and well-being/health?
- How are social attitudes etc predicted by material circumstances and experience, e.g. in relation to housing quality or access to facilities and culture?
- Factors influencing income
- Differences between European countries? What patterns are visible?
Audit of Political Engagement (open)
- Rich individual survey with a large number of political attitudes questions (engagement, participation, attitudes to parliament, referendum, party politics, etc) and many demographics, socio-economic, material and other variables (e.g. newspaper of choice, owns TV and other devices, area, household characteristics)
- 1,771 rows x many columns
- Possible research questions:
- Probably not good for main dataset but could be used in taught sessions, asking questions like which factors drive certain attitudes and political participation.

Another interesting dataset gives opportunities for visualisation work and can also pose some interesting questions about ethics and data privacy. It is open:

NYPD Stop, Question and Frisk data (link):

Data collected by New York Police Department officers during stop, question and frisk (SQF) encounters. Contains information on the officer's reasons for initiating a stop, officer rank, whether the stop led to a summons or arrest, demographic information, characteristics and description of the person stopped, suspected criminal behaviour, location and time.
Not clear if data can be redistributed in a repo but access is open.
~36,000 rows x 83 columns for the last 3 years
Possible research questions (check this study which is based on the same data):
- Are stop and frisks racially biased (maybe combined with other demographic data from NYC)?
- Which factors predict particularly intrusive questioning and frisking or other outcomes (e.g. arrest)?

Summary of large-scale survey datasets found so far:

Demographic and Health Survey (DHS) (safeguarded and open synthetic dataset)

Individual level and household level surveys from many countries. These are standardised and contain a lot of questions regarding housing characteristics, demographics, education, employment, fertility, marriage and sexual activity, family planning, child mortality, child health and nutrition, access to health services, women empowerment.
The data are freely available but require registration and a description of the project (i.e. safeguarded). They cannot be redistributed or shared with people who have not been approved, so all students would have to go through the process. There is a synthetic version called "Model dataset", which contains realistic data not drawn from any particular country and is completely open to use and redistribute.
Sizes vary. E.g. the Bangladesh one is 20,000 rows with many columns.
Possible research questions could be similar to other survey data mentioned above, e.g. factors connected to fertility or other choices.

MICS surveys (list)

These are survey data which focus more on the situation of women and children in various countries. They are individual and household level. They contain questions about child mortality, nutrition, breastfeeding, vaccinations, various types of disease, water and sanitation, reproductive health, child development, literacy and education, child protection, sexual behaviour, access to mass media and ICT, subjective well-being, tobacco and alcohol use.
Sizes vary by country, e.g. Nigeria has ~35,000 rows and many columns.
Access is free but with similar restrictions to DHS, i.e. safeguarded with account creation, and not redistributable or shareable.
Possible research questions:
- Similar to European Quality of Life time series, e.g. which factors predict life satisfaction and health, also more focus on questions around sexual health, birth control, family planning.

Living Standards Measurement Study Surveys (example)

Very similar to MICS and DHS in terms of content
Sizes vary but are large enough.
Similar research questions.
WHO Global School-Based Student Health Survey
- Standardised global survey with 140+ fields, open access for several countries (e.g. here). Datasets contain some demographic and characteristics data (age, sex, height, weight), questions about eating habits, personal hygiene, violent behaviour, bullying, smoking, alcohol, sexual behaviour, parents' involvement, drug use, school attendance, hunger, mental health. Also available are the Global Adult/Youth Tobacco Surveys.
- E.g. Indonesian data have ~11,000 rows x 140+ columns
- Data is open but redistribution is not allowed.
- Possible research questions:
- What factors predict violent behaviour, drug use or alcohol use (e.g. parent behaviour, going to school, eating habits)?

Quick summary:

The COMPAS dataset might be a bit tricky to use because we would need to engage with a lot of stakeholders inside and outside the Turing to make sure that it is legal, ethical and not distracting/risky to use for the course. The same might apply to the Stop and Frisk data despite their nice characteristics.

Given that, I think some of the UK Data Service datasets are good choices, especially the European Quality of Life Time Series, which is open and can be republished on GitHub, and which is rich and already has research questions in existing publications. The British Crime Survey, National Survey of Sexual Attitudes and Lifestyles and British Cohort Studies are also nice but would have maximum value only if we use the safeguarded versions (which require a bit of overhead to allow them to be used in the course). A positive is that the UKDS has some infrastructure and process in place for giving access to these for teaching (see here).

Alternatively, the various large-scale survey data conducted by international organisations are also rich enough for our purposes and can pose interesting questions but they have similar issues with overhead for access like the UKDS (but without any provision for using for teaching). The model datasets offered by DHS are an exception as they are open to access and use in any way and apparently realistic (seems they are some form of synthetic dataset).

Finally, an option preferred by the Turing Commons team is to create our own synthetic dataset (details here) but it might not be as attractive for this course which wants to simulate a real-world data science project, plus it will need some extra work to prepare.

In the discussion today we decided to use the European Quality of Life Time Series (2007 and 2011) due to its rich content, many options of interesting research questions and open access.

In terms of the main research question, we want to pick one that:

Is representative of what actual researchers do with this data.
Ideally has some connection to policy or some type of decision making.
Allows us to teach the types of models and data science techniques that we want in Modules 3 and 4 (especially in the predictive modeling part) and to have an interesting and challenging hands-on session in Module 1 (which would involve discussing and refining the research question).

Some initial ideas:

1. Which segments of the population in the EU (or in each country) were more negatively affected by the financial crisis in 2008? The dataset's two waves in 2007 and 2011 let us explore this question. We could use different techniques in the modeling modules to answer it, e.g. predictive models that have different capacity, complexity and explainability (logistic regression, random forest, etc).
2. What is the relationship between labour market exclusion/insecurity and different measures of health and wellbeing? (similar to this). This would allow us to teach logistic/linear regression and other models but maybe also discuss causality briefly. It has numerous connections with policy decisions.
3. This study explores associations between socio-economic status (SES), measured using occupation, and self-reported health, using the same data. It examines the contribution of various material, occupational and psychosocial factors to social inequalities in health in Europe. It uses multilevel logistic regression. Policy connection could be which interventions might result in a reduction in social inequalities in health.
4. This study uses the same data to explore quality of life in the context of housing conditions. It reveals important differences in housing conditions across European countries, in particular, the basic divide running between the ‘old’ EU15 Member States and the 10 new Member States, along with Bulgaria, Romania and Turkey. It finds that, in addition to living space and standard of accommodation, quality of life is largely dependent on factors such as personal safety, proximity to local infrastructure and the quality of the environment such as clear water, clean air and green areas. Applies various regression models and has a lot of connections to policy.
5. Which neighbourhood characteristics (e.g. green space) can reduce socioeconomic health inequalities? This study uses the dataset we are using and employs multilevel regression in the analysis, which allowed clustering within regions and countries. It could be connected to policy decisions about how to modify public space.
Overview Publication
Publications using the dataset and other datasets from the same source here
@fedenanni to consult with social scientist collaborators about potential uses and research questions for this dataset.
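
As a rough sketch of how the first idea might start, the snippet below compares the mean of an outcome between the two waves for each population segment. The records, the field names (`wave`, `segment`, `life_satisfaction`) and the values are invented stand-ins, not the actual EQLS variables; a real analysis would likely also need survey weights and a proper model.

```python
from collections import defaultdict
from statistics import mean

# Invented toy rows; the real dataset uses its own variable names and coding.
records = [
    {"wave": 2007, "segment": "unemployed", "life_satisfaction": 6.0},
    {"wave": 2007, "segment": "unemployed", "life_satisfaction": 5.0},
    {"wave": 2011, "segment": "unemployed", "life_satisfaction": 4.0},
    {"wave": 2011, "segment": "unemployed", "life_satisfaction": 5.0},
    {"wave": 2007, "segment": "employed", "life_satisfaction": 7.0},
    {"wave": 2007, "segment": "employed", "life_satisfaction": 8.0},
    {"wave": 2011, "segment": "employed", "life_satisfaction": 7.0},
    {"wave": 2011, "segment": "employed", "life_satisfaction": 8.0},
    {"wave": 2007, "segment": "retired", "life_satisfaction": 7.0},
    {"wave": 2011, "segment": "retired", "life_satisfaction": 6.5},
]


def change_by_segment(rows, outcome="life_satisfaction", before=2007, after=2011):
    """Return {segment: mean(outcome in `after`) - mean(outcome in `before`)}."""
    values = defaultdict(lambda: defaultdict(list))
    for row in rows:
        values[row["segment"]][row["wave"]].append(row[outcome])
    changes = {}
    for segment, by_wave in values.items():
        # Skip segments that are missing from either wave
        if by_wave[before] and by_wave[after]:
            changes[segment] = mean(by_wave[after]) - mean(by_wave[before])
    return changes


# Segments sorted from the largest drop to the smallest
for segment, change in sorted(change_by_segment(records).items(), key=lambda kv: kv[1]):
    print(f"{segment}: {change:+.2f}")
```

A descriptive comparison like this could feed the Module 1 discussion about refining the question before any predictive model is fitted.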

We discussed the above questions with Chris Burr, who thought they might be a good start. We particularly focused on question 3, which links SES and self-reported health and is a widely researched topic (question 2 is not far from it either).
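
For the SES and self-reported health question, the study mentioned above uses multilevel logistic regression. A much simpler first look, sketched here with invented counts, is an unadjusted odds ratio from a 2x2 table; the function name and the numbers are illustrative only.

```python
def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio from a 2x2 table of counts."""
    if min(exposed_noncases, unexposed_cases) == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


# Invented counts: low vs high SES, poor vs good self-reported health
low_ses_poor, low_ses_good = 30, 70
high_ses_poor, high_ses_good = 15, 85

ratio = odds_ratio(low_ses_poor, low_ses_good, high_ses_poor, high_ses_good)
print(f"odds ratio (low vs high SES): {ratio:.2f}")
```

A value above 1 would mean poor self-reported health is more common in the low-SES group. Unlike the regression in the study, this does not adjust for confounders or for clustering within countries.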

Chris sent us this study, which is very useful for starting a discussion about how SES/education and health are connected, which causal relationships are accepted and which are controversial, etc. Look at figure 4 in particular for models of those relationships.

Also, this article might support the same discussion.

alan-turing-institute / rds-course

Find main dataset #1