alan-turing-institute / rds-course

Materials for Turing's Research Data Science course
https://alan-turing-institute.github.io/rds-course/

Find main dataset #1

Closed: gmingas closed this issue 3 years ago

gmingas commented 3 years ago

Candidates:

gmingas commented 3 years ago

Initial discussion prioritised the COMPAS dataset, with living standards/census data as a second choice. We have not yet examined QUIPP, other ProPublica data, Biobank, Born in Bradford or the Turing data stories.

We are in touch with EAG to request initial feedback on the COMPAS dataset, specifically around ethics, legality and potential reputational damage. We have also contacted ProPublica to clarify the licensing of the dataset.

Martin's feedback was that the publicly available material should explain the context of what we are trying to do, and that he would like to have Kirstie's opinion. He also suggested that consented, public health or genetics data, e.g. UK Biobank, might be a good alternative.

gmingas commented 3 years ago

A couple of other ideas for datasets:

gmingas commented 3 years ago

Discussion with Turing Commons (see #7): it was proposed that we could create an artificial dataset with specified biases, correlations, etc., based on an understanding of a domain (e.g. healthcare or criminal justice) and built in collaboration with domain experts. This would let us introduce the characteristics we want, but it would not be a real-world dataset and might take some time to build.

If we decide to do this, a possible tool for generating healthcare data is Synthea.
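As a rough illustration of what "an artificial dataset with specified biases and correlations" could mean in practice, here is a minimal sketch in plain Python. It does not use Synthea, and everything in it is an assumption made for illustration: the field names (`group`, `ses`, `health`), the parameters (`effect`, `group_bias`) and the linear-plus-noise model are hypothetical, not a design anyone in this thread agreed on.

```python
import random


def make_synthetic_records(n, effect=0.6, group_bias=-0.5, seed=0):
    """Generate n toy records with a built-in correlation and a built-in bias.

    'health' rises with 'ses' (strength set by `effect`), and records in
    group "B" are shifted by `group_bias`. All names are hypothetical.
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    records = []
    for _ in range(n):
        group = rng.choice(["A", "B"])
        ses = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        shift = group_bias if group == "B" else 0.0
        records.append({"group": group, "ses": ses,
                        "health": effect * ses + shift + noise})
    return records


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def mean_health_by_group(records):
    """Average 'health' per group, to check the injected bias is visible."""
    totals = {}
    for rec in records:
        total, count = totals.get(rec["group"], (0.0, 0))
        totals[rec["group"]] = (total + rec["health"], count + 1)
    return {g: total / count for g, (total, count) in totals.items()}


if __name__ == "__main__":
    data = make_synthetic_records(5000)
    print(pearson([r["ses"] for r in data], [r["health"] for r in data]))
    print(mean_health_by_group(data))
```

The point of the sketch is that students could later be asked to recover `effect` and `group_bias` from the data, since the ground truth is known. A realistic version would need domain experts to decide which variables and dependencies to include.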

gmingas commented 3 years ago

Some open/safeguarded datasets are available from the UK Data Service (some of them are designed for teaching and include guides, documentation and in some cases possible questions for students to answer using the data).

Safeguarded datasets would require every student to accept the terms of this agreement and then download the dataset manually. The instructor would also need to create a project on the UK Data Service website and explain how the dataset will be used. We cannot redistribute safeguarded data (e.g. put it in a repository). Open datasets are released under the Open Government Licence (see here), which would allow us to redistribute, edit and otherwise reuse the data.

List of most interesting datasets for the RDS course (look at the documentation tabs in each link for details):

gmingas commented 3 years ago

Another interesting open dataset, which offers opportunities for visualisation work and raises some interesting questions about ethics and data privacy:

NYPD Stop, Question and Frisk data (link):

gmingas commented 3 years ago

Summary of large-scale survey datasets found so far:

Demographic and Health Survey (DHS) (safeguarded and open synthetic dataset)

MICS surveys (list)

Living Standards Measurement Study Surveys (example)

gmingas commented 3 years ago

Quick summary:

The COMPAS dataset might be tricky to use because we would need to engage with many stakeholders inside and outside the Turing to make sure that it is legal, ethical and not distracting or risky to use in the course. The same might apply to the Stop and Frisk data, despite its nice characteristics.

Given that, I think some of the UK Data Service (UKDS) datasets are good choices. The European Quality of Life Time Series stands out: it is open, can be republished on GitHub, is rich, and already has research questions explored in existing publications. The British Crime Survey, the National Survey of Sexual Attitudes and the British Cohort studies are also good candidates, but they would offer maximum value only in their safeguarded versions, which add some overhead before they can be used in the course. On the positive side, the UKDS has infrastructure and a process in place for giving access to these datasets for teaching (see here).

Alternatively, the various large-scale surveys conducted by international organisations are also rich enough for our purposes and can pose interesting questions. However, they have access overheads similar to those of the UKDS safeguarded datasets, without any provision for teaching use. The model datasets offered by DHS are an exception: they are open to access and use in any way and are apparently realistic (they seem to be some form of synthetic dataset).

Finally, the option preferred by the Turing Commons team is to create our own synthetic dataset (details here). This might be less attractive for a course that aims to simulate a real-world data science project, and it would need extra preparation work.

gmingas commented 3 years ago

In today's discussion we decided to use the European Quality of Life Time Series (2007 and 2011) because of its rich content, the many interesting research questions it supports and its open access.

In terms of the main research question, we want to pick one that:

Some initial ideas:

gmingas commented 3 years ago

We discussed the above questions with Chris Burr, who thought they might be a good start. We focused in particular on question 3, which links socioeconomic status (SES) and self-reported health, a widely researched topic (question 2 is closely related).

Chris sent us this study, which is very useful for starting a discussion about how SES/education and health are connected, which causal relationships are accepted and which are controversial, etc. See Figure 4 in particular for models of those relationships.

Also, this article might support the same discussion.
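A first descriptive look at the SES and self-reported health question could be as simple as comparing the share of respondents reporting good health across education levels. The sketch below uses made-up toy responses, not survey data, and the names and coding are assumptions: `share_good_health` is a hypothetical helper, and the 1 to 5 health scale with 1 as the best rating is assumed here and would need checking against the survey documentation.

```python
from collections import defaultdict

# Made-up toy responses: (education level, self-reported health rating).
# Assumed coding: 1 = very good ... 5 = very bad. NOT taken from the survey.
toy_responses = [
    ("primary", 4), ("primary", 3), ("primary", 2),
    ("secondary", 3), ("secondary", 2), ("secondary", 2),
    ("tertiary", 2), ("tertiary", 1), ("tertiary", 1),
]


def share_good_health(responses, good_threshold=2):
    """Return, per education level, the share of ratings <= good_threshold."""
    counts = defaultdict(lambda: [0, 0])  # level -> [good, total]
    for level, rating in responses:
        counts[level][1] += 1
        if rating <= good_threshold:
            counts[level][0] += 1
    return {level: good / total for level, (good, total) in counts.items()}


if __name__ == "__main__":
    for level, share in share_good_health(toy_responses).items():
        print(f"{level}: {share:.2f}")
```

A gradient like this is only an association. The causal models in the study above are what would guide which confounders to consider before reading anything more into it.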