Answering the RQ with the European QoL data

We are tight on time, and much of the content creation for each module does not rely on the EQOL data.

However, one of the most important aspects of the course is the development of the research question (#15). Based on Aldabe et al. (2011), the RQ starts broad, is scoped in M1 (#25) with candidate variables selected in one of M1-M3, is explored with respect to the dataset in M3 (#24), and is then assessed via simple models in M4 (#30). This thread will make up some of the taught material and, importantly, the majority of the hands-on sessions.

Work is needed to get to know the dataset and how it can be used to answer the RQ. The code developed and insights gained will be very useful for both the taught material and for scoping appropriate hands-on tasks. This work can be done roughly in parallel with creating the taught content. The task consists of (but is not limited to):

[ ] retrieving and loading the dataset
[ ] understanding variable naming, recoding if needed
[ ] uncovering any issues with cleaning the dataset
[ ] selecting a subset of useful variables, with reference to Aldabe et al. (2011). Let us restrict ourselves to UK data to simplify things
[ ] some exploratory plotting and analysis of useful variables to assess relationships
[ ] a simple model that replicates some of the findings of Aldabe et al. (2011). For example, a logistic regression model that shoe-horns self-reported health into a binary measure (which is what Aldabe et al., 2011 do)
[ ] If all of the above is done we can think about exploring more complicated models (e.g. multi-level, or regression models that do not lose the original 5-point scale of self-reported health)
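
As a starting point for the cleaning and recoding items above, here is a minimal sketch in pandas. It runs on a tiny synthetic frame rather than the real survey file, and everything specific in it is an assumption to be checked against the codebook and the paper: the column names (`country`, `self_rated_health`, `age`), the use of negative values as missing-data codes, the direction of the 5-point scale, and the cut-off used to build the binary measure.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the survey extract.
# All column names and codes below are hypothetical, not taken from the real data.
raw = pd.DataFrame({
    "country": ["UK", "UK", "FR", "UK", "UK", "DE"],
    # Assumed coding: 1 = very good ... 5 = very bad, negative = missing/refused
    "self_rated_health": [1, 4, 2, -1, 3, 5],
    "age": [34, 71, 45, 52, 29, 63],
})


def prepare_uk_subset(df: pd.DataFrame, poor_from: int = 4) -> pd.DataFrame:
    """Keep UK rows, drop missing health responses, add a binary health flag."""
    uk = df.loc[df["country"] == "UK"].copy()

    # Treat non-positive codes as missing (assumption about the codebook).
    uk["self_rated_health"] = uk["self_rated_health"].where(
        uk["self_rated_health"] > 0, np.nan
    )
    uk = uk.dropna(subset=["self_rated_health"])

    # Collapse the 5-point scale into a binary outcome.
    # The cut-off (poor_from) is illustrative and should be matched to the paper.
    uk["poor_health"] = (uk["self_rated_health"] >= poor_from).astype(int)
    return uk.reset_index(drop=True)


uk = prepare_uk_subset(raw)
print(uk)
```

Keeping the cut-off as a parameter makes it easy to see how sensitive later results are to the choice of binarisation.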

It would be great to use pandas, numpy, matplotlib/seaborn, and scikit-learn. These are common packages that we will be using throughout other modules.
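
To illustrate how these packages might fit together for the simple-model item, here is a rough sketch of a logistic regression on a binary health outcome with scikit-learn. The data are simulated, the predictors (`age`, `income_quartile`) and the effect sizes are invented for illustration, and nothing here reproduces the variable selection or results of Aldabe et al. (2011).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Simulated predictors; names and ranges are hypothetical.
df = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "income_quartile": rng.integers(1, 5, size=n),
})

# Simulated binary outcome: poor health more likely with age,
# less likely with higher income (invented effect sizes).
log_odds = -3.0 + 0.05 * df["age"] - 0.4 * df["income_quartile"]
df["poor_health"] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

X = df[["age", "income_quartile"]]
y = df["poor_health"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Note: scikit-learn applies L2 regularisation by default, so the
# coefficients are slightly shrunk compared with an unpenalised fit.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

coefs = pd.Series(model.coef_[0], index=X.columns)
odds_ratios = np.exp(coefs)  # multiplicative change in odds per unit increase
accuracy = model.score(X_test, y_test)

print(odds_ratios)
print(f"Held-out accuracy: {accuracy:.2f}")
```

If the aim is to compare odds ratios with published ones, an unpenalised fit (or a package that reports standard errors) may be a better match than the scikit-learn defaults; that choice is left open here.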

In terms of workflow, please branch off develop.

It is also expected that the nature of the work will evolve as the taught material is developed and more knowledge is gained about the dataset.

alan-turing-institute / rds-course

Answering the RQ with the European QoL data #41