alan-turing-institute / rds-course

Materials for Turing's Research Data Science course
https://alan-turing-institute.github.io/rds-course/

Hands on #48

Closed ChristinaLast closed 2 years ago

ChristinaLast commented 3 years ago

This PR addresses the following tasks:

Follow-up tasks

review-notebook-app[bot] commented 3 years ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.



crangelsmith commented 3 years ago

Hi @ChristinaLast, thank you for this. I have given it a quick read (today I'm focusing on my project); tomorrow morning I'll try to look at it in more detail. But I have a couple of questions:

ChristinaLast commented 3 years ago

Thanks, @crangelsmith! ✨ I am working on visualisation stuff tomorrow, and will definitely use that to refine the "generic" model pipeline here. And yes, I can add feature importance 📊 charts to the branch tomorrow.

crangelsmith commented 3 years ago

Hi @ChristinaLast, today I finally had some time to look at the notebook in more detail. I left some more comments in here, with some questions and other ideas that can help the reader understand a bit better. In general, I have three main comments/things we need to think about:

  1. There is a big imbalance in our classes: you chose to binarise the outcome variable as ["Very good", "Good", "Fair"] versus ["Bad", "Very Bad"]. That makes complete sense; the paper is vague, it just says "the variable was dichotomised as ‘good’ health versus ‘poor’ health" without mentioning the exact grouping. Either way, we need to be careful how we deal with the imbalance, because we risk only learning how to predict good health.

  2. In Module 4 we are also interested in discussing the strength of the predictors and their uncertainty. I mentioned looking at feature importance before, but I also think we want to look further at the coefficients. [Here is an example of how to extract them](https://stackoverflow.com/questions/57924484/finding-coefficients-for-logistic-regression-in-python).

  3. Could we replace the variable names with the variable labels on the data wrangling side? It will make it easier to understand later on when we look at coefficients and feature importance.
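
For point 2, a minimal sketch of pulling coefficients out of a fitted scikit-learn `LogisticRegression` (the feature names here are made up for illustration, not the real EQLS columns):

```python
# Sketch: pair each feature label with its fitted logistic regression
# coefficient. Hypothetical toy data, not the EQLS dataset.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "chronic_illness": rng.integers(0, 2, 200),  # illustrative names
    "social_contact": rng.normal(size=200),
})
y = rng.integers(0, 2, 200)  # binary "poor health" outcome

model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary outcome.
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs)
print("intercept:", model.intercept_[0])
```

Indexing `coef_[0]` by the DataFrame's column names keeps the coefficient-to-variable mapping explicit, which is what makes the later interpretation step readable.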

Thanks again, this is looking great and will make Module 4 so much easier to develop :)

ChristinaLast commented 3 years ago

Thanks @crangelsmith ✨ I have taken a look at the following comments:

  1. I have attempted to address the class imbalance by oversampling the "poor health" class. This did not result in a big improvement in model performance (accuracy and recall are still similar, with slight improvements in precision). I am not using the imputed values, which is probably why performance is worse.
  2. I developed some content on extracting the `coef_` and `intercept_` attributes from a model with only one predictor feature. Students could potentially substitute the features in the logistic regression model and then comment on the change in the coefficient and intercept.
  3. I will replace the variable names today.

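A rough sketch of the oversampling idea in point 1, using `sklearn.utils.resample` on toy data (column names are illustrative, not the real EQLS table):

```python
# Sketch: upsample the minority ("poor health") class with replacement
# until it matches the majority class size. Toy data for illustration.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "age": range(100),
    "poor_health": [1] * 10 + [0] * 90,  # 10% minority class
})
minority = df[df["poor_health"] == 1]
majority = df[df["poor_health"] == 0]

# Sample the minority class with replacement up to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["poor_health"].value_counts())  # both classes now 90
```

Note that any oversampling should happen after the train/test split, so duplicated minority rows never leak into the test set.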
callummole commented 3 years ago

Just a quick one on the binary SRH. The paper says:

"In general, would you say your health is …” and response categories were “excellent”, “very good”, “good”, “fair”, and “poor”. The variable was dichotomised as “good” health versus “poor” health (“fair” and “poor”)."

So good -> [excellent, very good, good] and poor -> [fair, poor], as Christina does. Interestingly, though, your dataset has different category names, and in your categorisation "fair" jumps the boundary (which I think is semantically appropriate).
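For concreteness, a quick sketch of the paper's dichotomisation in pandas (toy responses, not the actual EQLS column):

```python
# Sketch: dichotomise self-rated health as in the paper:
# good -> {excellent, very good, good}, poor -> {fair, poor}.
import pandas as pd

srh = pd.Series(["excellent", "very good", "good", "fair", "poor"])
good = {"excellent", "very good", "good"}
binary = srh.map(lambda r: "good" if r in good else "poor")
print(binary.tolist())  # ['good', 'good', 'good', 'poor', 'poor']
```

Swapping the contents of the `good` set is all it takes to reproduce Christina's alternative grouping, where "fair" sits on the good side.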

Will review the PR this morning.

crangelsmith commented 3 years ago

Hi @ChristinaLast! Thank you for this; it looks great and is a very thorough and well-documented analysis. I have some questions and 2 requests if you have enough time today:

crangelsmith commented 3 years ago

Also, another request @ChristinaLast: would it be possible/easy to modify the data imputer to do a stratified imputation by country and gender?
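Something like a group-wise median fill, sketched here with made-up column names (the real EQLS variables and the existing imputer's logic may differ):

```python
# Sketch: stratified imputation, filling missing values with the median
# within each country x gender group. Illustrative toy data.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["NL", "NL", "NL", "NL", "RO", "RO", "RO", "RO"],
    "gender":  ["F",  "F",  "M",  "M",  "F",  "F",  "M",  "M"],
    "income":  [30.0, np.nan, 40.0, 42.0, np.nan, 18.0, 20.0, np.nan],
})

# transform() keeps the original row order while computing the
# median separately for each (country, gender) stratum.
df["income"] = (df.groupby(["country", "gender"])["income"]
                  .transform(lambda s: s.fillna(s.median())))
print(df)
```

The appeal of stratifying is that a missing Romanian value gets filled from other Romanian respondents of the same gender, rather than from the pooled sample.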

callummole commented 3 years ago

@ChristinaLast.

Thanks again for all of this. It's really useful.

I think one issue we have at the moment is including so many variables in the model. We don't know to what extent these variables are correlated with one another (in which case there will be redundancy in the predictors), and I'm also not sure how different the scales are for each predictor. Both of these things currently make the coefficients difficult to interpret.
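Both checks are quick to sketch: a pairwise correlation matrix for the redundancy question, and standardisation so coefficient magnitudes become comparable (toy data with invented names, not the real predictors):

```python
# Sketch: (1) flag highly correlated predictor pairs, (2) standardise
# predictors to a common scale. Synthetic toy data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
base = rng.normal(size=300)
X = pd.DataFrame({
    "income": base * 10 + 50,                      # large scale
    "savings": base * 10 + rng.normal(size=300),   # nearly redundant with income
    "social_contact": rng.normal(size=300),        # independent
})

# 1. Redundancy check: look for pairs with |r| above ~0.8.
corr = X.corr().abs()
print(corr.round(2))

# 2. Standardise so a one-unit change means "one standard deviation"
#    for every predictor, making coefficients directly comparable.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

With predictors on a common scale, a larger absolute coefficient really does mean a stronger predictor, which is what Module 4 wants to discuss.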

I am going to do some investigation of the variables today in the pursuit of finding 5-10 more or less independent variables that we can use to simplify the model building and interpretation. This might change the final two notebooks a little (though I'll be using all your code!).

To help with this, it might make sense if you concentrated on adapting the imputation today? If you do that maybe we could have a chat about the variables later, but this will be more of the knowledge-sharing variety rather than getting you to do further work.

ChristinaLast commented 3 years ago

> Could you upload somewhere the output of the plotting scripts you made? I would like to see them.

I have adjusted the scripts so that, when run, they generate the plots and save them to a `plots` dir. I am adding the plots to this comment.

> Do you understand why 'EQLS_Wave' shows as a good predictor? I thought we were only doing Wave 3, so this variable should be constant?

I am not sure about this either, so I removed it from the model.

> So the features with the highest predictive power seem to be health-related ones, which is not a surprise, given that if you have a chronic illness you are more likely to report bad health. The issue here is that we are not really learning anything new, because these features are dominating the model. If you have enough time today, could you run the model again removing all of the health-related features? (In the paper they focus more on occupational, family and socio-economic variables.)

Yes, I removed the indirect health-related variables from the modelling. Now there is an interesting difference between the Netherlands and Romania results: Romanian self-reported health depends more on social time and familial contact, whereas Dutch self-reported health depends more on job security and finances. Both have availability of care during old age as an important factor.

Plots attached: eqls_box_plot, eqls_condition_plot, eqls_count_plot, eqls_missingness_plot, eqls_ridgeline_plot