RUMgroup / brexit_data_challenge

R and Python code used in the analysis for the shortlisted R Uni of Manchester & MMU R User Group submission to the CDRC GISRUK 2018 Brexit Data Challenge

Best regressions? #10

Open cmcaine opened 6 years ago

cmcaine commented 6 years ago

These are the best regressions I've found so far for England and for the whole of the UK. The years passed to change_since were chosen by a for loop; details are in the source, but it's pretty simple.

England benefits from IMD data slightly, but also just from excluding the other areas:

# ipython -i brellenge_prep.py

# bestEng = smf.ols('Pct_Remain ~ Q("White British") + Q("White Other") + Asian + Black + Other + y2015_WBR + IMD', data=change_since(2016, 2001).join(metadata))
# bestUK = smf.ols('Pct_Remain ~ Q("White British") + Q("White Other") + Asian + Black + Other + y2015_WBR', data=change_since(2017, 2003).join(metadata))

In [1]: bestEng.fit().summary2()
Out[1]: 
<class 'statsmodels.iolib.summary2.Summary'>
"""
                   Results: Ordinary least squares
======================================================================
Model:                OLS               Adj. R-squared:      0.624    
Dependent Variable:   Pct_Remain        AIC:                 2109.8669
Date:                 2018-03-04 15:19  BIC:                 2140.1375
No. Observations:     325               Log-Likelihood:      -1046.9  
Df Model:             7                 F-statistic:         77.81    
Df Residuals:         317               Prob (F-statistic):  4.59e-65 
R-squared:            0.632             Scale:               37.702   
----------------------------------------------------------------------
                    Coef.   Std.Err.    t    P>|t|    [0.025   0.975] 
----------------------------------------------------------------------
Intercept          123.1056  14.3254  8.5935 0.0000   94.9207 151.2905
Q("White British") -65.0829  19.6060 -3.3195 0.0010 -103.6572 -26.5086
Q("White Other")     9.7304   4.2564  2.2861 0.0229    1.3561  18.1046
Asian               -9.4880   2.8771 -3.2977 0.0011  -15.1487  -3.8273
Black               46.3033   4.4566 10.3899 0.0000   37.5351  55.0715
Other               -0.8730   4.9839 -0.1752 0.8611  -10.6786   8.9326
y2015_WBR          -48.5744  18.5010 -2.6255 0.0091  -84.9746 -12.1742
IMD                  1.4646   0.2204  6.6449 0.0000    1.0310   1.8983
----------------------------------------------------------------------
Omnibus:               12.016         Durbin-Watson:            1.845 
Prob(Omnibus):         0.002          Jarque-Bera (JB):         12.578
Skew:                  0.418          Prob(JB):                 0.002 
Kurtosis:              3.478          Condition No.:            572   
======================================================================

"""

In [2]: bestUK.fit().summary2()
Out[2]: 
<class 'statsmodels.iolib.summary2.Summary'>
"""
                   Results: Ordinary least squares
=====================================================================
Model:               OLS               Adj. R-squared:      0.547    
Dependent Variable:  Pct_Remain        AIC:                 2300.3152
Date:                2018-03-04 15:19  BIC:                 2327.2605
No. Observations:    347               Log-Likelihood:      -1143.2  
Df Model:            6                 F-statistic:         70.53    
Df Residuals:        340               Prob (F-statistic):  9.16e-57 
R-squared:           0.555             Scale:               43.437   
---------------------------------------------------------------------
                    Coef.   Std.Err.    t    P>|t|   [0.025   0.975] 
---------------------------------------------------------------------
Intercept          119.1428  13.0029  9.1628 0.0000  93.5666 144.7191
Q("White British") -55.5533  20.2003 -2.7501 0.0063 -95.2866 -15.8199
Q("White Other")    30.5484   3.9682  7.6984 0.0000  22.7432  38.3537
Asian               -9.0141   2.6544 -3.3959 0.0008 -14.2351  -3.7930
Black               24.8085   3.5454  6.9974 0.0000  17.8348  31.7822
Other                4.5733   4.9236  0.9289 0.3536  -5.1112  14.2578
y2015_WBR          -35.2234  15.8739 -2.2190 0.0271 -66.4467  -4.0000
---------------------------------------------------------------------
Omnibus:                3.247         Durbin-Watson:            1.823
Prob(Omnibus):          0.197         Jarque-Bera (JB):         2.962
Skew:                   0.207         Prob(JB):                 0.227
Kurtosis:               3.181         Condition No.:            148  
=====================================================================

"""
maczokni commented 6 years ago

Cool! Is there a way to evaluate the difference between using different years for change_since, and how much the choice of years affects the results?

cmcaine commented 6 years ago

If you run ipython -i brellenge_prep.py and then enter exploreUK() or exploreEngland() at the prompt, the code will try to fit models for all 400 options and return a list of tuples of (since, till, OLS model, OLS model.fit(), adjusted R-squared). We could do a 3D or coloured scatter plot to explore the space if we wanted.
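For context, a rough sketch of what that grid search does. The real exploreUK() lives in brellenge_prep.py, so the year range, formula, and error handling here are assumptions; the result columns match those used by the heatmap code further down.

# Rough sketch of an exploreUK()-style grid search; the real code is in
# brellenge_prep.py, so the year range, formula and error handling here
# are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

UK_FORMULA = ('Pct_Remain ~ Q("White British") + Q("White Other") + Asian'
              ' + Black + Other + y2015_WBR')

def explore(change_since, metadata, years, formula=UK_FORMULA):
    rows = []
    for since in years:
        for till in years:
            try:
                ols = smf.ols(formula, data=change_since(since, till).join(metadata))
                fit = ols.fit()
                rows.append((since, till, ols, fit, fit.rsquared_adj))
            except Exception:
                # year pairs where the model can't be fitted show up as NaN
                rows.append((since, till, None, None, np.nan))
    return pd.DataFrame(rows, columns=['since', 'till', 'ols', 'ols.fit', 'rsq'])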

I found that just the years 2004-2006 give an adjusted R-squared of 0.34 on their own. Other three-year intervals are quite a lot worse, so there's probably something in it. Kudos to @malteserteresa for suggesting that interval.

Going backwards is a lot better for most pairs, I suspect because I normalise by dividing the difference between population fractions by the population fraction of the first year given. By dividing by the later year I think I'm encoding more of the information about the relative populations of that later year, which means I'm getting better information on the population makeup in the year closest to the vote.
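A sketch of the normalisation being described; change_since() itself is in brellenge_prep.py, so treat the exact arithmetic here as an assumption.

# Sketch of the normalisation described above; the real change_since() is in
# brellenge_prep.py, so the exact arithmetic here is an assumption.
def change_since_sketch(frac_since, frac_till):
    """frac_since, frac_till: population fractions per ethnic group
    (rows = local authorities, columns = groups) for the two years.

    The change is normalised by the first year given, so "going backwards"
    (since > till, e.g. change_since(2016, 2001)) divides by the more
    recent year's population fraction.
    """
    return (frac_since - frac_till) / frac_since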

cmcaine commented 6 years ago

Heatmap of the output of exploreUK():

[Image: exploreuk_heatmap]

White means the model didn't converge, darker colours are better.

You can clearly see that going backwards is better and that change in immigration before 2003 is pretty much useless.

The regression model's features are only change_since(since, till) and the 2015 White British population.

Code to reproduce:

In [14]: res = exploreUK()
/usr/lib/python3.6/site-packages/numpy/linalg/linalg.py:1647: RuntimeWarning: invalid value encountered in greater
  return count_nonzero(S > tol, axis=-1)

In [15]: seaborn.heatmap(res.drop(columns=['ols', 'ols.fit']).pivot('since', 'till', 'rsq'), square=True, cmap="YlGnBu")
Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff0700b7240>

In [16]: plt.show()
cmcaine commented 6 years ago

[Image: explore_compare_uk_england_heatmap]

Comparison of the UK regression and the England-only regression (the England model includes IMD as well).
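A sketch of how a side-by-side comparison like this could be drawn, assuming exploreEngland() returns a frame with the same columns as exploreUK():

# Sketch only: assumes exploreEngland() returns the same columns as exploreUK().
import matplotlib.pyplot as plt
import seaborn

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, res, title in [(axes[0], exploreUK(), 'UK'),
                       (axes[1], exploreEngland(), 'England (with IMD)')]:
    rsq = res.drop(columns=['ols', 'ols.fit']).pivot('since', 'till', 'rsq')
    seaborn.heatmap(rsq, square=True, cmap="YlGnBu", ax=ax)
    ax.set_title(title)
plt.show()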

HeatherARobinson commented 6 years ago

Would your conclusion be that only post-2003 change is influential? And that change in Asian and Black population representation was influential where it spiked within short time periods (i.e. if the significance appears and reappears as we adjust the time window)? I'm just trying to get my head around the outputs (I haven't come across this type of analysis before).

cmcaine commented 6 years ago

I'd say the interesting things about these heatmaps are:

1) It's really important to go backwards, i.e. normalise the change by dividing by the more recent year.
2) Normalising by a year < 2003 is so bad that most models don't converge.
3) Change from X back to 2001 is the best column.
4) All of the best models include the period 2003-2013.

I can't say anything about Asian or Black population change from those heatmaps, but log(1 / Asian p-value) is correlated with rsq, as shown below:

[Image: exploreuk_asian_log_pvalue_against_rsq]

For Black population:

[Image: exploreuk_black_log_pvalue_against_rsq]

And here are some heatmaps for the Asian data. The first is coloured by log(1 / p-value); log(1 / 0.05) == 2.99. The second is binary on pvalue['Asian'] < 0.05:

[Images: exploreuk_asian_pvalue_heatmap, exploreuk_asian_pvalue_heatmap_binary]
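A sketch of how plots like these might be produced from the exploreUK() output; treating non-fitted rows in the 'ols.fit' column as missing is an assumption.

# Sketch only: building the p-value plots from the exploreUK() output.
# Treating non-fitted rows in 'ols.fit' as missing is an assumption.
import numpy as np
import matplotlib.pyplot as plt
import seaborn

res = exploreUK()

# log(1 / p-value) for the Asian coefficient of each fitted model
res['asian_logp'] = res['ols.fit'].map(
    lambda f: np.log(1 / f.pvalues['Asian']) if hasattr(f, 'pvalues') else np.nan)

# scatter of log(1 / p-value) against adjusted R-squared
plt.scatter(res['rsq'], res['asian_logp'])
plt.show()

# heatmap coloured by log(1 / p-value), then a binary version on p < 0.05
logp = res.pivot('since', 'till', 'asian_logp')
seaborn.heatmap(logp, square=True, cmap="YlGnBu")
plt.show()
seaborn.heatmap((logp > np.log(1 / 0.05)).astype(float), square=True, cmap="YlGnBu")
plt.show()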