Open cmcaine opened 6 years ago
Cool! Is there a way to evaluate the difference between using the different years for change_since, and how much choosing different years affects the results?
If you run ipython -i brellenge_prep.py and then enter exploreUK() or exploreEngland() at the command prompt, the code will try to fit models for all 400 options and return a list of tuples of the form (since, till, OLS Model, OLS Model.fit(), Adjusted R squared). We could do a 3D or coloured scatter plot to explore the space if we wanted.
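For anyone who wants the shape of that search without reading the source, here is a minimal sketch. It is not the code in brellenge_prep.py: the explore function, the data layout and the numbers are all made up for illustration, and it scores a hand-rolled one-feature least-squares fit instead of returning fitted OLS model objects (the real model also has a second feature).

```python
from itertools import product


def adjusted_r_squared(xs, ys, n_features=1):
    """Adjusted R squared of a one-feature least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # degenerate fit, nothing to score
    r_squared = sxy ** 2 / (sxx * syy)
    return 1 - (1 - r_squared) * (n - 1) / (n - n_features - 1)


def explore(fractions_by_year, outcome):
    """Score every ordered (since, till) pair of years, forwards and backwards.

    fractions_by_year: {year: [population fraction per area]} (hypothetical layout)
    outcome: [value to predict per area]
    """
    results = []
    for since, till in product(sorted(fractions_by_year), repeat=2):
        base = fractions_by_year[since]
        if since == till or any(b == 0 for b in base):
            results.append((since, till, None))  # no change to model
            continue
        # Relative change, normalised by the `since` year.
        change = [(t - b) / b for b, t in zip(base, fractions_by_year[till])]
        results.append((since, till, adjusted_r_squared(change, outcome)))
    return results


# Made-up data: three years, four areas.
fractions = {
    2001: [0.10, 0.20, 0.30, 0.40],
    2004: [0.12, 0.21, 0.36, 0.41],
    2006: [0.15, 0.22, 0.39, 0.48],
}
outcome = [0.55, 0.48, 0.60, 0.52]
results = explore(fractions, outcome)
```

With 20 candidate years the same double loop gives 20 × 20 = 400 (since, till) combinations, which would match the count above if that is how the options are enumerated.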
I found that just the years 2004-2006 give an adjusted R squared of 0.34 on their own. Other three-year intervals are quite a lot worse, so there's probably something in it. Kudos to @malteserteresa for suggesting that interval.
Going backwards (i.e. giving the later year first) is a lot better for most pairs. I suspect that's because I normalise by dividing the difference between population fractions by the population fraction of the first year given. By dividing by the later year, I think I'm encoding some more of the information about the relative populations of that later year, which means I'm getting better information on the population makeup during the year that's closest to the vote.
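To make the forwards/backwards difference concrete, here is a small sketch of that normalisation. relative_change is a hypothetical stand-in for change_since (the real function may differ in signature and detail), and the fractions are made up.

```python
def relative_change(fractions, since, till):
    """Change in a population fraction from `since` to `till`,
    divided by the fraction in the `since` year (the first year given)."""
    return (fractions[till] - fractions[since]) / fractions[since]


# Made-up population fractions for a single area.
fractions = {2004: 0.10, 2006: 0.12}

forwards = relative_change(fractions, 2004, 2006)   # divides by the 2004 value
backwards = relative_change(fractions, 2006, 2004)  # divides by the 2006 value
```

The two results differ in sign and in magnitude, because the same absolute difference is scaled by a different year's fraction. Going backwards scales it by the later year, the one closer to the vote.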
Heatmap of the output of exploreUK():
White means the model didn't converge, darker colours are better.
Can clearly see that backwards is better and that change in immigration before 2003 is pretty much useless.
The regression model's only features are change_since(since, till) and the 2015 White British population.
Code to reproduce:
In [14]: res = exploreUK()
/usr/lib/python3.6/site-packages/numpy/linalg/linalg.py:1647: RuntimeWarning: invalid value encountered in greater
return count_nonzero(S > tol, axis=-1)
In [15]: seaborn.heatmap(res.drop(columns=['ols', 'ols.fit']).pivot(index='since', columns='till', values='rsq'), square=True, cmap="YlGnBu")
Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0x7ff0700b7240>
In [16]: plt.show()
Comparison of UK regression and England only regression (England has IMD as well)
Would your conclusions be that only post 2003 change is influential? And that change in Asian and Black population representation was influential where this spiked within short time periods (if the significance appears and reappears as we adjust the time window)? Just trying to get my head around the outputs (I haven't come across this type of analysis before)
I'd say the interesting things about these heatmaps are:
1) It's really important to go backwards, i.e. normalise the change by dividing by the more recent year
2) Normalising by a year < 2003 is so bad most models don't converge
3) Change from X back to 2001 is the best column
4) All of the best models include the period 2003-2013
I can't say anything about Asian or Black population change from those heatmaps, but log(1 / Asian p-value) is correlated with rsq, as shown below:
For Black population:
And here are some heatmaps for the Asian data. The first is coloured by the log of (1 / p value); for reference, log(1 / 0.05) ≈ 3.00 (natural log). The next is binary on pvalue['Asian'] < 0.05:
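For reference, the two colourings can be sketched like this. The function names and the list of p-values are made up for illustration; only the log(1 / p) transform and the 0.05 cut-off come from the description above, and I'm assuming the natural log.

```python
import math


def p_value_score(p_value):
    """log(1 / p): larger means more significant (natural log assumed)."""
    return math.log(1 / p_value)


def is_significant(p_value, alpha=0.05):
    """Binary colouring: True where the p-value is below the cut-off."""
    return p_value < alpha


threshold_score = p_value_score(0.05)  # the score that corresponds to p = 0.05

# Made-up p-values, one per (since, till) model.
p_values = [0.20, 0.05, 0.01, 0.001]
scores = [p_value_score(p) for p in p_values]
flags = [is_significant(p) for p in p_values]
```

Any cell with a score above threshold_score in the first heatmap is one that shows up as significant in the binary one.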
Best regressions I've found so far for England and for the whole of the UK. The years for change_since are picked by a for loop. Details in source, but it's pretty simple.
England benefits from IMD data slightly, but also just from excluding the other areas: