erlris / intergen_ml

Project on intergenerational mobility as a prediction problem

Big-picture paper plan #5

Open jackblun opened 6 years ago

jackblun commented 6 years ago

The goal of this issue is to discuss big-picture ideas for how to frame the paper.

jackblun commented 6 years ago

Here is one potential idea for how to proceed:

Separate the Norwegian and UK analyses into two different papers.

  1. In the UK, focus on the potential issue of measurement error, building on the existing literature on the topic. I think (hopefully) we have something to add here. To me, the measurement error issue is a little confusing when presented alongside the other material in the current paper. This UK-only paper would be directed at a different audience and would speak to the debate between sociologists and economists on trends in the UK. Provided this hasn't been done before in some other way, I think our contribution will be publishable here. We can also present the method as a more robust way to measure mobility when there is concern about measurement error. This would be our short, statistical paper. I think it would be feasible to do relatively quickly and send to places like the Oxford Bulletin of Economics and Statistics, or perhaps the EJ (although that might be a stretch). I'm happy to take the lead on it.
  2. For the Norwegian paper, work on something more formal. The data quality allows us to abstract from measurement error issues and talk more conceptually. This would be a broader paper than the above, and contribute more generally to the measurement of intergenerational mobility / equality of opportunity.

jackblun commented 6 years ago

In response to the justified criticism we are getting about setting up OLS to fail by throwing in useless predictors: to be clear, I think we should have tables in which we incrementally add new sets of predictors, showing the in-sample and out-of-sample R2 for each set of predictors for linear regression, elastic net and then a non-linear model. This would help clarify exactly what is going on.
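A minimal sketch of how such a table could be produced, on simulated data. The variable names, the three nested predictor sets, the block of deliberately uninformative predictors and the choice of gradient boosting as the non-linear model are all assumptions for illustration, not the project's actual specification.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000

# Simulated stand-ins for the predictor sets (all hypothetical).
parent_income = rng.normal(size=(n, 1))
parent_educ = 0.5 * parent_income + rng.normal(size=(n, 1))
noise_vars = rng.normal(size=(n, 20))  # deliberately uninformative predictors

# Child outcome: linear in income and education, plus a non-linearity at the top.
y = (0.4 * parent_income[:, 0] + 0.2 * parent_educ[:, 0]
     + 0.3 * np.maximum(parent_income[:, 0] - 1.0, 0.0) ** 2
     + rng.normal(size=n))

# Nested predictor sets, added incrementally.
predictor_sets = {
    "income": parent_income,
    "+ education": np.hstack([parent_income, parent_educ]),
    "+ noise": np.hstack([parent_income, parent_educ, noise_vars]),
}

# Factories so that every cell of the table gets a fresh, unfitted model.
models = {
    "OLS": lambda: LinearRegression(),
    "Elastic net": lambda: ElasticNetCV(cv=5, random_state=0),
    "Boosting": lambda: GradientBoostingRegressor(
        n_estimators=100, max_depth=2, random_state=0),
}

# One fixed train/test split shared by all models and predictor sets.
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.3, random_state=0)

rows = []
for set_name, X in predictor_sets.items():
    for model_name, make_model in models.items():
        model = make_model().fit(X[idx_train], y[idx_train])
        r2_in = r2_score(y[idx_train], model.predict(X[idx_train]))
        r2_out = r2_score(y[idx_test], model.predict(X[idx_test]))
        rows.append((set_name, model_name, r2_in, r2_out))

print(f"{'predictors':<14}{'model':<14}{'R2 in':>8}{'R2 out':>8}")
for set_name, model_name, r2_in, r2_out in rows:
    print(f"{set_name:<14}{model_name:<14}{r2_in:8.3f}{r2_out:8.3f}")
```

Because the predictor sets are nested, the in-sample R2 of OLS cannot fall as sets are added, so any deterioration from the useless predictors should show up only in the out-of-sample column.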

jackblun commented 6 years ago

So a slightly different suggestion for direction for the bigger Norwegian paper: Tagging.

This would be quite a big change, but something that I think would be much much easier to frame.

We know that, overfitting aside (less of an issue in admin data), machine learning tends to perform better than standard methods when predicting tails. See for example Mullainathan's judges work. So let's think about predicting the tail end of the income distribution (the bottom 10%, say). This would then be a classification model. The policy relevance would be clear: how well could 'early action' interventions target individuals at risk of having low future income?

I just saw Raj motivate some of his regional inequality of opportunity work by this reasoning. I think the following four questions would be interesting: 1) Is it the case that there are groups of individuals who, based on early-childhood observables, are very likely to fall into this group? 2) Does ML offer substantial gains over logit models? (Judging by Mullainathan et al, it hopefully would...) 3) What is the loss of targeting power when we restrict certain variables, for example allowing only regional vars vs. parental income etc.? 4) How well does a model fitted to, say, a 1970 birth cohort apply to a 1980 birth cohort? Raj is very interested in this type of question and I think this data would allow us to do a lot here. This is very relevant for actual policy implementation and I think in general tells us something about the value of all intergenerational mobility research.
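A rough sketch of how questions 2) and 4) could be set up, on simulated data. The data-generating process, the variable names, the cohort labels and the use of gradient boosting as the ML model are assumptions for illustration only. Targeting performance is summarised by the AUC and by the share of true bottom-decile individuals among the 10% with the highest predicted risk.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def simulate_cohort(n, seed, educ_effect):
    """Simulated early-childhood observables and a bottom-decile income flag."""
    rng = np.random.default_rng(seed)
    parent_income = rng.normal(size=n)
    parent_educ = 0.5 * parent_income + rng.normal(size=n)
    # Extra risk where parental income and education are both low (an interaction).
    both_low = (parent_income < -1.0) & (parent_educ < -0.5)
    child_income = (0.3 * parent_income + educ_effect * parent_educ
                    - 0.8 * both_low + rng.normal(size=n))
    bottom_decile = child_income <= np.quantile(child_income, 0.10)
    return np.column_stack([parent_income, parent_educ]), bottom_decile.astype(int)


def precision_at_top(y_true, risk, share=0.10):
    """Share of true positives among the `share` of people with highest predicted risk."""
    k = int(round(share * len(risk)))
    targeted = np.argsort(-risk)[:k]
    return y_true[targeted].mean()


# Hypothetical cohorts: the later one has a weaker link to parental education.
X_train, y_train = simulate_cohort(5000, seed=1, educ_effect=0.3)  # "1970", used for fitting
X_same, y_same = simulate_cohort(5000, seed=2, educ_effect=0.3)    # fresh "1970" draw
X_later, y_later = simulate_cohort(5000, seed=3, educ_effect=0.1)  # "1980"

models = {
    "logit": LogisticRegression(),
    "boosting": GradientBoostingClassifier(
        n_estimators=100, max_depth=2, random_state=0),
}
eval_sets = {"same cohort": (X_same, y_same), "later cohort": (X_later, y_later)}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for label, (X, y) in eval_sets.items():
        risk = model.predict_proba(X)[:, 1]
        results[(name, label)] = (roc_auc_score(y, risk), precision_at_top(y, risk))

for (name, label), (auc, prec) in results.items():
    print(f"{name:<10}{label:<14}AUC={auc:.3f}  precision@10%={prec:.3f}")
```

Question 3) could be handled in the same loop by refitting each model on a subset of the columns.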

We could even then do some very back-of-the-envelope policy experiments based on the targeting performance.

I'm quite excited about this and look forward to hearing thoughts! It wouldn't be possible with the survey data; however, as we know, there may be other, more statistical issues we can discuss there.

jackblun commented 6 years ago

Another possible avenue forward... Instead of focusing on the bottom end, focus on the top one percent. Aghion and co have a paper on the 'social origins and IQ of inventors' (attached). I think it would be very possible to do the same for top earners and, somewhat surprisingly, I haven't seen anything on this for top 1 percent earners. This is probably much more interesting since Norway has seen quite a big growth in the top 1 percent income share since the 1980s.

Akcigit and Toivanen - The Social Origins and IQ of Inventors∗.pdf

jackblun commented 6 years ago

On the previous idea, I've created a lit review here: https://docs.google.com/document/d/13ToPQl_TE59M3RPsKJYEUh77xGL3rrtvkCIHZqWyipk/edit#

jackblun commented 6 years ago

So the Bertrand paper has actually been extremely useful. It puts into words a lot of things that have been quite fuzzy in my head. Great find!

In light of what I've read there, I think we could adjust the plan as follows: 1) Keep the description of R2 that we have at the moment as a motivating discussion of what measures of predictive fit actually pick up. I think this is quite nice at illustrating what drives predictive fit. 2) Proceed as discussed in the call, but also (or instead) do classification at the tails, as in Bertrand. This would involve keeping only the top and bottom quartile of earners for each cohort, and then classifying which of these two groups they end up in. With this restriction we should find much more predictive power, and we can additionally motivate the measure by saying it is more informative for the tails than other measures, which is probably more interesting than averages across the whole distribution. A simple logit to start with would be fine, and then lasso or something more complicated once this is working. 3) As a follow-up we could also do the same for education (degree / non-degree), using the re-weighting method from Bertrand so that we have equal group sizes for different samples.
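A minimal sketch of step 2) on simulated data: keep the top and bottom quartile of earners within each cohort, then fit a simple logit for which tail an individual ends up in. The helper `tail_sample`, the variable names and the data-generating process are hypothetical, and the re-weighting from step 3) is not attempted here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def tail_sample(earnings, cohort, q=0.25):
    """Flag the top and bottom q of earners within each cohort.

    Returns (keep, is_top): `keep` marks individuals in either tail,
    `is_top` marks those in the top tail.
    """
    keep = np.zeros(len(earnings), dtype=bool)
    is_top = np.zeros(len(earnings), dtype=bool)
    for c in np.unique(cohort):
        in_c = cohort == c
        lo, hi = np.quantile(earnings[in_c], [q, 1 - q])
        keep |= in_c & ((earnings <= lo) | (earnings >= hi))
        is_top |= in_c & (earnings >= hi)
    return keep, is_top


rng = np.random.default_rng(1)
n = 8000
cohort = np.repeat([1970, 1980], n // 2)  # hypothetical birth cohorts
parent_income = rng.normal(size=n)
parent_educ = 0.5 * parent_income + rng.normal(size=n)
earnings = 0.3 * parent_income + 0.2 * parent_educ + rng.normal(size=n)
X = np.column_stack([parent_income, parent_educ])

# Restrict to the two tails and build the binary outcome.
keep, is_top = tail_sample(earnings, cohort)
X_tail, y_tail = X[keep], is_top[keep].astype(int)

# Simple logit, evaluated on a held-out part of the tail sample.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_tail, y_tail, test_size=0.3, random_state=0, stratify=y_tail)
logit = LogisticRegression().fit(X_tr, y_tr)
auc_tail = roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1])

print(f"kept {keep.mean():.0%} of the sample, {y_tail.mean():.0%} of it in the top tail")
print(f"hold-out AUC, top vs bottom quartile: {auc_tail:.3f}")
```

Swapping `LogisticRegression` for a lasso-penalised or tree-based classifier would leave the rest of the pipeline unchanged.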

What do you think of this as a plan? I was thinking that a useful thing to do would be to do a proper, full lit review as I did for the 1 percent stuff. I will also do the above for the UK survey data, but I think that the emphasis should be on the Norwegian data.

erlris commented 6 years ago

I like the idea of doing something similar to the Bertrand paper, and I think that we might be able to argue that the tails of the distribution are more interesting in the Norwegian/Scandinavian setting. I think one thing we should discuss is whether we want to do quartiles in the Norwegian data or go even further into the tails. My impression was that they use quartiles in the Bertrand paper to ensure sufficient sample sizes.

jackblun commented 6 years ago

In case you're interested, here's a poster I just put together for our (now feeling very outdated...!) project: intergenerational-mobility-poster (10).pdf

erlris commented 6 years ago

Looks good! Is this for next week?

jackblun commented 6 years ago

Yeah, it's for a poster presentation at the summer school next week.

jackblun commented 6 years ago

So, I've been thinking and talking with people to find out what they find interesting about what we are doing. I think that one way of getting further away from the equality of opportunity literature is to actually predict income mobility.

I've been experimenting with this in the UK data as follows: 1) Restrict to individuals in the bottom quartile of faminc. 2) Fit a model predicting whether or not they make it to the top quartile.

This is going to be much more about looking at what the individual predictors are rather than using the predictive fit as a measure, and is much closer to what Chetty and co do.
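A small sketch of the two steps on simulated data, with the focus on the individual predictors. Only `faminc` comes from the description above; the other variable names, the simulated relationships and the use of a standardised logit are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n = 8000

# Simulated childhood characteristics (hypothetical apart from faminc).
faminc = rng.lognormal(mean=10.0, sigma=0.6, size=n)
parent_educ = 0.5 * np.log(faminc) + rng.normal(size=n)
test_score = 0.3 * parent_educ + rng.normal(size=n)
child_income = (0.3 * np.log(faminc) + 0.2 * parent_educ
                + 0.4 * test_score + rng.normal(size=n))

# Step 1: restrict to individuals from the bottom quartile of family income.
bottom = faminc <= np.quantile(faminc, 0.25)

# Step 2: outcome is reaching the top quartile of the full child income distribution.
made_it = child_income >= np.quantile(child_income, 0.75)

names = ["log_faminc", "parent_educ", "test_score"]
X = np.column_stack([np.log(faminc), parent_educ, test_score])[bottom]
y = made_it[bottom].astype(int)

# Standardise so that coefficients are comparable across predictors.
X_std = StandardScaler().fit_transform(X)
logit = LogisticRegression().fit(X_std, y)

print(f"share of bottom-quartile children reaching the top quartile: {y.mean():.3f}")
for name, coef in zip(names, logit.coef_[0]):
    print(f"{name:<12}{coef:+.3f}")
```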

I think the huge advantage we have over the Chetty work is that they rely a lot on using neighborhood rather than individual characteristics. Based on their analysis, it's not clear how much of what they find is neighborhood vs individual. With the Norwegian data, we have much more detailed data (i.e. education, wealth, family stuff) on individuals.

jackblun commented 6 years ago

Following on from the comment I made yesterday in the 'useful comments from seminars etc' issues, I really think it would be valuable to produce some tables like this in the regression framework:

screen shot 2018-07-07 at 22 22 04

Doing this for different sets of models with different sets of predictors would be really informative. Note in this table (Mullainathan and Spiess, JEP) the very marginal improvement in total R2, but big improvements at different parts of the distribution.

It would be really interesting to do this first just with income, then incrementally adding different predictors (probably mostly just wealth and education). Including just income in one of our tree-based methods would no doubt deliver better performance at the top tail, as we know there are non-linearities there, so this would more-or-less just be testing something we can eyeball. However, more interesting are questions such as: Is wealth more relevant at one of the tails than the other (I imagine almost entirely the top tail), and for which tail is education more informative? The way to pitch this would be not that we are creating a new overall measure, but rather that we are trying to build descriptive facts which could motivate further causal models, particularly at the tails.
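A sketch of one way to break hold-out fit down by part of the distribution, using income only and simulated data. Splitting by quintile of the realised outcome and reporting the improvement in MSE relative to OLS is my reading of that kind of table, so treat the construction as an assumption. The binned-mean predictor is only a crude stand-in for a tree-based method, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Simulated parent and child income with a non-linearity at the top.
parent_income = rng.normal(size=n)
child_income = (0.4 * parent_income
                + 0.5 * np.maximum(parent_income - 1.0, 0.0) ** 2
                + rng.normal(size=n))

train = rng.random(n) < 0.7
x_tr, y_tr = parent_income[train], child_income[train]
x_te, y_te = parent_income[~train], child_income[~train]

# Benchmark: OLS of child income on parental income.
design = np.column_stack([np.ones_like(x_tr), x_tr])
beta, *_ = np.linalg.lstsq(design, y_tr, rcond=None)
pred_ols = beta[0] + beta[1] * x_te

# Flexible stand-in: mean child income within 20 parental income bins.
edges = np.quantile(x_tr, np.linspace(0.05, 0.95, 19))
bins_tr = np.digitize(x_tr, edges)
bin_means = np.array([y_tr[bins_tr == b].mean() for b in range(20)])
pred_flex = bin_means[np.digitize(x_te, edges)]


def r2(y_true, y_pred):
    """Out-of-sample R2: one minus MSE over the variance of the outcome."""
    return 1.0 - np.mean((y_true - y_pred) ** 2) / np.var(y_true)


def mse_by_quintile(y_true, y_pred):
    """Mean squared error within each quintile of the realised outcome."""
    cuts = np.quantile(y_true, [0.2, 0.4, 0.6, 0.8])
    quintile = np.digitize(y_true, cuts)
    return np.array([np.mean((y_true[quintile == k] - y_pred[quintile == k]) ** 2)
                     for k in range(5)])


mse_ols = mse_by_quintile(y_te, pred_ols)
mse_flex = mse_by_quintile(y_te, pred_flex)
rel_improvement = 1.0 - mse_flex / mse_ols

print(f"hold-out R2: OLS {r2(y_te, pred_ols):.3f}, binned means {r2(y_te, pred_flex):.3f}")
for k, gain in enumerate(rel_improvement, start=1):
    print(f"quintile {k}: improvement over OLS {gain:+.1%}")
```

Adding wealth or education would mean replacing the two predictors with multivariate models while keeping `mse_by_quintile` as it is.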

@erlris, what do you think about this? This type of thing is unlikely to work well in the UK data, since it is so noisy.

jackblun commented 6 years ago

In relation to the above, this paper is highly relevant: https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/submission_kleinbergliangmullainathan.pdf

erlris commented 6 years ago

I was in Oslo over the weekend so I was unable to look at it until now. I think doing something like this might make for some nice supplementary results, I'll add it to my to-do list. And the paper looks very interesting, nice find!

jackblun commented 6 years ago

A bit of a sideline, but I was talking to a mathematician about this the other day and we got talking about different ways to measure informativeness, i.e. what we are trying to get at with the R^2. That got me thinking about Entropy, and more broadly information theory. I am completely clueless about information theory and have a very loose understanding of entropy, but I think we are looking at similar ideas. Might at least be worth mentioning in the paper... https://en.wikipedia.org/wiki/Entropy_(information_theory)
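To make the link a bit more concrete, here is a sketch of an entropy-based analogue of R^2 for discrete outcomes: the share of the entropy of the child's income quintile that is removed by knowing the parent's quintile (sometimes called the uncertainty coefficient, or Theil's U). The simulated quintiles and the function names are assumptions for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def conditional_entropy(y, x):
    """H(Y | X): entropy of y within each value of x, weighted by group size."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def uncertainty_coefficient(y, x):
    """Share of the entropy of y removed by knowing x (0 = none, 1 = all)."""
    h_y = entropy(y)
    return (h_y - conditional_entropy(y, x)) / h_y


# Simulated quintiles: the child stays in the parent's quintile with
# probability 0.4 and is otherwise assigned a quintile at random.
random.seed(0)
parent_q = [random.randint(1, 5) for _ in range(10000)]
child_q = [p if random.random() < 0.4 else random.randint(1, 5) for p in parent_q]

u = uncertainty_coefficient(child_q, parent_q)
print(f"H(child) = {entropy(child_q):.3f} bits")
print(f"H(child | parent) = {conditional_entropy(child_q, parent_q):.3f} bits")
print(f"share of uncertainty removed = {u:.3f}")
```

Unlike R^2, this measure does not depend on how the categories are ordered or scaled, only on how much the conditional distributions differ.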

jackblun commented 5 years ago

Hey @erlris I've been going over the latest results and considering what the main points of the latest draft will be. Do you agree with the following? Anything to add?

  1. We could change the title to something like "Earnings and family background: Are we using the right measures?". I feel that signalling early on that what we are really doing is a measurement paper might be useful.
  2. At the national level, a simple rank-rank model captures a large proportion of the explainable variation in child earnings.
  3. At the national level, adding wealth and education length to earnings improves the predictive fit. Adding information on family structure, geography, income source, occupation and education type does not yield significantly more predictive power. In this sense, focusing on family education length, earnings and wealth seems to make sense if one is in general interested in the link between family background and child earnings.
  4. At the regional level, ranking areas based on rank-rank estimates aligns quite well with the total explanatory power of family background.
  5. At the regional level, comparing regions based on the relationship between parental education and child earnings yields similar results to those of rank-rank, but the same is not true if one looks at the influence of parental wealth.
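For reference, a minimal sketch of the rank-rank benchmark from point 2, on simulated data. The log-normal incomes, the function names and the 0 to 100 rank scale are assumptions for illustration.

```python
import numpy as np


def percentile_rank(x):
    """Rank observations from 0 to 100 (no tie handling; fine for continuous data)."""
    ranks = np.argsort(np.argsort(x))
    return 100.0 * ranks / (len(x) - 1)


def rank_rank(parent_income, child_income):
    """OLS of child rank on parent rank; returns (intercept, slope, R2)."""
    p = percentile_rank(parent_income)
    c = percentile_rank(child_income)
    slope, intercept = np.polyfit(p, c, deg=1)
    resid = c - (intercept + slope * p)
    r2 = 1.0 - resid.var() / c.var()
    return intercept, slope, r2


# Simulated log-normal incomes with a positive parent-child link.
rng = np.random.default_rng(0)
n = 5000
log_parent = rng.normal(size=n)
log_child = 0.4 * log_parent + rng.normal(size=n)
intercept, slope, r2 = rank_rank(np.exp(log_parent), np.exp(log_child))

print(f"rank-rank slope = {slope:.3f}, intercept = {intercept:.1f}, R2 = {r2:.3f}")
```

Since both rank variables have the same variance, the rank-rank slope equals the rank correlation and the R2 of this regression is simply the slope squared, which gives the benchmark that the richer models' predictive fit can be compared against.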

erlris commented 5 years ago

  1. I like this idea.

  2. Agree

  3. I agree for the most part, although there seems to be a small improvement if we look at the cross validation performance. As long as we are clear on what we mean by significant I think it's ok.

  4. Agree

  5. Agree, but here I think we might need to discuss the properties of the wealth measure. I'm thinking about the fact that the distribution is highly skewed towards zero etc.

jackblun commented 5 years ago

To follow up...

  5. Agreed on the wealth measure. However, we show that at the national level it adds quite a bit of predictive power, so we know the measure is not nonsense.

I'll get started on re-working the write-up when I have a minute. Sorry, it's quite a chaotic time at the moment!

erlris commented 5 years ago

Yes, there is definitely information in the wealth measure. No worries, I know the feeling. Let me know if anything comes up.