Open ld-archer opened 1 year ago
Rob found this paper on the multinomial fused lasso, which is a method that has been used to determine how coefficients trend over time in longitudinal data and could be used to predict their future state. Seems a good place to start (after plotting the coefficients).
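For reference, a rough sketch of what a fused lasso objective looks like in general, so we're all picturing the same thing. This is the generic textbook form and not necessarily the exact formulation in the paper; the symbols ($\theta_t$ for the coefficient at timestep $t$, $\ell_t$ for the model loss at that timestep, $\lambda_1$ and $\lambda_2$ for the tuning parameters) are my own notation:

$$
\min_{\theta_1,\dots,\theta_T}\; \sum_{t=1}^{T} \ell_t(\theta_t) \;+\; \lambda_1 \sum_{t=1}^{T} \lvert \theta_t \rvert \;+\; \lambda_2 \sum_{t=2}^{T} \lvert \theta_t - \theta_{t-1} \rvert
$$

The $\lambda_2$ term penalises jumps between consecutive timesteps, which is what pushes the fitted coefficients to be piecewise constant over time rather than bouncing around from year to year.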
I think this is being superseded by #248, so I'll just post the coefficient plots for hh_income as an illustration.
It seems quite clear from these plots that a few outliers are very influential in determining the coefficients, and that the trend is less clean than we were hoping for. Hopefully the fused lasso will provide some insight, or be the solution. Also note that these are diff models, not mean regenerative.
This job_sector plot is not great, but that can most likely be explained by the fact that 0 is a catch-all code covering anyone who does not work in either the public or private sector, or who has a missing value for either. That could mean retired, unemployed, long-term disabled, wealthy enough not to need to work, or any other reason. If we go forward with it, we could try combining it with another variable, adding more levels, or finding some other way of separating these groups.
Some notes from meeting with WS7 team (5/5/23).
Firstly, this is not cross-validation but just validation of outcomes. Second, using yearly models in this sense is a bit disingenuous as we are not showing how our models generalise to data in a different temporal space, but rather that a model fit to a single year (timestep) can effectively predict the next timestep. There were other reasons mentioned as to why this is not quite right but I'll skip those for my sanity.
What we can do instead is to investigate the change in model coefficients over time, and depending on those results we can decide how best to proceed.
If the coefficients are relatively stable, and therefore effectively time-invariant, we can assume that a one-year model of the kind we have been using up until now is suitable for predicting the outcome into the future, so we can use a model from any year (or all years?).
If the coefficients are not stable but follow a trend over time, we can fit a model to the coefficients and try to predict the future state of the model coefficients. Using this model of model coefficients, we can estimate what the model coefficients will be at the next timestep, which we can then use to predict the response variable at that timestep, with some confidence that our model takes into account temporal trends in the underlying data.
If the coefficients are not stable and also arbitrary (as in they bounce all over the place with no discernible pattern), then we're screwed and it's back to the drawing board, or let's find some new data...
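To make the three cases above a bit more concrete, here is a minimal Python sketch of how one coefficient's yearly estimates could be sorted into stable / trending / erratic, and extrapolated one timestep ahead in the trending case. Everything in it is an assumption for illustration: the function names, the thresholds `stable_tol` and `trend_r2`, and the use of a plain straight-line fit rather than the fused lasso. The example values are made up and are not from our models.

```python
from statistics import mean


def linear_trend(years, coefs):
    """Least-squares fit of coef = intercept + slope * year.

    Returns (slope, intercept, r_squared).
    """
    if len(years) != len(coefs) or len(set(years)) < 2:
        raise ValueError("need matching lists covering at least two distinct years")
    y_bar, c_bar = mean(years), mean(coefs)
    sxx = sum((y - y_bar) ** 2 for y in years)
    sxy = sum((y - y_bar) * (c - c_bar) for y, c in zip(years, coefs))
    slope = sxy / sxx
    intercept = c_bar - slope * y_bar
    ss_tot = sum((c - c_bar) ** 2 for c in coefs)
    ss_res = sum((c - (intercept + slope * y)) ** 2 for y, c in zip(years, coefs))
    r_squared = 1.0 if ss_tot == 0 else 1 - ss_res / ss_tot
    return slope, intercept, r_squared


def classify(years, coefs, stable_tol=0.05, trend_r2=0.7):
    """Label a coefficient series as 'stable', 'trending' or 'erratic'.

    stable_tol and trend_r2 are arbitrary illustrative thresholds.
    """
    centre = mean(coefs)
    spread = max(coefs) - min(coefs)
    # Case 1: the range of values is small relative to the average size.
    if spread <= stable_tol * max(abs(centre), 1e-12):
        return "stable"
    # Case 2: a straight line in time explains most of the variation.
    _, _, r_squared = linear_trend(years, coefs)
    if r_squared >= trend_r2:
        return "trending"
    # Case 3: neither flat nor following a linear trend.
    return "erratic"


def predict_next(years, coefs):
    """Extrapolate the fitted line one timestep past the last observed year."""
    slope, intercept, _ = linear_trend(years, coefs)
    return intercept + slope * (max(years) + 1)


# Made-up example series, one value per yearly model.
years = [2015, 2016, 2017, 2018, 2019, 2020]
stable = [0.50, 0.51, 0.50, 0.49, 0.50, 0.51]
trending = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
erratic = [0.5, -0.3, 0.8, -0.6, 0.4, -0.2]

print(classify(years, stable))    # stable
print(classify(years, trending))  # trending
print(classify(years, erratic))   # erratic
print(predict_next(years, trending))
```

A straight-line fit will miss step changes or curved trends, and as the hh_income plots suggest, a few outlying years could easily flip the label, so this is only a first-pass check before trying anything like the fused lasso.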
STEPS: