cjhutto / bsd

Bias Statement Detector (BSD) computationally detects and quantifies the degree of bias in sentence-level text of news stories.
MIT License

Standardizing output features #14

Closed jpfairbanks closed 6 years ago

jpfairbanks commented 6 years ago

Can we divide the feature values by the coefficients so that they are comparable with each other?

cjhutto commented 6 years ago

I've just completed data collection for 7-12 human judgments of perceived bias in ~1000 new samples of actual news stories (both journalistic news as well as Op-Ed articles), and am planning a hefty revision to the features. Will be doing a more rigorous job of making the features more independent and distinct (merging and collapsing isomorphic features), as well as adding new features from a more thorough social science and computer science literature review I recently conducted -- including literature-backed considerations for the use of quotes. Am working on this as I can (in spare time) over the next few days...

So the impact and relevance to your question is that the coefficients may change once I re-evaluate their relative importance and do a better analysis -- the first analysis was a quick-and-dirty simple linear regression using the 14 "best" features (identified using step-wise regression and various other inspections). But given that the newly acquired data follow a gamma distribution rather than a normal one (e.g., the prediction should never be negative for bias estimates -- i.e., the scale is continuous from 0 [unbiased] to 3 [extremely biased]), I am considering a GLM Gamma-family regression model... but have also considered other linear models such as Ridge, Lasso, PLS, or ElasticNet regression. Taking input/suggestions/discussion!
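For illustration, a minimal sketch of what a Gamma-family GLM fit could look like, assuming statsmodels; the library choice, data file, and column names are assumptions for the example, not part of the repo:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical training data: one row per sentence, columns are the extracted
# bias features, and "bias_rating" is the mean human rating on the 0-3 scale.
df = pd.read_csv("bias_training_data.csv")
X = sm.add_constant(df.drop(columns=["bias_rating"]))
y = df["bias_rating"]

# Gamma family with a log link keeps predictions strictly positive,
# which matches the constraint that bias estimates should never be negative.
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
print(result.summary())
```

One practical caveat: a Gamma GLM requires strictly positive responses, so exact-zero ("unbiased") ratings would need a small offset or a different family.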

jpfairbanks commented 6 years ago

Is it always true that for a GLM we can rescale the input features so that they are in a comparable space?

In the context of teaching journalists how to remove bias from their language, we need to make sure that the features are explainable. One way to do that is to rescale them so that the features with the largest/smallest values are the ones with the most impact on the regression result you see.

In GLM parlance the linear predictor is an inner product of the features and the coefficients. If we just do the multiplication part and not the addition part, then the "adjusted features" are simply summed, which is more intuitive for the user. As long as the link function is monotonic, bigger means bigger, and that makes sense.
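For illustration, a minimal NumPy sketch of that decomposition (the coefficients and feature values below are made up):

```python
import numpy as np

# Hypothetical fitted coefficients (beta) and one sentence's feature vector (x).
beta = np.array([0.8, -0.3, 1.2, 0.05])
x = np.array([2.0, 4.0, 0.5, 10.0])
intercept = 0.1

# Element-wise product: each entry is one feature's contribution to the linear predictor.
contributions = x * beta

# The linear predictor is the intercept plus the sum of the contributions, so with a
# monotonic link function, a larger contribution pushes the prediction in one direction.
eta = intercept + contributions.sum()
print(contributions, eta)
```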

cjhutto commented 6 years ago

I can convert the unstandardized ('b') coefficients to standardized beta ('β') coefficients, and report them both. Each type of coefficient has its advantage. The unstandardized b coefficients are useful in that they can be directly interpreted in the native units of each predictor: for each one-unit change in the predictor/independent variable, the response/dependent variable is expected to change by the respective b coefficient (all else being equal). While this is valuable for a broad range of prediction and forecasting purposes, it seems we are also interested in comparing the relative impact of each predictor, so I should report the standardized beta (β) coefficients as well... In simple linear regression, the absolute value of the standardized coefficient equals the correlation coefficient (I'll double check how this carries over to the Gamma family) -- so that should also be fairly intuitive for a knowledgeable end user to understand.
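For illustration, a sketch of the b-to-β conversion; this uses the standard linear-regression scaling, and whether it carries over directly to the Gamma family is the open question noted above:

```python
import numpy as np

def standardize_coefficients(b, X, y):
    """Convert unstandardized b coefficients to standardized betas:
    beta_j = b_j * (std of predictor j) / (std of response)."""
    return b * X.std(axis=0) / y.std()

# Hypothetical example: 3 predictors, 100 observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.gamma(shape=2.0, size=100)
b = np.array([0.5, -1.0, 0.2])
print(standardize_coefficients(b, X, y))
```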

jpfairbanks commented 6 years ago

That is about explaining the model, where you need the coefficients in the native units of the predictors.

We kind of want to do the opposite: explain a particular prediction. "For this x, why is the prediction large?" Do the large/small elements of x .* beta (element-wise product instead of dot product) explain that?

cjhutto commented 6 years ago

Ah - yes, I think I see what you're after. Good thought. I think yes, having a function to determine which factor(s) of a particular sentence (or overall article) had the largest impact on bias score adjustments would be handy for explanatory purposes, and would be computed via element-wise product (i.e., the x_feature_value * b_coefficient). Maybe list the top N (3-5?) contributing features in the results?
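A sketch of what such a helper could look like; the function name, feature names, and coefficient values here are hypothetical, not the repo's actual API:

```python
import numpy as np

def top_contributing_features(feature_values, coefficients, feature_names, n=3):
    """Rank features by the magnitude of their element-wise contribution
    (x_feature_value * b_coefficient) to the bias score."""
    contributions = np.asarray(feature_values) * np.asarray(coefficients)
    order = np.argsort(-np.abs(contributions))[:n]
    return [(feature_names[i], contributions[i]) for i in order]

# Hypothetical usage for one sentence's feature vector:
names = ["subjectivity", "modality", "quote_ratio", "sentiment_polarity"]
print(top_contributing_features([0.6, 0.2, 0.0, -0.4], [1.5, 0.3, 0.8, -0.9], names))
```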

jpfairbanks commented 6 years ago

yeah I think that would be helpful.

cjhutto commented 6 years ago

Still working on this one, but am finding it useful for my investigation into the nature of bias in journalistic vs. op-ed news stories. Will expose these functions in a later release, after retraining the regressions on the new data and then updating the empirical scoring (beta weights) for the new features.

jpfairbanks commented 6 years ago

So from a correctness perspective, is there anything wrong with just multiplying the feature values by the model coefficients? I can code it up if that isn't incorrect mathematically, statistically, or psychologically.

jpfairbanks commented 6 years ago

https://github.com/cjhutto/bsd/blob/a5c72bcf3b0423acfbac034a4a62b739c60363bf/bsdetector/bias.py#L471