ankane / eps

Machine learning for Ruby
MIT License

Possible edge case issue with intercept and RMSE calculated value #14

Closed mb52089 closed 4 years ago

mb52089 commented 4 years ago

Hi Andrew -

I have 1596 models that are all variations of training data derived from a single data set. I'm using Eps Linear Regression with GSL to build the models. The RMSE for all of the models is within my expected range (nearly zero up to 0.398) EXCEPT for 6 of the 21 models from a single user (user is one of the independent variables). These 6 models have ridiculous RMSE values like 4072322534930.

I have attached datasets in YAML format in the zip file below, in case you want to build a model and look at the RMSE values. The "good" example should build a model with an RMSE in my accepted range; the "bad" example will build a model with a ridiculously high RMSE. The target is the first column of data.

Archive.zip
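For anyone reproducing this, here is a minimal sketch of building one of these models with Eps and checking the error. The filename, the `target` column name, and the YAML row layout are assumptions, and the `algorithm: :linear_regression` option reflects recent Eps versions, so the details may need adjusting:

```ruby
require "eps"
require "yaml"

# Hypothetical filename from the attached archive; adjust to the actual file.
# Assumes the YAML loads as an array of row hashes keyed by column name,
# e.g. { "target" => 0.12, "user" => "u1", ... }
rows = YAML.load_file("good.yml")

# Linear regression (Eps uses GSL for fitting when the gsl gem is installed)
model = Eps::Model.new(rows, target: "target", algorithm: :linear_regression)
puts model.summary # coefficient table, including the intercept

# Sanity-check the reported error by computing RMSE over the training rows by hand
actual    = rows.map { |r| r["target"] }
predicted = rows.map { |r| model.predict(r.reject { |k, _| k == "target" }) }
rmse = Math.sqrt(actual.zip(predicted).sum { |a, p| (a - p)**2 } / actual.size.to_f)
puts rmse
```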

I've reviewed the training data set and the raw data from which it was calculated, and there don't appear to be any outliers or red flags. I can't find any meaningful differences between the good and bad YAML files. The other odd thing is that even with the ridiculous RMSE (and intercept) values, the models still predict well.

Any thoughts on why the RMSE is messed up for this "bad" data set?

ankane commented 4 years ago

Hey @mb52089, my guess is it's due to multicollinearity. The article outlines a few ways to remedy it. One is to use a regularized regression like ridge. The GSLR gem supports ridge regression so you could try using that directly. I may add support for it to Eps at some point.
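For reference, a rough sketch of what trying ridge directly with the gslr gem might look like. The class and option names (`GSLR::Ridge`, `alpha:`) are recalled from the gslr README and worth verifying against the gem's docs, and the tiny numeric dataset is made up:

```ruby
require "gslr"

# Made-up numeric features and targets; GSLR works on numeric matrices,
# so categorical predictors (like user) would need to be encoded first.
x = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.1], [4.0, 3.9], [5.0, 5.2]]
y = [1.2, 1.9, 3.0, 4.1, 5.1]

# The L2 penalty stabilizes the coefficients (and intercept) when
# predictors are highly correlated
model = GSLR::Ridge.new(alpha: 0.1)
model.fit(x, y)

p model.intercept
p model.coefficients
p model.predict([[2.5, 2.5]])
```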

mb52089 commented 4 years ago

I just re-created all the models using LightGBM. The RMSE values that were in question are fine with LightGBM.
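For context, switching algorithms in Eps is roughly a one-argument change (same hypothetical filename and column name assumptions as the earlier sketch):

```ruby
require "eps"
require "yaml"

rows = YAML.load_file("bad.yml") # hypothetical filename from the archive

# Same data, gradient-boosted trees instead of linear regression
model = Eps::Model.new(rows, target: "target", algorithm: :lightgbm)
puts model.summary
```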

DISCLAIMER: I'm not a statistician or a real data scientist.

The 1596 models are so closely related, overlap so much, and share the same predictors (since they are all derived from a single aggregate data set, especially the 21 models from the affected user) that I'm having a hard time understanding how multicollinearity could be a problem in just 6 of this user's 21 models without affecting the user's other models or any of the other 1575 models from other users. If there is collinearity, it seems I should see it in all of the similar models that share most of the same underlying data and all of the same predictors.

But the affected models still predict well, even with the wacky RMSE, so I guess I'll just monitor the situation for now.

ankane commented 4 years ago

I don't think it's an issue with Eps, so I'm going to close. FWIW, I'd recommend building a single model rather than individual models per user, so the model can learn from all users (unless a business constraint prohibits it).

mb52089 commented 4 years ago

Thanks, Andrew. The multiple models are a long story... User is a categorical predictor, so we include data for all users in each model, but we also regress against user as an independent variable. Hence the need for multiple models.
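To illustrate the single-pooled-model idea being discussed, here is a minimal sketch that keeps user as a categorical predictor in one Eps model. The column names and synthetic data are made up; Eps treats string columns as categorical (one-hot encoding them for linear regression):

```ruby
require "eps"

# Synthetic pooled dataset: every user's rows in one training set, with the
# user id kept as a categorical feature so per-user effects can still be learned
rows = 200.times.map do |i|
  f1 = rand
  f2 = rand
  { user: "user_#{i % 5}", f1: f1, f2: f2, target: 0.5 * f1 + 0.2 * f2 + (i % 5) * 0.1 }
end

model = Eps::Model.new(rows, target: :target, algorithm: :linear_regression)
puts model.summary

# Predictions still take the user id as an input feature
p model.predict(user: "user_3", f1: 0.4, f2: 0.7)
```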
