ankane / eps

Machine learning for Ruby

erroneous results when using categorical variables with linear regression algorithm #12

Closed mb52089 closed 4 years ago

mb52089 commented 4 years ago

We have a categorical variable for day_of_week as one of 4 independent variables in our model. The LightGBM algorithm works correctly but when I force the model to use the linear regression algorithm, the resultant prediction is incorrect. If I subsequently remove the categorical variable, the linear regression algorithm gives an accurate prediction. Here's an example of what our data set looks like:

{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.714285714285714, :block_minutes=>420.0, :week_day=>"Fri"},
{:day_of_service_util=>0.69047619047619, :day_in_advance_util=>0.214285714285714, :block_minutes=>420.0, :week_day=>"Mon"},
{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.238095238095238, :block_minutes=>420.0, :week_day=>"Mon"},
{:day_of_service_util=>0.80952380952381, :day_in_advance_util=>0.238095238095238, :block_minutes=>420.0, :week_day=>"Mon"}

day_of_service_util is the Target dependent variable.

Thanks for this great gem!

mb52089 commented 4 years ago

correction: 3 independent variables and 1 dependent variable, not 4 independent variables.

ankane commented 4 years ago

Hey @mb52089, thanks for the report. Can you give more details about what you mean by "incorrect" prediction? Is it different from what linear regression gives in another language, or is the error high?

mb52089 commented 4 years ago

Thanks Andrew. We're predicting the % utilization of a resource on the day of service based on the % utilization x days in advance, the duration of the resource in minutes and the day of the week. The predicted value should be between 0 and 1. In the particular test example we're using, the predicted value should be around 66%. We get that value when we use the LightGBM algorithm, but when we use the linear regression we get -1.4, which is a value that doesn't make sense given the context and the training data. However, if I remove the "day of week" categorical variable and re-run the prediction using the linear regression algorithm, I get a prediction in range. I wasn't sure if the gem deals with categorical variables differently in the linear regression than in the LightGBM algorithm. The data set has around 150 rows of independent variables.

mb52089 commented 4 years ago

and this is all done in ruby/rails.

ankane commented 4 years ago

If it's not too sensitive, paste the model summary and PMML here or send it to me over email (on my GitHub profile)?

puts model.summary
puts model.to_pmml

mb52089 commented 4 years ago

I just ran the model summary for the error condition:

Math::DomainError: Numerical argument is out of domain - "sqrt"
from /Users/michaelburke/.rvm/gems/ruby-2.6.5@copient_health_rails6/bundler/gems/eps-509da754d6e9/lib/eps/linear_regression.rb:186:in `sqrt'

mb52089 commented 4 years ago

The model summary after I remove the categorical variable week_day: => "Validation RMSE: 0.14\n\n coef p\n_intercept 0.42 0.094\nday_in_advance_util 0.54 0.000\nblock_minutes -0.00 0.932\n\nadjusted r2: 0.330\n"

mb52089 commented 4 years ago

just sent to your chartkick email. I didn't know you were the author of chartkick. It's great too!

ankane commented 4 years ago

To close the loop: the issue was likely related to multicollinearity, which can produce an unstable solution (the link provides a good explanation). One way to counteract this is to use GSL, which uses a different algorithm to produce a more stable solution.
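
For readers unfamiliar with the term, here is a minimal sketch of one common way a categorical variable can introduce multicollinearity: if every level gets its own indicator column alongside an intercept, the indicator columns sum to the intercept column and the normal-equations matrix X'X becomes singular. The toy data and the `design_matrix` helper are hypothetical, and whether eps encoded `week_day` this way in the reported model is an assumption, not something confirmed in this thread.

```ruby
require "matrix"

# Hypothetical toy data: the week day for six observations.
days = %w[Mon Tue Wed Mon Tue Wed]
levels = days.uniq

# Build a design matrix with an intercept column plus one 0/1 indicator
# column for each level passed in. (Illustrative helper, not an eps API.)
def design_matrix(days, levels)
  Matrix.rows(days.map { |d| [1] + levels.map { |l| d == l ? 1 : 0 } })
end

# Intercept + an indicator for every level: the columns are linearly dependent.
full = design_matrix(days, levels)

# Intercept + indicators for all but the first level: columns are independent.
reduced = design_matrix(days, levels.drop(1))

# Integer entries keep the determinant exact.
full_det = (full.transpose * full).determinant
reduced_det = (reduced.transpose * reduced).determinant

puts "det(X'X) with all levels:        #{full_det}"
puts "det(X'X) with one level dropped: #{reduced_det}"
```

With real floating-point data the determinant is rarely exactly zero; it is merely tiny, so inverting X'X can still "succeed" while producing wildly unstable coefficients.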

ankane commented 4 years ago

Going to reopen this until the model.summary error is fixed. @mb52089, can you paste the output of:

model.send(:diagonal)

for a model where you're seeing Math::DomainError: Numerical argument is out of domain - "sqrt"?

mb52089 commented 4 years ago

when I try to run model.send(:diagonal) I get the following error:

NoMethodError: undefined method `diagonal' for

from /Users/michaelburke/.rvm/gems/ruby-2.6.5@copient_health_rails6/bundler/gems/eps-509da754d6e9/lib/eps/model.rb:62:in `method_missing'

ankane commented 4 years ago

My bad, it should be:

model.instance_variable_get("@estimator").send(:diagonal)

mb52089 commented 4 years ago

Here you go:

=> [0.0005296860721842933, 0.0066308112665816495, 1.3595352803866229e-09, 0.0012121905646438054, 0.0312576935042156, 0.014730636756303176]

ankane commented 4 years ago

Thanks. This is from the model that errors on the summary? I'm unable to reproduce with those numbers.

mb52089 commented 4 years ago

Now that I have installed GSL, I can't seem to reproduce the error when I do the linear regression. Do you want me to uninstall GSL and see if I can reproduce?

ankane commented 4 years ago

Yeah, GSL changes the code path, so you'll want to recreate the initial conditions.

mb52089 commented 4 years ago

Here you go. After removing the gsl gem and re-bundling, I ran @model.instance_variable_get("@estimator").send(:diagonal) from a model that generated the following error when running @model.summary: Math::DomainError: Numerical argument is out of domain - "sqrt". Here's the output:

[-666372359695044.8, 1.0761875986711336, -3777621086.706599, 0.19673979554666882, -339985897803588.5, 2.390797741078714]

ankane commented 4 years ago

Thanks @mb52089, fixed the error message for unstable solutions. Pushing out a new release in a few with all the fixes we discussed. Thanks for the help!
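
As a rough illustration of why the summary failed: the stack trace points at a `sqrt` call, and the diagonal from the failing model contains negative entries. The sketch below assumes the standard errors are computed as the square root of each diagonal entry (the usual definition, but not confirmed here for eps), and shows how a guard can turn the low-level `Math::DomainError` into a clearer message. `UnstableSolutionError` and `standard_errors` are hypothetical names, not part of eps.

```ruby
# Diagonal reported in this thread for the model that failed on summary.
diagonal = [
  -666372359695044.8, 1.0761875986711336, -3777621086.706599,
  0.19673979554666882, -339985897803588.5, 2.390797741078714
]

# Hypothetical error class used only for this illustration.
class UnstableSolutionError < StandardError; end

# Assumed calculation: standard error = sqrt of each diagonal entry.
# A negative entry cannot be a variance, so report it explicitly.
def standard_errors(diagonal)
  if diagonal.any? { |v| v < 0 }
    raise UnstableSolutionError,
          "negative variance estimate; the solution is likely unstable"
  end
  diagonal.map { |v| Math.sqrt(v) }
end

# Without a guard, Ruby raises Math::DomainError on the first negative value.
raw_error =
  begin
    diagonal.map { |v| Math.sqrt(v) }
    nil
  rescue Math::DomainError => e
    e
  end

# With the guard, the caller gets a message that names the likely cause.
guarded_error =
  begin
    standard_errors(diagonal)
    nil
  rescue UnstableSolutionError => e
    e
  end

puts raw_error.class
puts guarded_error.message
```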

mb52089 commented 4 years ago

No problem at all. Thanks for all the great gems!