Support for categorical variables in regression

agisga commented 9 years ago

Categorical (as opposed to numeric) variables are ubiquitous in data analysis and linear regression, but Statsample::Regression does not seem to support them. Here is an example of what I mean:

In R, I can do:

> head(fake.salaries)
      salary years ethnicity
1  5.0823594     9     black
2 -0.4459633     3     black
3 16.0734587     2     white
4 10.5554305     7     other
5  9.9438798     8     other
6  9.6776724     6    latino
> mod <- lm(salary ~ years + ethnicity, fake.salaries)
> summary(mod)

Call:
lm(formula = salary ~ years + ethnicity, data = fake.salaries)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5068 -1.1283 -0.3713  1.1227  3.3027 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.5421     0.9851   1.565    0.131    
years              0.1729     0.1561   1.108    0.279    
ethnicitylatino    6.7300     0.9984   6.741 5.67e-07 ***
ethnicitymexican   5.4826     0.8755   6.262 1.79e-06 ***
ethnicityother     6.6404     0.9034   7.351 1.37e-07 ***
ethnicitywhite    11.5310     0.9309  12.387 6.46e-12 ***

---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.66 on 24 degrees of freedom
Multiple R-squared:  0.8761,    Adjusted R-squared:  0.8503 
F-statistic: 33.95 on 5 and 24 DF,  p-value: 3.942e-10

We see that lm treats the variable "ethnicity" as categorical and fits the model accordingly. The output shows that in this case it takes ethnicity "black" as the base level, and that every other ethnicity has a statistically significant effect on "salary" relative to that base level (all four p-values are below 2e-6).

When I try to analyse the same data in Statsample:

pry(main)> df = Statsample::CSV.read("/home/alexej/Desktop/fake_salaries.csv")
=> #<Statsample::Dataset:69956503513460 @name=Dataset 1 @fields=[salary,years,ethnicity] cases=30
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
NoMethodError: NoMethodError
from /home/alexej/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/statsample-1.5.0/lib/statsample/vector.rb:186:in `_check_type'

So, "NoMethodError". And when I delete "ethnicity", the model can be fit:

pry(main)> df.delete_vector("ethnicity")
=> ["ethnicity"]
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
=> #<Statsample::Regression::Multiple::RubyEngine:0x007f4008733620
> puts mod.summary
= Multiple reggresion of years on salary
  Engine: Statsample::Regression::Multiple::RubyEngine
  Cases(listwise)=30(30)
  R=0.061
  R^2=0.004
  R^2 Adj=-0.032
  Std.Error R=4.358
  Equation=7.046 + 0.125years
  == ANOVA
    ANOVA Table
+------------+---------+----+--------+-------+-------+
|   source   |   ss    | df |   ms   |   f   |   p   |
+------------+---------+----+--------+-------+-------+
| Regression | 1.979   | 1  | 1.979  | 0.104 | 0.749 |
| Error      | 531.824 | 28 | 18.994 |       |       |
| Total      | 533.804 | 29 | 20.973 |       |       |
+------------+---------+----+--------+-------+-------+

  Beta coefficients
+----------+-------+-------+-------+-------+
|  coeff   |   b   | beta  |  se   |   t   |
+----------+-------+-------+-------+-------+
| Constant | 7.046 | -     | 2.233 | 3.155 |
| years    | 0.125 | 0.061 | 0.386 | 0.323 |
+----------+-------+-------+-------+-------+
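Until categorical variables are supported, one stopgap would be to expand the column into 0/1 indicators by hand and fit on numeric columns only. The sketch below shows the idea with Ruby's `matrix` library rather than Statsample, so it makes no assumptions about the Statsample API. The data are made up for illustration (not the fake.salaries set) and constructed so that the true coefficients are known:

```ruby
require "matrix"

# Toy data, made up for illustration: salary = 2 + 1*years + 5*(ethnicity == "white")
years     = [1, 2, 3, 1, 2, 3]
ethnicity = %w[black black black white white white]
salary    = [3.0, 4.0, 5.0, 8.0, 9.0, 10.0]

# Design matrix: intercept, years, and one indicator for "white"
# ("black" is the base level, so it gets no column of its own)
rows = years.zip(ethnicity).map { |y, e| [1.0, y.to_f, e == "white" ? 1.0 : 0.0] }
x = Matrix[*rows]
y = Vector[*salary]

# Ordinary least squares via the normal equations: b = (X'X)^-1 X'y
beta = (x.transpose * x).inverse * (x.transpose * y)
intercept, b_years, b_white = beta.to_a

puts "intercept=#{intercept.round(6)} years=#{b_years.round(6)} white=#{b_white.round(6)}"
```

The fitted `b_white` recovers the gap between the two groups at equal years, which is the same interpretation as the ethnicity rows in the R summary.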

This issue possibly allows for a common solution with https://github.com/SciRuby/statsample-glm/issues/11 and https://github.com/v0dro/daru/issues/9.

dansbits commented 8 years ago

+1 Has there been any progress on this?

v0dro commented 8 years ago

Yes, @lokeshh is working on it as part of his GSoC project.

SciRuby / statsample

Support for categorical variables in regression #38