Categorical (as opposed to numeric) variables are ubiquitous in data analysis and linear regression, but Statsample::Regression does not seem to support them.
Here is an example of what I mean. In R, I can do:
> head(fake.salaries)
      salary years ethnicity
1  5.0823594     9     black
2 -0.4459633     3     black
3 16.0734587     2     white
4 10.5554305     7     other
5  9.9438798     8     other
6  9.6776724     6    latino
> mod <- lm(salary ~ years + ethnicity, fake.salaries)
> summary(mod)
Call:
lm(formula = salary ~ years + ethnicity, data = fake.salaries)
Residuals:
    Min      1Q  Median      3Q     Max
-2.5068 -1.1283 -0.3713  1.1227  3.3027
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.5421     0.9851   1.565    0.131
years              0.1729     0.1561   1.108    0.279
ethnicitylatino    6.7300     0.9984   6.741 5.67e-07 ***
ethnicitymexican   5.4826     0.8755   6.262 1.79e-06 ***
ethnicityother     6.6404     0.9034   7.351 1.37e-07 ***
ethnicitywhite    11.5310     0.9309  12.387 6.46e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.66 on 24 degrees of freedom
Multiple R-squared: 0.8761, Adjusted R-squared: 0.8503
F-statistic: 33.95 on 5 and 24 DF, p-value: 3.942e-10
We see that lm treats the variable "ethnicity" as categorical and fits the model accordingly. The output shows that it takes ethnicity "black" as the base level, and that every other ethnicity has a statistically significant effect on "salary" (all p-values below 2e-6) when compared to the base level.
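For reference, here is a minimal sketch in plain Ruby of the treatment (dummy) coding that lm applies to an unordered factor by default: the levels are sorted, the first one becomes the base level, and every other level gets its own 0/1 indicator column. The helper name `dummy_code` and its `base:` option are hypothetical and are not part of Statsample; the sample values are the six rows shown by head above.

```ruby
# Hypothetical helper (not part of Statsample): expand a categorical column
# into 0/1 indicator columns, omitting the base level (treatment coding).
def dummy_code(values, name, base: nil)
  levels = values.uniq.sort
  base ||= levels.first # like R's default: first level in sorted order
  raise ArgumentError, "unknown base level: #{base}" unless levels.include?(base)

  # One indicator column per non-base level, named like R's coefficients.
  (levels - [base]).each_with_object({}) do |level, columns|
    columns["#{name}#{level}"] = values.map { |v| v == level ? 1 : 0 }
  end
end

# The "ethnicity" values of the first six rows of fake.salaries.
ethnicity = %w[black black white other other latino]
dummies = dummy_code(ethnicity, "ethnicity")
dummies.each { |column, indicator| puts "#{column}: #{indicator.inspect}" }
```

The resulting columns are purely numeric, so in principle they could stand in for the original "ethnicity" column in a regression; I have not tested this against Statsample.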
When I try to analyse the same data in Statsample:
pry(main)> df = Statsample::CSV.read("/home/alexej/Desktop/fake_salaries.csv")
=> #<Statsample::Dataset:69956503513460 @name=Dataset 1 @fields=[salary,years,ethnicity] cases=30
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
NoMethodError: NoMethodError
from /home/alexej/.rbenv/versions/2.2.2/lib/ruby/gems/2.2.0/gems/statsample-1.5.0/lib/statsample/vector.rb:186:in `_check_type'
So Statsample raises a "NoMethodError". When I delete "ethnicity", the model can be fit:
pry(main)> df.delete_vector("ethnicity")
=> ["ethnicity"]
pry(main)> mod = Statsample::Regression.multiple(df, 'salary')
=> #<Statsample::Regression::Multiple::RubyEngine:0x007f4008733620
> puts mod.summary
= Multiple reggresion of years on salary
Engine: Statsample::Regression::Multiple::RubyEngine
Cases(listwise)=30(30)
R=0.061
R^2=0.004
R^2 Adj=-0.032
Std.Error R=4.358
Equation=7.046 + 0.125years
== ANOVA
ANOVA Table
+------------+---------+----+--------+-------+-------+
| source | ss | df | ms | f | p |
+------------+---------+----+--------+-------+-------+
| Regression | 1.979 | 1 | 1.979 | 0.104 | 0.749 |
| Error | 531.824 | 28 | 18.994 | | |
| Total | 533.804 | 29 | 20.973 | | |
+------------+---------+----+--------+-------+-------+
Beta coefficients
+----------+-------+-------+-------+-------+
| coeff | b | beta | se | t |
+----------+-------+-------+-------+-------+
| Constant | 7.046 | - | 2.233 | 3.155 |
| years | 0.125 | 0.061 | 0.386 | 0.323 |
+----------+-------+-------+-------+-------+
This issue possibly allows for a common solution with https://github.com/SciRuby/statsample-glm/issues/11 and https://github.com/v0dro/daru/issues/9.