drrben opened this issue 3 years ago
For correlations, if you want we can also add a corrplot; it makes the comparison a bit easier:
library(corrplot)
withoutSex = new_abalone_train %>% select(!c(Sex))
corrplot(cor(withoutSex), method = "ellipse")
Regarding the two models we wrote, the second one (with multiple weights) seems better, with R2 0.57. If we add polynomial components (I tried up to degree 4), we reach R2 0.62 for both degree 3 and degree 4. Height is significant only at degree 1, and length at degrees 2 and 3. For the weights, all degrees up to 3 are significant. So a model like y = height + length^2 + length^3 + weights + weights^2 + weights^3 (by "weights" I mean all three weights) seems good to me. And the residual curves do not move much between them, so I think they are acceptable.
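In lm() terms that would look something like this (just a sketch; log_rings and the *_wt column names are the ones we use later in the thread, and I() keeps ^ as arithmetic inside the formula):
# Sketch of the proposed model (column names assumed from later in this thread):
mod_poly = lm(log_rings ~ Height +
                I(Length^2) + I(Length^3) +
                Shuck_wt + I(Shuck_wt^2) + I(Shuck_wt^3) +
                Visc_wt + I(Visc_wt^2) + I(Visc_wt^3) +
                Shell_wt + I(Shell_wt^2) + I(Shell_wt^3),
              data = new_abalone_train)
summary(mod_poly)$adj.r.squared  # adjusted R2, for comparing models of different sizes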
> For correlations, if you want we can also add a corrplot […]
I added the corrplot lines
> Regarding the two models we wrote, the second one (with multiple weights) seems better […]
Can you give me the adjusted R2? Because if we add features we must compare the adjusted R2.
> Regarding the two models we wrote, the second one (with multiple weights) seems better […]
> Can you give me the adjusted R2 […]
Yeap, actually all the R2's that I wrote are the adjusted ones.
> Regarding the two models we wrote, the second one (with multiple weights) seems better […]
> Can you give me the adjusted R2 […]
> Yeap, actually all the R2's that I wrote are the adjusted ones.
Ok nice! If that's ok with everyone we can add this model
I just have one thing: are you sure including the same covariate multiple times at different degrees is a good thing?
PS: in question 7 I forgot to change linear_mod_log into linear_mod_log_simple. Anyway, we will rerun everything before submitting.
I mean, it's fine that they are significant at all degrees, but maybe that's just because the covariate is always significant; perhaps we should keep one degree per variable and see which combination gives the best model.
> Are you sure including the same covariate multiple times at different degrees is a good thing?
I think in some cases it is good. Example: y = x1^2 - x1 + 3; there you need both degrees (dropping the linear term would force the parabola to be centered at x1 = 0). But we have to be very cautious with this: to be honest I am not a huge fan of using polynomial families.
Yeah, actually I don't know, because maybe it is not a linear model anymore if we use a polynomial (however, the adjusted R2 is HUGE!). I didn't do better.
Try removing Length and using Diameter; it gives better results for me in just the plain linear model.
And also, does "weights" represent the whole weight? You didn't put the single weights?
> And also, does "weights" represent the whole weight? […]
No, it's actually shell_weight + shell_weight^2 + shell_weight^3 + shuck_weight + shuck_weight^2 + shuck_weight^3. Viscera weight wasn't significant.
> maybe it is not a linear model anymore if we use a polynomial […]
I think it's still a linear model; the important part is to have linear coefficients.
Viscera weight wasn't significant --> agreed
I think it's still a linear model, the important part is to have linear coefficients --> you are right, my bad
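Just to spell it out: even with polynomial terms, log(y) = β0 + β1·x + β2·x^2 + β3·x^3 is still linear in the coefficients β, which is all that lm() and the usual least-squares theory require; the nonlinearity is only in the regressors.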
Does it work for you if you put the exponents directly in the model formula? (It doesn't for me, and I have to modify the dataset.)
Anyway, yeah, add Diameter; it should improve things more than Length. Try using both, but of the two I think Diameter gives a higher adjusted R2.
I am also not very sure, but look, this was the teacher's motivation for using a polynomial structure in the partial correction:
> This model looks valid too but there is still some "quadratic" structure that we did not catch.
> We consider now the quadratic model log(Y) = β0 + β1·Height + β2·Whole.weight + β3·Height^2 + β4·Sex + ξ.
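If we want to reproduce exactly that model, something like this should do it (a sketch; I am assuming the whole-weight column is called Whole_wt in our data frame):
# Teacher's quadratic model from the partial correction (Whole_wt name assumed):
mod_teacher = lm(log_rings ~ Height + Whole_wt + I(Height^2) + Sex, data = new_abalone_train)
summary(mod_teacher)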
@dorukyasa can you send me your code, so I can upload it for the rest of the group?
I think the one with polynomials is the best one, I am trying to improve it
And yes, I think you should modify the dataset; for me it doesn't change anything if I put the exponents just in the formula.
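Small note on why the bare exponents can silently do nothing: inside an R formula, ^ is the crossing operator, so Height^2 on its own just expands back to Height. Wrapping the term in I(), or using poly(), avoids having to modify the dataset. A sketch:
m1 = lm(log_rings ~ Height + I(Height^2), data = new_abalone_train)         # I() protects the arithmetic
m2 = lm(log_rings ~ poly(Height, 2, raw = TRUE), data = new_abalone_train)  # same fit via raw polynomials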
library(car)     # for durbinWatsonTest
library(lmtest)  # for bptest

# Degree-4 polynomials in every candidate regressor, fitted on the standardized data:
linear_mod_log2 = lm(log_rings ~ poly(Height,4) + poly(Shuck_wt,4) + poly(Visc_wt,4) + poly(Shell_wt,4) + poly(Length,4), data=new_abalone_scale)
summary(linear_mod_log2)
plot(linear_mod_log2)                          # residual diagnostic plots
durbinWatsonTest(linear_mod_log2, max.lag=10)  # residual autocorrelation
acf(resid(linear_mod_log2))
bptest(linear_mod_log2)                        # Breusch-Pagan test: homoscedasticity
shapiro.test(resid(linear_mod_log2))           # normality of residuals
I used new_abalone_scale:
# Standardize all numeric/integer columns in place; non-numeric columns (Sex) are untouched:
new_abalone_scale = rapply(new_abalone_train, scale, c("numeric","integer"), how="replace")
what does it do exactly?
It standardizes the data: each numeric column is centered to mean 0 and rescaled to standard deviation 1.
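In other words, per numeric column it is doing this (by-hand sketch for one column):
z = (new_abalone_train$Height - mean(new_abalone_train$Height)) / sd(new_abalone_train$Height)  # what scale() computes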
I don't think the model you sent is the correct one since it has viscera weight
In a second I'll send you the best model I could get, along with the adjusted R2; also the postulates are EXTREMELY well met (only autocorrelation has a problem, I think).
> I don't think the model you sent is the correct one since it has viscera weight
Well, it's from that code's summary that I see the less significant variables; so when creating the final equation we remove the non-significant ones.
# Non-significant regressors removed; polynomials reduced to degree 3:
linear_mod_log2 = lm(log_rings ~ Height + poly(Shuck_wt,3) + poly(Shell_wt,3) + poly(Diameter,3), data=new_abalone_scale)
summary(linear_mod_log2)
plot(linear_mod_log2)
durbinWatsonTest(linear_mod_log2, max.lag=10)
acf(resid(linear_mod_log2))
bptest(linear_mod_log2)
shapiro.test(resid(linear_mod_log2))
# Standardize, then add the polynomial columns by hand:
new_abalone_scale = rapply(new_abalone_train, scale, c("numeric","integer"), how="replace")
new_abalone_scale$Shell_wt2 = new_abalone_scale$Shell_wt^2
new_abalone_scale$Shell_wt3 = new_abalone_scale$Shell_wt^3
new_abalone_scale$Shuck_wt2 = new_abalone_scale$Shuck_wt^2
new_abalone_scale$Diameter2 = new_abalone_scale$Diameter^2
new_abalone_scale$Diameter3 = new_abalone_scale$Diameter^3
linear_mod_log = lm(log_rings ~ Height + Shell_wt + Shell_wt2 + Shell_wt3 + Shuck_wt + Shuck_wt2 + Diameter2 + Diameter3 + Length, data=new_abalone_scale)
summary(linear_mod_log)
plot(linear_mod_log)
durbinWatsonTest(linear_mod_log, max.lag=10)
acf(resid(linear_mod_log))
bptest(linear_mod_log)
shapiro.test(resid(linear_mod_log))
ggplot(new_abalone_train, aes(x=Height, y=log_rings)) + geom_point(shape=1) + geom_smooth(method=lm)
This is the best model I got with all the combinations of polynomials up to degree 3:
Adjusted R-squared: 0.6203
If we keep Length, then the square of the diameter loses a little significance, but its p-value is exactly 0.05, so I decided to keep it to increase the R2 (which decreases if you remove it).
You can check from the diagnostics that the postulates are met very well.
I repeat, it's just the autocorrelation; but our previous model had this problem too: the p-value of the test is always zero, and there is always one bar that crosses the threshold in the acf plot.
I don't think it's bad; the threshold is somewhat arbitrary and we don't cross it by much.
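For what it's worth, the dashed band that acf() draws is roughly ±1.96/√n, so with a few thousand training rows (if that is our training size) a bar just beyond it corresponds to a very small autocorrelation; that band is exactly the arbitrary threshold we are talking about.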
linear_mod_log2 = lm(log_rings ~ Height + poly(Shuck_wt,3) + poly(Shell_wt,3) + poly(Diameter,3) + Length, data=new_abalone_scale)
summary(linear_mod_log2)
plot(linear_mod_log2)
durbinWatsonTest(linear_mod_log2, max.lag=10)
acf(resid(linear_mod_log2))
bptest(linear_mod_log2)
shapiro.test(resid(linear_mod_log2))
This is a modification of Benjamin's model with Length added: Adjusted R-squared: 0.621
I think we can put more models in the final file to show that we started with one and then tried to improve it
Yes I think that is what he wants
"Justify your choices (keep in mind that we want a practical method to predict number of rings)." what do you think of this sentence?
"Justify your choices (keep in mind that we want a practical method to predict number of rings)." what do you think of this sentence?
So we start by using only the whole weight (which is the total of the other three weights) and only the length, not the diameter, since they are so strongly correlated. I think that is the practical, intuitive start.
Then we try the individual weight components to see whether a particular one explains the rings better, then try polynomial components and see that they improve the R2, etc.
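Concretely, the practical starting model would be something like this sketch (assuming the whole-weight column is Whole_wt):
# Practical baseline: whole weight + length only.
mod_start = lm(log_rings ~ Whole_wt + Length, data = new_abalone_train)
summary(mod_start)$adj.r.squared  # the baseline to beat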
Yeah, exactly. Things such as "we decided to keep this regressor, even if not as significant as the others according to the p-value, because it increased the adjusted R2", and so on; we can put that in the comments after we present all the models up to the last one.
We are going to meet today to finalize the file, right?
Since I guess we just have to put all the models in, comment them, and then work on the other questions, which really take 5 minutes, besides question 12 (which I think will take a bit longer).
> We are going to meet today to finalize the file, right?
All the polytechniciens have a military ceremony from 19:30 to 22:00, so I won't be there at that time... :/ (we can meet before or after, as you prefer)
For me before is OK, like 17-18, if that works for you.
I'm trying to explain the process: how did you get rid of Length or Visc_wt?
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.135e-15 1.205e-02 0.000 1.0000
Length 1.638e-01 7.624e-02 2.148 0.0318 *
Diameter 4.019e-01 7.790e-02 5.160 2.64e-07 ***
Height 3.573e-01 3.094e-02 11.550 < 2e-16 ***
Shuck_wt -7.058e-01 3.613e-02 -19.538 < 2e-16 ***
Visc_wt -7.955e-02 3.965e-02 -2.006 0.0449 *
Shell_wt 5.293e-01 3.476e-02 15.229 < 2e-16 ***
What do you mean? You just delete it from the formula of the model.
If you mean why: it's because it has a high p-value in the initial model / after transforming the variables.
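By the way, a tidy way to drop a term without retyping the whole formula is update() (a sketch; linear_mod_full here is a hypothetical name for the fit whose summary is above, the one that still contains Visc_wt):
# Refit the same model minus Visc_wt:
linear_mod_reduced = update(linear_mod_full, . ~ . - Visc_wt)
summary(linear_mod_reduced)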
Are you creating a file uniting everything?
[x] Look at the scatterplot of the data (GGally::ggpairs).
[x] Look for correlations between predictors.
[x] Select some additional variables to add to the simple linear model of Part II in order to better predict number of rings. Justify your choices (keep in mind that we want a practical method to predict number of rings).
[x] Perform a multiple linear regression.
[x] Check the validity of the model. If validity conditions are not met, transform some variables, add/delete some variables and recheck until you find an acceptable model.