drrben opened this issue 3 years ago
For correlations, if you want we can also add a corrplot; it makes the comparison a bit easier:
library(corrplot)
withoutSex = new_abalone_train %>% select(!c(Sex))
corrplot(cor(withoutSex), method = "ellipse")
Regarding the two models we wrote, the second one (with multiple weights) seems better, with R2 0.57. If we add polynomial components (I tried up to degree 4), we reach R2 0.62 for both degree 3 and degree 4. Height is significant only at degree 1, and length at degrees 2 and 3. For the weights, all degrees up to 3 are significant. So a model like y = height + length^2 + length^3 + weights + weights^2 + weights^3 (by "weights" I mean all three weights) seems good to me. And the residual curves do not move much between them, so I think they are acceptable.
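In lm() terms that would look something like this (just a sketch; log_rings and the *_wt column names are the ones we use later in the thread, and I() keeps ^ as arithmetic inside the formula):
# Sketch of the proposed model (column names assumed from later in this thread):
mod_poly = lm(log_rings ~ Height +
                I(Length^2) + I(Length^3) +
                Shuck_wt + I(Shuck_wt^2) + I(Shuck_wt^3) +
                Visc_wt + I(Visc_wt^2) + I(Visc_wt^3) +
                Shell_wt + I(Shell_wt^2) + I(Shell_wt^3),
              data = new_abalone_train)
summary(mod_poly)$adj.r.squared  # adjusted R2, for comparing models of different sizes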
> For correlations, if you want we can also add a corrplot […]
I added the corrplot lines
> Regarding the two models we wrote, the second one (with multiple weights) seems better […]
Can you give me the adjusted R2? Because if we add features we must compare the adjusted R2.
> Regarding the two models we wrote, the second one (with multiple weights) seems better […]
> Can you give me the adjusted R2 […]
Yeap, actually all the R2's that I wrote are the adjusted ones.
> Regarding the two models we wrote, the second one (with multiple weights) seems better […]
> Can you give me the adjusted R2 […]
> Yeap, actually all the R2's that I wrote are the adjusted ones.
Ok nice! If that's ok with everyone we can add this model
I just have one thing: are you sure including the same covariate multiple times at different degrees is a good thing?
PS: in question 7 I forgot to change linear_mod_log into linear_mod_log_simple. Anyway, we will rerun everything before submitting.
I mean, it's fine that they are significant at all degrees, but maybe that's just because the covariate is always significant; perhaps we should keep one degree per variable and see which combination gives the best model.
> Are you sure including the same covariate multiple times at different degrees is a good thing?
I think in some cases it is good. Example: y = x1^2 - x1 + 3; there you need both degrees (dropping the linear term would force the parabola to be centered at x1 = 0). But we have to be very cautious with this: to be honest I am not a huge fan of using polynomial families.
Yeah, actually I don't know, because maybe it is not a linear model anymore if we use a polynomial (however, the adjusted R2 is HUGE!). I didn't do better.
Try removing Length and using Diameter; it gives better results for me in just the plain linear model.
And also, does "weights" represent the whole weight? You didn't put the single weights?
> And also, does "weights" represent the whole weight? […]
No, it's actually shell_weight + shell_weight^2 + shell_weight^3 + shuck_weight + shuck_weight^2 + shuck_weight^3. Viscera weight wasn't significant.
> maybe it is not a linear model anymore if we use a polynomial […]
I think it's still a linear model; the important part is to have linear coefficients.
Viscera weight wasn't significant --> agreed
I think it's still a linear model, the important part is to have linear coefficients --> you are right, my bad
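Just to spell it out: even with polynomial terms, log(y) = β0 + β1·x + β2·x^2 + β3·x^3 is still linear in the coefficients β, which is all that lm() and the usual least-squares theory require; the nonlinearity is only in the regressors.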
Does it work for you if you put the exponents directly in the model formula? (It doesn't for me, and I have to modify the dataset.)
Anyway, yeah, add Diameter; it should improve things more than Length. Try using both, but of the two I think Diameter gives a higher adjusted R2.
I am also not very sure, but look, this was the teacher's motivation for using a polynomial structure in the partial correction:
> This model looks valid too but there is still some "quadratic" structure that we did not catch.
> We consider now the quadratic model log(Y) = β0 + β1·Height + β2·Whole.weight + β3·Height^2 + β4·Sex + ξ.
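If we want to reproduce exactly that model, something like this should do it (a sketch; I am assuming the whole-weight column is called Whole_wt in our data frame):
# Teacher's quadratic model from the partial correction (Whole_wt name assumed):
mod_teacher = lm(log_rings ~ Height + Whole_wt + I(Height^2) + Sex, data = new_abalone_train)
summary(mod_teacher)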
@dorukyasa can you send me your code, so I can upload it for the rest of the group?
I think the one with polynomials is the best one, I am trying to improve it
And yes, I think you should modify the dataset; for me it doesn't change anything if I put the exponents just in the formula.
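Small note on why the bare exponents can silently do nothing: inside an R formula, ^ is the crossing operator, so Height^2 on its own just expands back to Height. Wrapping the term in I(), or using poly(), avoids having to modify the dataset. A sketch:
m1 = lm(log_rings ~ Height + I(Height^2), data = new_abalone_train)         # I() protects the arithmetic
m2 = lm(log_rings ~ poly(Height, 2, raw = TRUE), data = new_abalone_train)  # same fit via raw polynomials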
library(car)     # for durbinWatsonTest
library(lmtest)  # for bptest

# Degree-4 polynomials in every candidate regressor, fitted on the standardized data:
linear_mod_log2 = lm(log_rings ~ poly(Height,4) + poly(Shuck_wt,4) + poly(Visc_wt,4) + poly(Shell_wt,4) + poly(Length,4), data=new_abalone_scale)
summary(linear_mod_log2)
plot(linear_mod_log2)                          # residual diagnostic plots
durbinWatsonTest(linear_mod_log2, max.lag=10)  # residual autocorrelation
acf(resid(linear_mod_log2))
bptest(linear_mod_log2)                        # Breusch-Pagan test: homoscedasticity
shapiro.test(resid(linear_mod_log2))           # normality of residuals
I used new_abalone_scale:
# Standardize all numeric/integer columns in place; non-numeric columns (Sex) are untouched:
new_abalone_scale = rapply(new_abalone_train, scale, c("numeric","integer"), how="replace")
what does it do exactly?
It standardizes the data: each numeric column is centered to mean 0 and rescaled to standard deviation 1.
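In other words, per numeric column it is doing this (by-hand sketch for one column):
z = (new_abalone_train$Height - mean(new_abalone_train$Height)) / sd(new_abalone_train$Height)  # what scale() computes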
I don't think the model you sent is the correct one since it has viscera weight
In a second I'll send you the best model I could get, along with the adjusted R2; also the postulates are EXTREMELY well met (only autocorrelation has a problem, I think).
> I don't think the model you sent is the correct one since it has viscera weight
Well, it's from that code's summary that I see the less significant variables; so when creating the final equation we remove the non-significant ones.
# Non-significant regressors removed; polynomials reduced to degree 3:
linear_mod_log2 = lm(log_rings ~ Height + poly(Shuck_wt,3) + poly(Shell_wt,3) + poly(Diameter,3), data=new_abalone_scale)
summary(linear_mod_log2)
plot(linear_mod_log2)
durbinWatsonTest(linear_mod_log2, max.lag=10)
acf(resid(linear_mod_log2))
bptest(linear_mod_log2)
shapiro.test(resid(linear_mod_log2))
# Standardize, then add the polynomial columns by hand:
new_abalone_scale = rapply(new_abalone_train, scale, c("numeric","integer"), how="replace")
new_abalone_scale$Shell_wt2 = new_abalone_scale$Shell_wt^2
new_abalone_scale$Shell_wt3 = new_abalone_scale$Shell_wt^3
new_abalone_scale$Shuck_wt2 = new_abalone_scale$Shuck_wt^2
new_abalone_scale$Diameter2 = new_abalone_scale$Diameter^2
new_abalone_scale$Diameter3 = new_abalone_scale$Diameter^3
linear_mod_log = lm(log_rings ~ Height + Shell_wt + Shell_wt2 + Shell_wt3 + Shuck_wt + Shuck_wt2 + Diameter2 + Diameter3 + Length, data=new_abalone_scale)
summary(linear_mod_log)
plot(linear_mod_log)
durbinWatsonTest(linear_mod_log, max.lag=10)
acf(resid(linear_mod_log))
bptest(linear_mod_log)
shapiro.test(resid(linear_mod_log))
ggplot(new_abalone_train, aes(x=Height, y=log_rings)) + geom_point(shape=1) + geom_smooth(method=lm)
This is the best model I got with all the combinations of polynomials up to degree 3:
Adjusted R-squared: 0.6203
If we keep Length, then the square of the diameter loses a little significance, but its p-value is exactly 0.05, so I decided to keep it to increase the R2 (which decreases if you remove it).
You can check from the diagnostics that the postulates are met very well.
I repeat, it's just the autocorrelation; but our previous model had this problem too: the p-value of the test is always zero, and there is always one bar that crosses the threshold in the acf plot.
I don't think it's bad; the threshold is somewhat arbitrary and we don't cross it by much.
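For what it's worth, the dashed band that acf() draws is roughly ±1.96/√n, so with a few thousand training rows (if that is our training size) a bar just beyond it corresponds to a very small autocorrelation; that band is exactly the arbitrary threshold we are talking about.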
linear_mod_log2 = lm(log_rings ~ Height + poly(Shuck_wt,3) + poly(Shell_wt,3) + poly(Diameter,3) + Length, data=new_abalone_scale)
summary(linear_mod_log2)
plot(linear_mod_log2)
durbinWatsonTest(linear_mod_log2, max.lag=10)
acf(resid(linear_mod_log2))
bptest(linear_mod_log2)
shapiro.test(resid(linear_mod_log2))
This is a modification of Benjamin's model with Length added: Adjusted R-squared: 0.621
I think we can put more models in the final file to show that we started with one and then tried to improve it
Yes I think that is what he wants
"Justify your choices (keep in mind that we want a practical method to predict number of rings)." what do you think of this sentence?
"Justify your choices (keep in mind that we want a practical method to predict number of rings)." what do you think of this sentence?
So we start by using only the whole weight (which is the total of the other three weights) and only the length, not the diameter, since they are so strongly correlated. I think that is the practical, intuitive start.
Then we try the individual weight components to see whether a particular one explains the rings better, then try polynomial components and see that they improve the R2, etc.
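Concretely, the practical starting model would be something like this sketch (assuming the whole-weight column is Whole_wt):
# Practical baseline: whole weight + length only.
mod_start = lm(log_rings ~ Whole_wt + Length, data = new_abalone_train)
summary(mod_start)$adj.r.squared  # the baseline to beat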
Yeah, exactly. Things such as "we decided to keep this regressor, even if not as significant as the others according to the p-value, because it increased the adjusted R2", and so on; we can put that in the comments after we present all the models up to the last one.
We are going to meet today to finalize the file, right?
Since I guess we just have to put all the models in, comment them, and then work on the other questions, which really take 5 minutes, besides question 12 (which I think will take a bit longer).
> We are going to meet today to finalize the file, right?
All the polytechniciens have a military ceremony from 19:30 to 22:00, so I won't be there at that time... :/ (we can meet before or after, as you prefer)
For me before is OK, like 17-18, if that works for you.
I'm trying to explain the process: how did you get rid of Length or Visc_wt?
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.135e-15 1.205e-02 0.000 1.0000
Length 1.638e-01 7.624e-02 2.148 0.0318 *
Diameter 4.019e-01 7.790e-02 5.160 2.64e-07 ***
Height 3.573e-01 3.094e-02 11.550 < 2e-16 ***
Shuck_wt -7.058e-01 3.613e-02 -19.538 < 2e-16 ***
Visc_wt -7.955e-02 3.965e-02 -2.006 0.0449 *
Shell_wt 5.293e-01 3.476e-02 15.229 < 2e-16 ***
What do you mean? You just delete it from the formula of the model.
If you mean why: it's because it has a high p-value in the initial model / after transforming the variables.
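By the way, a tidy way to drop a term without retyping the whole formula is update() (a sketch; linear_mod_full here is a hypothetical name for the fit whose summary is above, the one that still contains Visc_wt):
# Refit the same model minus Visc_wt:
linear_mod_reduced = update(linear_mod_full, . ~ . - Visc_wt)
summary(linear_mod_reduced)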
Are you creating a file uniting everything?
[x] Look at the scatterplot of the data (GGally::ggpairs).
[x] Look for correlations between predictors.
[x] Select some additional variables to add to the simple linear model of Part II in order to better predict number of rings. Justify your choices (keep in mind that we want a practical method to predict number of rings).
[x] Perform a multiple linear regression.
[x] Check the validity of the model. If validity conditions are not met, transform some variables, add/delete some variables and recheck until you find an acceptable model.