CODAIT / r4ml

Scalable R for Machine Learning
Apache License 2.0
42 stars 13 forks source link

[R4ML-196] fixing step.lm when intercept = TRUE #22

Closed iyounus closed 7 years ago

iyounus commented 7 years ago

fixing the issue where the steplm breaks when intercept is included in the model

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.
R4ML-CI commented 7 years ago

Build triggered.

bdwyer2 commented 7 years ago

Maybe we should create a test case to check step.lm with intercept = TRUE?

R4ML-CI commented 7 years ago

Build success. All unit tests passed.

R4ML-CI commented 7 years ago

Build triggered.

R4ML-CI commented 7 years ago

Build success. All unit tests passed.

R4ML-CI commented 7 years ago

Build triggered.

R4ML-CI commented 7 years ago

Build success. All unit tests passed.

R4ML-CI commented 7 years ago

Build triggered.

R4ML-CI commented 7 years ago

Build success. All unit tests passed.

aloknsingh commented 7 years ago

@iyounus

As we talked, I think the code until first 3 commits are good. But for the last commit we have to change it. The reason being

1) we can't remove the case where intercept == 2, without understanding the code which we are removing. 2) seems like R doesn't have function to do the intercept ==2 case. 3) I understand that for case intercept=1 and intercept=0, the output are fine but I feel that commit 4 is introducing some changes which need not be there. 4) I Think the right approach would be to have until your 3 commits and then modify the code to reorder. 5) until commit 3, everything was working and I think those reorder could be done in the r4ml or in dml. If the dml code doesn't work, we can partly code in r4ml.

aloknsingh commented 7 years ago

@iyounus I think intecept=0 and intercept=1 ( I verified extensively) look ok and we are fine. but since for intercept=2 , we can't have the same betas but yes prediction has to be same. Anyways in the interest of this release. can you just disable the intercept =2 for now in the step.lm R code and when user gives shiftAndRescale option , you can return with the message that shiftAndRescale use the ml.preprocess and currently not supported. I think this will make this PR ready for this cut.

thanks for the detailed work.

iyounus commented 7 years ago

For intercept == 2, the shift and scale is done between lines 270-276. But once the betas are obtained, these need to adjusted for shifting/scaling. And that correction is being done in lines 298-299. So, from users perspective, the final betas produced by this function will be the same whether intercept is 2 or not.

Once the correct betas are calculated (on line 301 or 299), no further calculations are to done on betas. The only thing we need to do it to reorder. This reordering is done by the function reorder_matrix.

So, the code that I've removed was just doing some reordering so its no longer necessary.

I've tested this PR with intercept = 0,1,2 and all these cases give me correct results which match with R.

iyounus commented 7 years ago

We can just remove the parameter shiftAndScale from the function. In my opinion, this should not even be there. I understand this is some baggage from bigR, but I don't think we need to maintain this bad design forever.

aloknsingh commented 7 years ago

ideally yes we shouldn't have scale and shift and user can use preprocesser.

but biginsight is our biggest customer so far so lets have this for api

in ref to the code for intercept=2 1) I think i see this way that pred would be same but if we have to model y1==y2 bscale.dot.Xscale==bunschale.dot.Xunscale. we can't have bscale and bunscale to be same. 2) Also note that in the org dml the output spec clear stats that for intercept==2, we get this output

` icpt=2: ncol(X)+1 x 2 Col.1: betas for X & intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1] Col.2: betas for shifted/rescaled X and intercept

` and we are clearly not producing that format and also note col1 will be different than col2 (and can be verified running older script)

3)all we are doing is just doc, that intercept is not supported for now and small note in the dml src doc, will give reader/dev/engg to go for the next steps or other things in future. 4) we may or may not want to reword the following to have more beautification

"# NOTE: we have not accounted for intercept=2 and the code before was not matching so we have removed it for now. pl see the git revision history and diff to see the changes."

5) on the side note I think it is very easy to criticize on the bad design but there are reasons for it. and honestly I feel having the more options than R doesn't really quality it as a bad design.

iyounus commented 7 years ago

If we want to return both scaled and unscaled betas in one matrix, its very straight forward thing to do. I've no objection in providing that. But, do we handle that on the r4ml side? Does the r4ml coef function provide both coefficients?

aloknsingh commented 7 years ago

well seems like it is not straightforward as from the older code and the logic there. but yes from scratch it could be good and easier, but we have to verify it why we are removing the code (if different) and if different how do we verify with baseline. I think the best options (since monday noon is release cut date) would be to disable intercept=2 and fix it in future release for r4ml

aloknsingh commented 7 years ago

I will create one more PR on the top of this to disable intercept=2 with the proper messages.Also will add the comments in the dml as discussed in the previous comments.

frreiss commented 7 years ago

I don't see any problems with the logic here just looking at the code. @iyounus , can you please do some testing at larger scales to verify that these changes haven't broken the script in a way that doesn't show up in your small-scale testing so far? Pay special attention to whether the new implementation holds up under larger numbers of features.

Also, the script will have diverged significantly from the one in the SystemML algorithms folder by the time this PR settles. Can you please open up a JIRA to cover merging these changes back into the script that ships with SystemML?

aloknsingh commented 7 years ago

@frreiss @iyounus , the logic is fine for case intercept=0, intercept=1. I have verified it. However for interept==2 , it is not producing the output as planned. However, we can't just remove the case for intercept=2 and if we are removing we have let user know. So in short, It is just that we are disabling the case of intercept=2 and I have commented that we should make a note of it in the src code . I have also created a PR on the top of it at https://github.com/SparkTC/r4ml/pull/28 to account for r4ml disabling and dml comments

Also note that the current systemML dml has other issues also wrt the changes that we did earlier apart from these changes