Closed iyounus closed 7 years ago
Build triggered.
Maybe we should create a test case to check step.lm with intercept = TRUE?
Build success. All unit tests passed.
Build triggered.
Build success. All unit tests passed.
Build triggered.
Build success. All unit tests passed.
Build triggered.
Build success. All unit tests passed.
@iyounus
As we talked, I think the code until first 3 commits are good. But for the last commit we have to change it. The reason being
1) we can't remove the case where intercept == 2, without understanding the code which we are removing. 2) seems like R doesn't have function to do the intercept ==2 case. 3) I understand that for case intercept=1 and intercept=0, the output are fine but I feel that commit 4 is introducing some changes which need not be there. 4) I Think the right approach would be to have until your 3 commits and then modify the code to reorder. 5) until commit 3, everything was working and I think those reorder could be done in the r4ml or in dml. If the dml code doesn't work, we can partly code in r4ml.
@iyounus I think intecept=0 and intercept=1 ( I verified extensively) look ok and we are fine. but since for intercept=2 , we can't have the same betas but yes prediction has to be same. Anyways in the interest of this release. can you just disable the intercept =2 for now in the step.lm R code and when user gives shiftAndRescale option , you can return with the message that shiftAndRescale use the ml.preprocess and currently not supported. I think this will make this PR ready for this cut.
thanks for the detailed work.
For intercept == 2, the shift and scale is done between lines 270-276. But once the betas are obtained, these need to adjusted for shifting/scaling. And that correction is being done in lines 298-299. So, from users perspective, the final betas produced by this function will be the same whether intercept is 2 or not.
Once the correct betas are calculated (on line 301 or 299), no further calculations are to done on betas. The only thing we need to do it to reorder. This reordering is done by the function reorder_matrix.
So, the code that I've removed was just doing some reordering so its no longer necessary.
I've tested this PR with intercept = 0,1,2 and all these cases give me correct results which match with R.
We can just remove the parameter shiftAndScale from the function. In my opinion, this should not even be there. I understand this is some baggage from bigR, but I don't think we need to maintain this bad design forever.
ideally yes we shouldn't have scale and shift and user can use preprocesser.
but biginsight is our biggest customer so far so lets have this for api
in ref to the code for intercept=2 1) I think i see this way that pred would be same but if we have to model y1==y2 bscale.dot.Xscale==bunschale.dot.Xunscale. we can't have bscale and bunscale to be same. 2) Also note that in the org dml the output spec clear stats that for intercept==2, we get this output
` icpt=2: ncol(X)+1 x 2 Col.1: betas for X & intercept Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1] Col.2: betas for shifted/rescaled X and intercept
` and we are clearly not producing that format and also note col1 will be different than col2 (and can be verified running older script)
3)all we are doing is just doc, that intercept is not supported for now and small note in the dml src doc, will give reader/dev/engg to go for the next steps or other things in future. 4) we may or may not want to reword the following to have more beautification
"# NOTE: we have not accounted for intercept=2 and the code before was not matching so we have removed it for now. pl see the git revision history and diff to see the changes."
5) on the side note I think it is very easy to criticize on the bad design but there are reasons for it. and honestly I feel having the more options than R doesn't really quality it as a bad design.
If we want to return both scaled and unscaled betas in one matrix, its very straight forward thing to do. I've no objection in providing that. But, do we handle that on the r4ml side? Does the r4ml coef function provide both coefficients?
well seems like it is not straightforward as from the older code and the logic there. but yes from scratch it could be good and easier, but we have to verify it why we are removing the code (if different) and if different how do we verify with baseline. I think the best options (since monday noon is release cut date) would be to disable intercept=2 and fix it in future release for r4ml
I will create one more PR on the top of this to disable intercept=2 with the proper messages.Also will add the comments in the dml as discussed in the previous comments.
I don't see any problems with the logic here just looking at the code. @iyounus , can you please do some testing at larger scales to verify that these changes haven't broken the script in a way that doesn't show up in your small-scale testing so far? Pay special attention to whether the new implementation holds up under larger numbers of features.
Also, the script will have diverged significantly from the one in the SystemML algorithms folder by the time this PR settles. Can you please open up a JIRA to cover merging these changes back into the script that ships with SystemML?
@frreiss @iyounus , the logic is fine for case intercept=0, intercept=1. I have verified it. However for interept==2 , it is not producing the output as planned. However, we can't just remove the case for intercept=2 and if we are removing we have let user know. So in short, It is just that we are disabling the case of intercept=2 and I have commented that we should make a note of it in the src code . I have also created a PR on the top of it at https://github.com/SparkTC/r4ml/pull/28 to account for r4ml disabling and dml comments
Also note that the current systemML dml has other issues also wrt the changes that we did earlier apart from these changes
fixing the issue where the steplm breaks when intercept is included in the model
Developer's Certificate of Origin 1.1