anujkhare / iregnet

7 stars 12 forks source link

cv.iregnet #54

Closed tdhock closed 5 years ago

tdhock commented 7 years ago

this PR will implement the cv.iregnet function

To implement that function, I want to be able to K-fold cross-validation to estimate the best lambda parameter for prediction (as in glmnet).

To do that, I first use iregnet on the entire data set, to get the lambda grid. After that, I would like to specify that same lambda grid for each of the calls to iregnet for the K train sets.

However, the unreg_sol argument seems to be ignored when lambda is manually specified, which results in a model fit object with a lambda grid that contains 0 (which is not the same as the manually specified lambda grid).

I added a few tests that fail on my machine. @anujkhare any ideas about how to fix?

anujkhare commented 7 years ago

Hey @tdhock, the lambda values differ because of a shift-by-1 bug that crept in when I removed the initial (lambda=1e35) solution from the returned list. The lambda=0 is also because of the same.

The unreg_sol parameter is only used when we calculate lambda grid, which is not the case if we supply lambda. I have updated the documentation to reflect this.

A test now fails only one model should have L1 arclength 0. This is because the lambda is not calculated anymore, and so this guarantee doesn't hold.

anujkhare commented 7 years ago

For some reason some validation tests are also failing on Travis, but not on my machine (even on fresh checkout). I will check that later today.

tdhock commented 7 years ago

great, thanks for the fix.

yes I noticed about those validation tests failing on travis, I'm not sure why either. they are all passing on my machine. maybe some temporary problem with travis?

anujkhare commented 7 years ago

I tested iregnet on a fresh docker image. The errors do not appear there. Must be a problem with Travis.

tdhock commented 7 years ago

wow, you made a docker image for testing! great!

I googled the error message we are getting on Travis and I got http://en.cppreference.com/w/cpp/string/basic_string/resize which makes me think it has something to do with Rcpp -- @eddelbuettel have you ever seen this error before?

tdhock commented 7 years ago

to clarify, we are using testthat::expect_error and instead of the error message we expect, we are getting something about basic_string::resize which I suspect has something to do with Rcpp...

1. Failure: Output y is validated properly (@test_validation.R#13) -------------
error$message does not match "Invalid interval*".
Actual value: "basic_string::resize"

2. Failure: Output y is validated properly (@test_validation.R#14) -------------
error$message does not match "Invalid interval*".
Actual value: "basic_string::resize"

3. Failure: Fit parameters are validated properly (@test_validation.R#35) ------
error$message does not match "lambdas must be positive and decreasing.".
Actual value: "basic_string::resize"

4. Failure: Fit parameters are validated properly (@test_validation.R#36) ------
error$message does not match "lambdas must be positive and decreasing.".
Actual value: "basic_string::resize"
eddelbuettel commented 7 years ago

@tdhock Yup. Turns out @jimhester's cleanup to exceptions works on the exception messages we tested (with the compilers and proper std::exceptions code) but not on the combination of older compilers (as used at Travis) AND using Rcpp::stop().

This is now fixed, thanks to a quick and very nice cleanup by @jimhester -- but only in our github master.

I will probably cut an interim release you can then get from this drat repo for Rcpp but it may take me a day or two to do so.

tdhock commented 7 years ago

Thanks for the update @eddelbuettel

So @anujkhare we are safe to ignore these errors.

eddelbuettel commented 7 years ago

It is our bug, but until a new version comes out (or you point Travis to newer code) the best you can on these older system is to use, say, throw std::range_error("Some Msg") instead of Rcpp::stop("Some Msg"). This works for me on 14.04:

> Rcpp::cppFunction('bool foo() { throw std::range_error("Ooops"); return false; }')
> foo()
Error in foo() : Ooops 
> Rcpp::cppFunction('bool bar() { stop("Will not work"); return false; }')
> bar()
Error in bar() : basic_string::resize
> 

Sorry about the bug. It slipped our tests. We did fix Rcpp::stop() for newer compilers though :-/

jimhester commented 7 years ago

Until the new Rcpp is released one thing you can do is add a new remote to your package DESCRIPTION, which will automatically install the devel Rcpp (with the fix) on travis and when using devtools install functions.

Remotes:
  RcppCore/Rcpp
eddelbuettel commented 7 years ago

What Jim says is absolutely true and may help you -- but I personally very strongly recommend against use of topline GH repos.

There is a reason we have a release process. It usually catches these things. By taking what are random snapshots you are increasing and not decreasing the chance of getting bitten in the rear. I prefer releases, and I will make an interim release available as I said earlier. Of course, YMMV here. It is nice to have choices. I generally prefer not to play Russian roulette with a random revolver I do not control.

Again, sorry about the bug.

tdhock commented 7 years ago

TODO add print and predict methods for cv.iregnet

tdhock commented 7 years ago

Hey @anujkhare there are some real data sets where all of the labels in the train set are open intervals, either (-Inf, a) or (b, Inf) for some finite real numbers a,b. This should not be a problem (the optimization problem is well-defined), but iregnet is giving me a "NANs produced" error somewhere in the C++ code. I suppose that would be a result of dividing by zero --- do you have any idea where to look to fix this?

tdhock commented 7 years ago

Hey @anujkhare I'm back to work after the winter holidays, and I have started working on a tutorial proposal about change-point detection for useR2017 https://github.com/tdhock/change-tutorial/blob/master/Submission.md

I would like to present how the iregnet package (and the cv.iregnet function in particular) can be used for supervised change-point detection, on the https://cran.r-project.org/package=neuroblastoma data set (which has only open intervals). Any idea how to fix that NaN error? Or if I were to make the bugfix, could you point me to where in the C++ code it should be?

anujkhare commented 7 years ago

@tdhock I am looking into it right now. I will update soon.

tdhock commented 7 years ago

hey @anujkhare I figured out that there are actually two separate issues.

  1. when there are some features which are constant, iregnet gives Error in eval(substitute(expr), envir, enclos) (from #2) : NANs produced --- actually this is easy to fix by simply filtering features that are constant prior to giving them to the C++ code. and actually this fixes the issue I was having on using cv.iregnet on my real data set.

  2. iregnet still gives an error on the toy data set with 4 or 5 rows. This is not much of a practical problem, but I still think it is worth investigating (the optimum is well-defined, and the new survreg returns a solution).

tdhock commented 7 years ago

Hey @anujkhare I am trying to run the cv.iregnet function on a bunch of real data sets, and I ran into the NAN produced error -- you should be able to reproduce it using the test/testthat/test_NAN.R file.

It seems that the problem has something to do with the 27th column of the feature matrix, because we don't get any error when I do the model fitting without that column. This feature has four distinct values, which are relatively close together -- do you see any reason why that would cause the error?

> table(realNAN$feature.mat[,27])

2.82436150160709 2.84106102867755 2.86027907958597 2.89730564649872 
              18               19               15               15 
anujkhare commented 5 years ago

@theadityasam have you moved all the changes from the NaN branch here as well? If so, let's close the other MR.

Could you remove the vignette from this branch and create a separate branch/MR for it?