anujkhare / iregnet

7 stars 12 forks source link

Nan #63

Closed theadityasam closed 5 years ago

theadityasam commented 5 years ago

The NAN issue will be fixed and cv function will be made functional

tdhock commented 5 years ago

TODO for the cv function ... please do NOT re-shuffle folds if there is one that has either all left censored outputs or all right censored outputs. in that case please just stop with an informative error that tells the user they need to manually specify the foldid argument. here is some similar code in one of my other packages, https://github.com/tdhock/penaltyLearning/blob/master/R/IntervalRegression.R#L189

tdhock commented 5 years ago

also @anujkhare can you please give @theadityasam push access to your repo so he can create future branches/PRs in your repo? (I think it would make it simpler to keep track of the project if everything is in your repo)

theadityasam commented 5 years ago

Okay, won't re-shuffle the folds. Also, for the censored data error message, if I implement a method to check in the R code itself, I'll have to go through the dataset twice - once for checking whether the model can be fit and then again while assigning the censorship types in get_censoring_types of iregnet_fit.cpp which might affect the run time of iregnet. https://github.com/theadityasam/iregnet/blob/nan/src/iregnet_fit.cpp#L401

I believe, it would be better to write a new piece of R code that checks for the censorship condition and assigns the censorship type done similar to what is done in https://github.com/theadityasam/iregnet/blob/nan/src/iregnet_fit.cpp#L401 The result will be then passed as argument to the C++ code. This way, we don't need to go through the dataset twice.

tdhock commented 5 years ago

checking in R code will not result in any significant slowdown if you use vector operations

theadityasam commented 5 years ago

Okayy, will write a new R snippet for the checking

tdhock commented 5 years ago

try debugging using print statements or gdb https://tdhock.github.io/blog/2019/gdb/

anujkhare commented 5 years ago

@tdhock @theadityasam investigated the cases where NaNs are still produced. I think that they're cases where the unregularized solution does not exist. survreg also produces inf values.

The ideal behavior is that we should fit till the lambda value that is well-behaved, throw a warning for the next lambda that produces a NaN, and then return an iregnet object still. There may be better ways to calculate the lambda path, but I'm not sure yet.

@theadityasam - anything to add? We should make the corresponding changes for this.

anujkhare commented 5 years ago

There is a reference to this in section 2.3 of this paper about glmnet. For the cases where the number of covariates (p) > number of samples (n), the unregularized solution is undefined (beta shoots to inf).

They ignore solutions for lambda values close to 0 in such cases.

tdhock commented 5 years ago

that is interesting. do you have any concrete p>n examples that can be coded as tests? it makes sense to me to only consider large lambda values in that case, and still return an iregnet object, with a warning.

anujkhare commented 5 years ago

Closing since this has been merged into the cv branch: #54 .