anujkhare / iregnet


run on benchmark data sets #71

Open tdhock opened 5 years ago

tdhock commented 5 years ago

Hi @theadityasam @anujkhare, I wrote this R script https://github.com/tdhock/neuroblastoma-data/blob/master/iregnet.R to benchmark the performance of cv.iregnet.

The benchmark includes 33 different data sets, each with several train/test splits (designated by fold ID numbers in the folds.csv file).

To run the code, please clone that repo (git clone https://github.com/tdhock/neuroblastoma-data.git) and then run the iregnet.R script.

The script runs cv.iregnet with the scale fixed at 1 (iregnet.scale1), and then with estimate_scale=TRUE (iregnet.estscale).

It also runs two baselines: penaltyLearning::IntervalRegressionCV (penaltyLearning.scale1) and a constant model that always predicts 0.
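
For orientation, here is a minimal sketch of the three fits on synthetic interval data (this is not the actual iregnet.R; the cv.iregnet arguments family and scale_init are assumptions about its interface, only estimate_scale=TRUE comes from the description above):

library(iregnet)          # provides cv.iregnet
library(penaltyLearning)  # provides IntervalRegressionCV
set.seed(1)
X <- matrix(rnorm(50 * 4), nrow = 50)     # 50 observations, 4 features
lo <- drop(X %*% rnorm(4)) + rnorm(50)
Y <- cbind(lo, lo + runif(50, 1, 2))      # finite censoring intervals
## iregnet.scale1: scale fixed at 1 (argument names are assumptions)
fit.scale1 <- cv.iregnet(X, Y, family = "gaussian",
                         scale_init = 1, estimate_scale = FALSE)
## iregnet.estscale: scale estimated along with the coefficients
fit.estscale <- cv.iregnet(X, Y, family = "gaussian", estimate_scale = TRUE)
## penaltyLearning.scale1 baseline with default hyper-parameters
fit.baseline <- IntervalRegressionCV(X, Y)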

Initial results indicate room for improvement, especially in terms of speed. For example, I got these results on one data set for test fold ID 1:

> (pred.err <- do.call(rbind, pred.err.list))

         data.name test.fold              algorithm seconds threshold labels
1: ATAC_JV_adipose         1               constant    0.00 predicted    130
2: ATAC_JV_adipose         1         iregnet.scale1   51.80 predicted    130
3: ATAC_JV_adipose         1       iregnet.estscale   52.05 predicted    130
4: ATAC_JV_adipose         1 penaltyLearning.scale1    4.99 predicted    130
   possible.fp possible.fn  min.thresh  max.thresh  fp fn errors       FPR  tp
1:         130         128        -Inf 6.783764882 130  0    130 1.0000000 128
2:         130         128 -0.02067610 0.065002854  20 23     43 0.1538462 105
3:         130         128 -0.04882619 0.032067314  24 21     45 0.1846154 107
4:         130         128 -0.01867027 0.009351324  24 19     43 0.1846154 109
         TPR error.percent       auc
1: 1.0000000     100.00000 0.8881611
2: 0.8203125      33.07692 0.8785457
3: 0.8359375      34.61538 0.8791466
4: 0.8515625      33.07692 0.8780048
> 

The results indicate that iregnet (with fixed or estimated scale) is about 10 times slower than penaltyLearning (about 50 seconds vs 5 seconds).

I got a warning message that the algorithm failed to converge, so please investigate.

theadityasam commented 5 years ago

Okayy

tdhock commented 5 years ago

By the way, iregnet should be faster because it is written in C++, whereas penaltyLearning is written in R.

theadityasam commented 5 years ago

The link to the results of the tests I performed on my computer for different threshold values:

https://docs.google.com/spreadsheets/d/1ROq2AnmJ2sc2ZP-DjBwpEhyQ3eD-nLXNPosLDHFWWl0/edit?usp=sharing

tdhock commented 5 years ago

Summary of today's call.

Right now the algorithm is not converging for any data set or any threshold parameter. Why is it never converging?

How do we decide when to exit the coordinate descent loop? When the absolute change in every beta coefficient is less than the threshold, we have converged.
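
For example, a tiny illustration of that rule (illustrative names, not iregnet internals):

## declare convergence when the largest absolute change over all beta
## coefficients is below the threshold
has.converged <- function(beta.old, beta.new, threshold) {
  max(abs(beta.new - beta.old)) < threshold
}
has.converged(c(1, -2), c(1 + 1e-7, -2 - 2e-7), threshold = 1e-4)  # TRUE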

We need to keep track of which data sets the algorithm converges on: either catch the warning message or look at the return flag, and save that as a row in the results table.

TODO: dot plot with algorithm on the y axis, AUC on the x axis, and different data sets in panels/facets.

Please edit the iregnet.R script to store the threshold as another column (so we can then answer questions like: how does the threshold affect timing or accuracy?).

TODO: edit iregnet.R to run survreg on the benchmark data sets. Do we get the same solution as iregnet with lambda=0?

TODO: compare with glmnet(family="gaussian") when all outputs are uncensored. Do we converge? Do we get the same solution? (A sketch of both this and the survreg comparison follows.)
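
A hedged sketch of both comparisons on uncensored synthetic data; whether iregnet accepts lambda=0 directly, and its family argument, are assumptions about its interface:

library(survival)   # survreg
library(glmnet)
library(iregnet)
set.seed(1)
X <- matrix(rnorm(100 * 3), nrow = 100)
y <- drop(X %*% c(1, -2, 0.5)) + rnorm(100)
Y <- cbind(y, y)    # uncensored: lower limit equals upper limit
fit.survreg <- survreg(Surv(y) ~ X, dist = "gaussian")
fit.iregnet <- iregnet(X, Y, family = "gaussian", lambda = 0)  # unregularized fit (assumed interface)
fit.glmnet  <- glmnet(X, y, family = "gaussian", lambda = 0)
coef(fit.survreg)                            # survreg estimates
fit.iregnet$beta[, ncol(fit.iregnet$beta)]   # iregnet estimates at the last lambda
as.numeric(coef(fit.glmnet))                 # glmnet estimates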

theadityasam commented 5 years ago

iregnet does return the error status in its fit object. An error status of -1 indicates that the coordinate descent reached the maximum number of iterations and didn't converge; an error status of -3 indicates that NaNs were produced.

A test already compares the fits of iregnet and survreg and measures the difference in their beta values: https://github.com/anujkhare/iregnet/blob/master/tests/testthat/test_survival_glmnet.R#L18. Setting the tolerance to zero, we see that there are mismatches in the beta values:

Error: fit_s$coefficients not equal to fit_i$beta[, fit_i$num_lambda].
3/3 mismatches (average diff: 1.7e-07)
[1]  653 -  653 ==  4.87e-10
[2] -177 - -177 == -2.45e-07
[3]  315 -  315 ==  2.64e-07

The average difference between them decreases as the threshold is decreased.
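
For reference, the comparison in that test is essentially the following (fit_s is the survreg fit and fit_i the iregnet fit constructed in the linked test file):

library(testthat)
## tolerance = 0 is what exposes the ~1e-7 mismatches shown above
expect_equal(fit_s$coefficients, fit_i$beta[, fit_i$num_lambda], tolerance = 0)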

Also, how do I plot all the data sets in one image, as you @tdhock recommended?

tdhock commented 5 years ago

OK great. Then please monitor (via a column in the results) the return status for each benchmark data set.

Combine all results in a single data table via do.call(rbind, list.of.data.tables), then use ggplot + facet_wrap("data.set").
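
A minimal sketch of that step, which also gives the dot plot requested above (algorithm on the y axis, AUC on the x axis); the fake results table stands in for the combined benchmark results, and the status column and its error_status source field are assumptions about what gets recorded from the iregnet fit object:

library(data.table)
library(ggplot2)
## fake stand-in for do.call(rbind, pred.err.list); in the real script each
## row would also carry the convergence status taken from the fit object
results <- data.table(
  data.name = rep(c("data.set.A", "data.set.B"), each = 3),
  algorithm = rep(c("constant", "iregnet.estscale", "penaltyLearning.scale1"), times = 2),
  auc = runif(6, 0.5, 1),   # placeholder values, just so the plot renders
  status = 0L)              # e.g. fit$error_status (assumed field name)
ggplot(results, aes(x = auc, y = algorithm)) +
  geom_point() +
  facet_wrap("data.name")   # one panel per data set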

theadityasam commented 5 years ago

Okayyyy

tdhock commented 5 years ago

If you need help with ggplots, I would suggest starting with my animint2 tutorial. It is about interactive data viz, but all the concepts also apply to static data viz with ggplot2: http://members.cbio.mines-paristech.fr/~thocking/animint2-manual/Ch02-ggplot2.html

theadityasam commented 5 years ago

Sorry for the late reply to this issue.

As for what can be inferred from the plots: with a threshold of 1e-2, iregnet converges on all data sets and produces 0 errors, both when estimating the scale and when fixing the scale at 1. With a threshold of 1e-3, iregnet doesn't converge on two data sets; and with a threshold of 1e-4, it runs out of iterations on all of the data sets and produces NaNs on one of them. On the call, Anuj suggested trying different numbers of max iterations, and I'll be doing that using the same code as above (please suggest whether the code needs any changes and how I can make the animints better).
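
A hedged sketch of that experiment on synthetic data; the maxiter and threshold argument names, and the error_status field, are assumptions about the iregnet interface:

library(iregnet)
set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30)
lo <- drop(X %*% rnorm(5))
Y <- cbind(lo, lo + 1)
for (iters in c(1e3, 1e4, 1e5)) {
  fit <- iregnet(X, Y, family = "gaussian", threshold = 1e-4, maxiter = iters)
  ## record whether the fit converged at this iteration budget
  print(c(maxiter = iters, status = fit$error_status))
}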

tdhock commented 5 years ago

Again, the threshold of penaltyLearning::IntervalRegression* does not mean the same thing as the threshold for iregnet, so that makes the plot confusing to interpret. Better to just run penaltyLearning::IntervalRegressionCV with default values as a baseline.

Please use facet_wrap(scales="free") so that each panel has its own y axis; there is no reason to force them all to have the same y axis limits in this case.

Can you sort the panels / data sets by how badly iregnet is doing? E.g. the first panel is the worst data set and the last panel is the best data set. To do that you need to facet by a factor variable whose levels correspond to the order of presentation (the first level is the name of the worst data set, the second level is the second worst, etc.).
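
A sketch of that ordering, reusing the combined results table (and the data.name, algorithm, auc column names) from the earlier plotting sketch:

library(data.table)
library(ggplot2)
## mean iregnet AUC per data set, then order the factor levels worst-first
iregnet.auc <- results[algorithm == "iregnet.estscale",
                       .(mean.auc = mean(auc)), by = data.name]
ordered.levels <- iregnet.auc[order(mean.auc), data.name]   # worst data set first
results[, data.name := factor(data.name, levels = ordered.levels)]
ggplot(results, aes(x = auc, y = algorithm)) +
  geom_point() +
  facet_wrap("data.name", scales = "free")   # panels in worst-to-best order, free axes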

theadityasam commented 5 years ago

Time benchmarks, with penaltyLearning AUC as the baseline. Animint: link. Ggplot: image.

Error benchmarks. Animint: link. Ggplot: image.

anujkhare commented 5 years ago

@tdhock @theadityasam it looks like iregnet does converge, though it takes too many iterations. I'm not sure why this is happening right now.

theadityasam commented 5 years ago

Hey @tdhock @anujkhare, I believe I might have an insight into the problem; do correct me if I've made an error anywhere.

On inspecting the data sets on which the algorithm fails to converge, we see that in all of those cases the number of predictors is greater than the number of observations. One such example is the data set "H3K36me3_TDH_ENCODE", which has 26 rows and 65 columns. If we do a normal fit without cross validation, no warnings or errors are produced, but during cross validation the data set is split into 10 folds by default, reducing the number of training observations drastically, which (possibly) explains the errors.

According to the coxnet paper, when the number of predictors (p) exceeds the number of observations (n), the unregularized solution is undefined and hence the beta values shoot off to infinity. The paper argues that discarding the unregularized solution is acceptable, since any model selection criterion will choose a well-regularized fit. On printing the absolute changes in the beta values (which we are using as the stopping criterion), we see that the convergence errors are produced at higher lambda indices, i.e. lambda values closer to zero. And since the beta values fail to converge without strong regularization, those fits all reach the maximum number of iterations, drastically affecting the execution time.

Although we have implemented the coordinate descent of glmnet, the existing lambda path of iregnet differs from glmnet's. I'll be posting code that uses the glmnet lambda path soon, along with the results. Hopefully this fixes the issues.
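
For reference, a sketch of the glmnet-style lambda path from Friedman, Hastie and Tibshirani (2010); this is a generic illustration, not the iregnet implementation:

## start at the smallest lambda that zeroes all coefficients, then decrease on
## a log scale down to a fraction eps of that value (eps = 0.01 when p > n)
glmnet.lambda.path <- function(x, y, alpha = 1, num.lambda = 100,
                               eps = if (nrow(x) < ncol(x)) 0.01 else 1e-4) {
  x.std <- scale(x)                      # glmnet standardizes the features
  lambda.max <- max(abs(crossprod(x.std, y - mean(y)))) / (alpha * nrow(x))
  exp(seq(log(lambda.max), log(eps * lambda.max), length.out = num.lambda))
}
set.seed(1)
x <- matrix(rnorm(26 * 65), nrow = 26)   # p > n, like H3K36me3_TDH_ENCODE
y <- rnorm(26)
lambda.path <- glmnet.lambda.path(x, y)
head(lambda.path)                        # largest (strongest) penalties first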

tdhock commented 4 years ago

Work on this is happening in https://github.com/tdhock/neuroblastoma-data/pull/2.

theadityasam commented 4 years ago

Hey @tdhock, for cases where predictors > observations, I'll post some results after applying dimensionality reduction (PCA). Any thoughts on this?

tdhock commented 4 years ago

Typically PCA is not interesting with L1 regularization, because with L1 we want to interpret the weights/beta vector in terms of non-zero coefficients of the original variables, not PCA variables.

The better solution would be to just discard the small lambda/penalty values (they are probably not optimal for prediction anyway).
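
A sketch of that idea, reusing x, y and glmnet.lambda.path from the earlier lambda-path sketch; the 5% cutoff is arbitrary, and passing a user-supplied lambda vector to iregnet is an assumption about its interface:

lambda.path <- glmnet.lambda.path(x, y)
well.regularized <- lambda.path[lambda.path >= 0.05 * max(lambda.path)]
## fit only the well-regularized models, skipping the near-unregularized ones
## that fail to converge when p > n (assumed interface):
## fit <- iregnet(x, cbind(y, y), family = "gaussian", lambda = well.regularized)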