avinashbarnwal / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow
https://xgboost.ai/
Apache License 2.0

Summary of GSOC2019 calls #1

Open avinashbarnwal opened 5 years ago

avinashbarnwal commented 5 years ago

21 May 2019, relevant information -

  1. What should the xgb DMatrix labels be for interval regression? Ans: create two columns for all the times, i.e. a 2-column matrix representation of the outputs (labels); see the sketch after this list.

    1. Un-censored output, event is observed, e.g. y_i = 5: (5, 5)

      survival::Surv(10, 10, type="interval2") [1] 10

    2. Left-censored output, event not observed, but we know it happened some time before t_i = 5: y_i = (-Inf = \underline y_i, 5 = t_i = \overline y_i)

      survival::Surv(-Inf, 10, type="interval2") [1] 10-

    3. Right-censored output, event not observed, but we know it happened some time after t_i = 5: y_i = (5, Inf)

      survival::Surv(5, Inf, type="interval2") [1] 5+

    4. Interval-censored output, event not observed, but we know it happened some time in y_i = (5, 10)

      survival::Surv(5, 10, type="interval2") [1] [5, 10]

There is no need for the survival object; it is more of a legacy interface.

  2. How should sigma be handled in AFT? Ans: treat it as a hyperparameter.

  3. What is the dimension of the predicted value in interval regression? Ans: it is always a single real value, not an interval.
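
For illustration, a minimal sketch of that 2-column label representation in R (the label values are made up):

    # one row per observation: (lower, upper) bounds on the event time
    y <- rbind(
      c(5,    5),   # un-censored: event observed at t = 5
      c(-Inf, 5),   # left-censored: event some time before t = 5
      c(5,  Inf),   # right-censored: event some time after t = 5
      c(5,   10))   # interval-censored: event some time in (5, 10)
    colnames(y) <- c("lower", "upper")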

Relevant documents:

  1. https://github.com/tdhock/aft-poster/blob/master/HOCKING-AFT.pdf
  2. http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
  3. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-269

Please Check. @tdhock, @hcho3

hcho3 commented 5 years ago

May 28, 2019 Update

  1. Done: re-created the loss charts from Toby's paper to show that the loss functions are reasonable
  2. Done: submitted an RFC to the public XGBoost repo
  3. In progress: Python proof-of-concept for end-to-end gradient boosting with the survival loss
  4. In progress: derive formulas for AFT

@avinashbarnwal @tdhock

tdhock commented 5 years ago

hi @avinashbarnwal can you please post the source code and rendered images for the loss charts you created?

hcho3 commented 5 years ago

@tdhock See https://github.com/avinashbarnwal/gsoc/tree/master/AFT/R

tdhock commented 5 years ago

looks like a good start, but the plot is incorrect for uncensored data: it should look like the square loss around the label for the normal distribution. also, you should try creating a faceted ggplot, with one panel per censoring type. for inspiration, here is the code I used for the 1-page AFT poster: https://github.com/tdhock/aft-poster/blob/master/figure-loss.R

avinashbarnwal commented 5 years ago

Thanks, Prof. @tdhock. I am looking into it.

tdhock commented 5 years ago

looks better @avinashbarnwal! glad to see that you got the faceted ggplots working.

however it looks like there is a problem with your computation of the logistic loss for the uncensored output -- it should be minimal at the label.

tdhock commented 5 years ago

also for next week's homework please check your formulas for the gradient in the overleaf. add rows of plots to the figure showing the loss functions and their derivatives, using facet_grid(fun ~ type) -- rows for the different functions (loss, gradient, hessian); see the sketch below.

  1. first row should be the loss function
  2. second row should be the gradient
  3. third row should be the hessian
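
For illustration, a minimal sketch of the suggested layout (the data frame fun.df and its columns are assumptions, not code from the thread):

    library(ggplot2)
    # fun.df columns: pred (x axis), value (function evaluated at pred),
    # fun ("loss", "gradient", "hessian"), type (censoring type of the label)
    ggplot(fun.df, aes(pred, value)) +
      geom_line() +
      facet_grid(fun ~ type, scales = "free_y")
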
avinashbarnwal commented 5 years ago

Prof. @tdhock, @hcho3

I have made the loss function with all the changes required. Here is the link for the Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_aft.png

I have changed the code as well. Code Link - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/assigment1.R

Prof. @tdhock, please check the formula for the Normal - Uncensored part. I have used a formula different from the one given in https://github.com/avinashbarnwal/GSOC-2019/blob/master/paper/HOCKING-AFT.pdf; instead, I have used the formula given in this document: http://home.iitk.ac.in/~kundu/paper146.pdf.

tdhock commented 5 years ago

looks better Avinash. but why doesn't the uncensored loss go to zero? (it should...)

about the normal - censored loss, you should double check your work by using the normal CDF (pnorm in R)
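
For illustration, a minimal sketch of such a check for the left-censored case (sigma, the label bound, and the prediction eta are made-up values):

    # left-censored: event happened some time before t.upper, so the
    # log-normal AFT likelihood is P(log T <= log(t.upper)), i.e. pnorm(...)
    sigma <- 1
    t.upper <- 5
    eta <- log(4)   # real-valued prediction
    loss.left <- -log(pnorm((log(t.upper) - eta) / sigma))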

avinashbarnwal commented 5 years ago

Prof. @tdhock, the uncensored loss function for the Normal is

    -log(1/(t.lower * sigma * sqrt(2*pi)) * exp(log(t.lower/y.hat)^2 / (-2 * sigma^2)))

We have a constant term, -log(1/(t.lower * sigma * sqrt(2*pi))), irrespective of y.hat. I think this is what makes the loss non-zero.

Similarly for the Logistic.

Sorry, I meant the Normal Uncensored formula in the above.

Normal Uncensored old formula: -log(1/(y.hat * sigma * sqrt(2*pi)) * exp(log(t.lower/y.hat)^2 / (-2 * sigma^2)))

Normal Uncensored new formula: -log(1/(t.lower * sigma * sqrt(2*pi)) * exp(log(t.lower/y.hat)^2 / (-2 * sigma^2)))

tdhock commented 5 years ago

right but we usually subtract away the constant terms. if you do that you should recover the square loss which is 0 at the min.
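
For illustration, a minimal sketch of that check for the uncensored normal case (made-up values):

    # dropping the constant leaves the square loss on the log scale,
    # which is 0 when y.hat equals the label t.lower
    sigma <- 1
    t.lower <- 5
    y.hat <- exp(seq(0, 3, by = 0.01))
    loss.shifted <- log(t.lower / y.hat)^2 / (2 * sigma^2)
    y.hat[which.min(loss.shifted)]   # approximately 5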

avinashbarnwal commented 5 years ago

Hi Prof. @tdhock, @hcho3

Please find the updated document for AFT (https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Accelerated_Failure_Time.pdf). I think I have made a mistake here, as I am taking the gradient and hessian with respect to y.hat, not X\beta.

For the least-squares loss function, the gradient and hessian with respect to y.hat and with respect to X\beta are the same, but this starts to matter once link functions are involved. Similarly, for classification we take the gradient and hessian with respect to X\beta, not y.hat.

For reference, please check the Bernoulli part of this doc: https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf

This also calls for a decision on notation: in the survival document we use f for the pdf, whereas in the document above f(x_i) stands for X\beta.

Please let me know your thoughts.

tdhock commented 5 years ago

I don't think there is an issue, because the link function is always the identity for the normal and logistic models. again, we should be able to use the formulas in the survival manual.

avinashbarnwal commented 5 years ago

Prof. @tdhock, I think we need to calculate the gradient with respect to log(y.hat), not y.hat, since log(y.hat) = X\beta. Please let me know.

tdhock commented 5 years ago

section 6.8 of the survival manual gives derivatives with respect to eta, which is the real-valued prediction.

tdhock commented 5 years ago

in your notation you use log(y.hat) for the real valued prediction, so that is the same as eta from the survival manual.

in your PDF please do not use x\beta as that is only valid for linear models
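
For illustration, a minimal sketch of the uncensored normal case in that parameterization (my notation, not code from the manual):

    # with eta = log(y.hat), the uncensored log-normal loss is, up to constants,
    #   l(eta) = (log(t) - eta)^2 / (2 * sigma^2)
    grad.uncensored <- function(eta, t, sigma) -(log(t) - eta) / sigma^2
    hess.uncensored <- function(eta, t, sigma) rep(1 / sigma^2, length(eta))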

avinashbarnwal commented 5 years ago

Prof. @tdhock, thanks. I will use eta for log(y.hat) and take the gradient and Hessian with respect to it. In my document, I was taking the gradient and Hessian with respect to y.hat, not eta. I will make the correction.

avinashbarnwal commented 5 years ago

Prof. @tdhock, @hcho3, I have created the document and plot for the loss, negative gradient, and hessian.

Document - https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Accelerated_Failure_Time.pdf
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/combined_assignment.R

Please let me know your thoughts.

tdhock commented 5 years ago

the plot looks good, except for the hessian for the interval-censored outputs at small predictions -- it seems to be too large. can you please double check? it should look like the right-censored outputs (i.e. those with a finite lower limit) on the left side of the plot

avinashbarnwal commented 5 years ago

Thanks, Prof. @tdhock. I have changed both the code and the plot; I had one sign wrong in the formula. Please recheck the plot (https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png). Do you think the interval hessian is correct for the Normal distribution?

avinashbarnwal commented 5 years ago

Hi @hcho3 and Prof. @tdhock,

Please find the Python implementation of gradient boosting for AFT:

Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/gb_aft.ipynb
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False.png

Please let me know your thoughts.

hcho3 commented 5 years ago

We'll need to check the POC implementation, since the training loss should trend down, not up. Also, let's plot mean(abs(log(Y) - eta)), since the predicted score eta should try to match log(Y).
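
For illustration, a minimal sketch of that diagnostic (y and eta.by.round are assumed names for the labels and the per-iteration predictions):

    # mean absolute error on the log scale, tracked per boosting iteration
    mae.log <- sapply(eta.by.round, function(eta) mean(abs(log(y) - eta)))
    plot(mae.log, type = "l", xlab = "boosting iteration",
         ylab = "mean |log(y) - eta|")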

avinashbarnwal commented 5 years ago

Hi @hcho3 and Prof. @tdhock,

As discussed with @hcho3, I am working on documentation for binomial loss and POC for the same.

avinashbarnwal commented 5 years ago

Hi @hcho3 and Prof. @tdhock,

Please find the document, code, and plot for the binomial loss below:

Document - https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Binomial_Loss.pdf
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.R
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png

tdhock commented 5 years ago

there is still something wrong with the interval hessian for small predicted values on https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png -- it should be constant as the prediction goes to zero (log(prediction) goes to -Inf)

tdhock commented 5 years ago

binomial loss looks reasonable, except for the x axis label.

please (1) use facet_grid, (2) use more grid points, and (3) maybe use different columns or colors for different labels

avinashbarnwal commented 5 years ago

Hi Prof. @tdhock and @hcho3 ,

Please check the R plots and Python Plots again -

R Plots -

  1. AFT - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png
  2. Binomial Loss - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png

Python Plots-

  1. Log loss, Data Type - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_LogLoss_Data_Mixed.png
  2. Mae, Data Type - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Mixed.png
  3. Log loss, Data Type - Uncen - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Logloss_Data_Uncensored.png
  4. Mae, Data Type - Uncen - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Uncensored.png
tdhock commented 5 years ago

aft hessian looks good now but binomial hessian has a problem: should be zero as prediction goes to Inf
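
For reference, a minimal sketch of the binomial quantities in terms of the real-valued prediction eta (my notation; additive constants dropped):

    # negative log-likelihood for y successes out of n, with p = 1/(1 + exp(-eta))
    binom.loss <- function(eta, y, n) -y * eta + n * log(1 + exp(eta))
    binom.grad <- function(eta, y, n) { p <- 1/(1 + exp(-eta)); n * p - y }
    binom.hess <- function(eta, y, n) { p <- 1/(1 + exp(-eta)); n * p * (1 - p) }
    # as eta -> Inf, p -> 1 and the hessian n * p * (1 - p) -> 0, as noted above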

tdhock commented 5 years ago

paper with an example of a real-world problem with only left, right, and interval censored labels: http://proceedings.mlr.press/v28/hocking13.html

tdhock commented 5 years ago

discussed:

  1. compute the loss via dnorm or via log/exp? in C++ the normal distribution functions are provided in the standard library, but the logistic distribution functions are not. they are provided in Boost, but we don't want to depend on that (see the sketch below).
  2. check the work on the binomial loss via the duplicate-data trick.
  3. how to specify binomial labels? the most standard way would be a two-column matrix of #successes and #trials, as in https://en.wikipedia.org/wiki/Binomial_distribution
  4. how to specify censored labels? a two-column matrix; no event indicator needed.
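
For illustration, a minimal sketch of the log/exp route for the logistic distribution, checked against R's built-ins:

    # standard logistic density and CDF written with only exp,
    # matching dlogis() and plogis()
    logis.pdf <- function(z) { e <- exp(-z); e / (1 + e)^2 }
    logis.cdf <- function(z) 1 / (1 + exp(-z))
    stopifnot(all.equal(logis.pdf(0.3), dlogis(0.3)),
              all.equal(logis.cdf(0.3), plogis(0.3)))
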
avinashbarnwal commented 5 years ago

Hi, Prof @tdhock and @hcho3,

Please check the updated plot and code for binomial loss.

Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.R
Doc - https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Binomial_Loss.pdf

tdhock commented 5 years ago

that looks more reasonable

suggestions: (1) more grid points, (2) no need to use aes(color) because you already put the different functions in different panels, (3) add more columns for different labels, e.g. y=5, n=10; y=0, n=10; y=2, n=10

avinashbarnwal commented 5 years ago

Hi Prof. @tdhock ,

Please find the updated binomial loss:

R Binomial Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png
AFT Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png

Py updated plots:

LogLoss - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_LogLoss_Data_Mixed.png
Mae - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Uncensored.png
LogLoss - Uncensored - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Logloss_Data_Uncensored.png
Mae - Uncensored - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Uncensored.png

C++ functions for the pdf and cdf of the Normal and Logistic distributions:

https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/C%2B%2B

I have tested with x = 0 for each distribution, and the results match R's built-in functions.

Thanks, @hcho3 for changing the hessian formula for interval data.

hcho3 commented 5 years ago

@avinashbarnwal The AFT loss plots now look good. I'll take a look at the C++ code soon.

avinashbarnwal commented 5 years ago

Hi Prof. @tdhock and @hcho3

I have been working on implementing the loss, negative gradient, and hessian functions for AFT in C++, and testing them through plots in Python.

For details, please see https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/C%2B%2B

Please let me know your thoughts.

hcho3 commented 5 years ago

@avinashbarnwal I modified your notebook slightly to plot the uncensored AFT loss with yhat on the X axis (plot attached).

The curve for the normal distribution should not drop off like that.

hcho3 commented 5 years ago

In general, let's try to reproduce the R plots using C++ code. That way, we have assurance that C++ code does what R code does.

tdhock commented 5 years ago

right, the loss function should be convex (for a fixed scale parameter)

tdhock commented 5 years ago

hi there @avinashbarnwal is the R learning algo for AFT losses working yet?

When it is, I would suggest modifying https://github.com/tdhock/neuroblastoma-data/blob/master/iregnet.R, which I wrote for benchmarking a different learning algorithm for censored outputs on 33 different labeled data sets.

hcho3 commented 5 years ago

@tdhock Thanks for the link. The datasets should come in handy.

avinashbarnwal commented 5 years ago

Prof. @tdhock ,

For both AFT and binomial loss, the R learning algorithms (loss, negative gradient, and hessian) are working. I am currently writing the C++ functions for the binomial loss. Soon we will have everything in the xgboost package.

tdhock commented 5 years ago

ok where is the fork/branch with the AFT models? can I install it?

Here is a new benchmark script https://github.com/tdhock/neuroblastoma-data/blob/master/xgboost.R#L41

The benchmark includes 33 different data sets, each with several train/test splits (designated by fold ID numbers in folds.csv file).

To run the code, please fork and clone that repo, then modify the "xgboost.R" script so that xgboost works on these learning problems. (run the script / start R in the neuroblastoma-data directory)

The script also runs two baselines: penaltyLearning::IntervalRegressionCV (penaltyLearning.scale1) and a constant model (always predict 0).

hcho3 commented 5 years ago

@avinashbarnwal

> R learning algorithms (loss, negative gradient, and hessian) are working

I don't think you have a boosting PoC in R? We do have a boosting PoC in Python, however.

avinashbarnwal commented 5 years ago

Prof. @tdhock ,

We haven't added the AFT loss function to the xgboost package yet. We are testing it first in C++, and then I will add it to the package. Please let me know if you want a proof of concept in R.

hcho3 commented 5 years ago

@avinashbarnwal For now, can we run Python boosting PoC on the 33 datasets?

tdhock commented 5 years ago

no problem if it is not ready yet, but you may consider using those data sets to test/debug the algo.

if you don't use that R script then you need to know that

  1. the input/feature matrix is in data/DATA_NAME/inputs.csv.xz -- some features are missing (NA) or infinite in some rows -- you can ignore those features/columns.
  2. the output/label matrix is in data/DATA_NAME/outputs.csv.xz (see the loading sketch after the list below).
  3. it would be good to compare to another learner, at least a baseline constant prediction; better still would be to compare against a regularized linear model and compute test AUC, as in my R script.
  4. suggested cross-validation folds are given in folds.csv files, e.g.
    > folds.csv.vec <- Sys.glob("data/*/cv/*/folds.csv")
    > folds.csv.vec
    [1] "data/ATAC_JV_adipose/cv/equal_labels/folds.csv"         
    [2] "data/CTCF_TDH_ENCODE/cv/equal_labels/folds.csv"         
    [3] "data/H3K27ac-H3K4me3_TDHAM_BP/cv/equal_labels/folds.csv"
    [4] "data/H3K27ac_TDH_some/cv/equal_labels/folds.csv"        
    [5] "data/H3K27me3_RL_cancer/cv/equal_labels/folds.csv"      
    [6] "data/H3K27me3_TDH_some/cv/equal_labels/folds.csv"       
    [7] "data/H3K36me3_AM_immune/cv/equal_labels/folds.csv"      
    [8] "data/H3K36me3_TDH_ENCODE/cv/equal_labels/folds.csv"     
    [9] "data/H3K36me3_TDH_immune/cv/equal_labels/folds.csv"     
    [10] "data/H3K36me3_TDH_other/cv/equal_labels/folds.csv"      
    [11] "data/H3K4me1_TDH_BP/cv/equal_labels/folds.csv"          
    [12] "data/H3K4me3_PGP_immune/cv/equal_labels/folds.csv"      
    [13] "data/H3K4me3_TDH_ENCODE/cv/equal_labels/folds.csv"      
    [14] "data/H3K4me3_TDH_immune/cv/equal_labels/folds.csv"      
    [15] "data/H3K4me3_TDH_other/cv/equal_labels/folds.csv"       
    [16] "data/H3K4me3_XJ_immune/cv/equal_labels/folds.csv"       
    [17] "data/H3K9me3_TDH_BP/cv/equal_labels/folds.csv"          
    [18] "data/detailed/cv/R-3.6.0-chrom/folds.csv"               
    [19] "data/detailed/cv/R-3.6.0-profileID/folds.csv"           
    [20] "data/detailed/cv/R-3.6.0-profileSize/folds.csv"         
    [21] "data/detailed/cv/R-3.6.0-sequenceID/folds.csv"          
    [22] "data/detailed/cv/chrom/folds.csv"                       
    [23] "data/detailed/cv/profileID/folds.csv"                   
    [24] "data/detailed/cv/profileSize/folds.csv"                 
    [25] "data/detailed/cv/sequenceID/folds.csv"                  
    [26] "data/systematic/cv/R-3.6.0-chrom/folds.csv"             
    [27] "data/systematic/cv/R-3.6.0-profileID/folds.csv"         
    [28] "data/systematic/cv/R-3.6.0-profileSize/folds.csv"       
    [29] "data/systematic/cv/R-3.6.0-sequenceID/folds.csv"        
    [30] "data/systematic/cv/chrom/folds.csv"                     
    [31] "data/systematic/cv/profileID/folds.csv"                 
    [32] "data/systematic/cv/profileSize/folds.csv"               
    [33] "data/systematic/cv/sequenceID/folds.csv"                
    > 
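
For illustration, a minimal sketch of loading one of these data sets (the column filtering is an assumption based on point 1 above):

    # read the compressed feature/label matrices for one data set
    inputs  <- read.csv("data/ATAC_JV_adipose/inputs.csv.xz")
    outputs <- read.csv("data/ATAC_JV_adipose/outputs.csv.xz")
    folds   <- read.csv("data/ATAC_JV_adipose/cv/equal_labels/folds.csv")
    # ignore features with missing (NA) or infinite values, as in point 1
    keep <- sapply(inputs, function(x) is.numeric(x) && all(is.finite(x)))
    inputs <- inputs[, keep]
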
avinashbarnwal commented 5 years ago

Thanks, Prof. @tdhock and @hcho3. I will show the results for the given datasets.

tdhock commented 5 years ago

Also, if we have time, we should consider supporting the extreme value distribution (in addition to the logistic and normal), since in stats people often like to use Weibull/exponential models, and those are special cases of the EV distribution.
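
For illustration, a minimal sketch of that distribution (the standard minimum extreme value distribution, which the Weibull AFT model uses on the log-time scale):

    # density and CDF of the standard (minimum) extreme value distribution
    ev.pdf <- function(z) exp(z - exp(z))
    ev.cdf <- function(z) 1 - exp(-exp(z))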

avinashbarnwal commented 5 years ago

Hi Prof. @tdhock and @hcho3,

I have added the C++ code for the binomial loss and a notebook for validating it. Links:

C++ - https://github.com/avinashbarnwal/GSOC-2019/tree/master/BinomialLoss/C%2B%2B
Notebook - https://github.com/avinashbarnwal/GSOC-2019/blob/master/BinomialLoss/Python%20Notebook/Visualizing%20distributions.ipynb

I am in the process of testing AFT on the 33 datasets. Link - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/test/data/neuroblastoma-data-master/src/notebook/001_data_massage.ipynb

For the extreme value distribution, I will look into it, since it is covered in the survival document you shared. I will change the R, Python, and C++ plots accordingly.

hcho3 commented 5 years ago

@avinashbarnwal I left a comment about the binomial loss implementation: avinashbarnwal/GSOC-2019#5. We do not want to use n!/((n-r)! r!) because 1) it involves a lot of redundant computation and 2) it may cause overflow.
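
For illustration, a minimal sketch of the usual fix, using log-gamma instead of factorials (base R's lchoose() does the same thing):

    # log binomial coefficient without computing any factorial
    log.choose <- function(n, r) lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
    stopifnot(all.equal(log.choose(1000, 500), lchoose(1000, 500)))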