avinashbarnwal opened this issue 5 years ago
May 28, 2019 Update
@avinashbarnwal @tdhock
hi @avinashbarnwal can you please post the source code and rendered images for the loss charts you created?
looks like a good start but the plot is incorrect for uncensored data. it should look like the square loss around the label for the normal distribution. also you should try creating a facetted ggplot, with one panel per censoring type. for inspiration here is the code I used for the 1-page AFT poster, https://github.com/tdhock/aft-poster/blob/master/figure-loss.R
Thanks, Prof. @tdhock I am looking into it.
looks better @avinashbarnwal ! glad to see that you got the facetted ggplots working.
however it looks like there is a problem with your computation of the logistic loss for the uncensored output -- it should be minimal at the label.
also for next week's homework please check your formulas for the gradient in the overleaf. add a row of plots to the figure that shows the loss functions. use facet_grid(fun ~ type) -- rows for different functions (loss, gradient, hessian), columns for different censoring types.
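For illustration, a minimal sketch of that layout (my sketch, not the assignment code; sigma = 1 and a single uncensored label are assumed, and only the normal-uncensored column is filled in):

```r
library(ggplot2)
## Minimal sketch of facet_grid(fun ~ type): rows = loss/gradient/hessian,
## columns = censoring type. Only the normal-uncensored column is shown.
sigma <- 1
t.label <- 1                     # assumed uncensored label
eta <- seq(-5, 5, by = 0.05)     # real-valued prediction, eta = log(y.hat)
loss <- -dnorm(log(t.label), mean = eta, sd = sigma, log = TRUE)
grad <- -(log(t.label) - eta) / sigma^2   # d loss / d eta
hess <- rep(1 / sigma^2, length(eta))     # d^2 loss / d eta^2
panels <- data.frame(
  eta = rep(eta, 3),
  value = c(loss, grad, hess),
  fun = rep(c("loss", "gradient", "hessian"), each = length(eta)),
  type = "uncensored")
ggplot(panels, aes(eta, value)) +
  geom_line() +
  facet_grid(fun ~ type, scales = "free_y")
```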
Prof. @tdhock, @hcho3
I have made the loss function with all the changes required. Here is the link for the Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_aft.png
I have changed the code as well. Code Link - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/assigment1.R
Prof. @tdhock, please check the formula for the Normal - Uncensored part. Instead of the formula given in https://github.com/avinashbarnwal/GSOC-2019/blob/master/paper/HOCKING-AFT.pdf, I have used the formula given in this document: http://home.iitk.ac.in/~kundu/paper146.pdf.
looks better Avinash. but why doesn't the uncensored loss go to zero? (it should...)
about the normal - censored loss, you should double check your work by using the normal CDF (pnorm in R)
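For example, a quick pnorm-based check might look like this (a sketch under assumed interval limits, not the assignment code):

```r
## Censored normal AFT losses written directly with the normal CDF (pnorm).
## eta = log(y.hat) is the real-valued prediction; sigma is the fixed scale.
sigma <- 1
eta <- seq(-5, 5, by = 0.1)
t.lower <- 1    # assumed finite lower limit
t.upper <- 5    # assumed finite upper limit
z.lower <- (log(t.lower) - eta) / sigma
z.upper <- (log(t.upper) - eta) / sigma
interval.loss <- -log(pnorm(z.upper) - pnorm(z.lower))  # t in [t.lower, t.upper]
right.loss    <- -log(1 - pnorm(z.lower))               # t.upper = Inf
left.loss     <- -log(pnorm(z.upper))                   # t.lower = 0
```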
Prof. @tdhock, the uncensored Normal loss function is
-log(1/(t.lower * sigma * sqrt(2*pi)) * exp((log(t.lower/y.hat))^2 / (-2 * sigma^2)))
We have a constant term -log(1/(t.lower * sigma * sqrt(2*pi))), irrespective of y.hat. This makes the loss non-zero; that is my thinking.
Similarly for Logistic.
Sorry, I meant Normal Uncensored formula in the above.
Normal Uncensored old formula: -log(1/(y.hat * sigma * sqrt(2*pi)) * exp((log(t.lower/y.hat))^2 / (-2 * sigma^2)))
Normal Uncensored new formula: -log(1/(t.lower * sigma * sqrt(2*pi)) * exp((log(t.lower/y.hat))^2 / (-2 * sigma^2)))
right but we usually subtract away the constant terms. if you do that you should recover the square loss which is 0 at the min.
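Spelling that step out (my expansion of the formula above, with the prediction on the log scale):

```latex
-\log f(t)
  = \underbrace{\log\!\left(t\,\sigma\sqrt{2\pi}\right)}_{\text{constant in }\hat y}
  + \frac{\left(\log t - \log \hat y\right)^2}{2\sigma^2}
```

so after subtracting the constant, what remains is the square loss in log t - log y.hat, which is 0 at log y.hat = log t.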
Hi Prof. @tdhock, @hcho3
Please find the updated document for AFT (https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Accelerated_Failure_Time.pdf). I think I have made a mistake here, as I am taking the gradient and hessian with respect to y.hat, not X\beta.
For the least squares loss function, the gradient and hessian with respect to y.hat and X\beta are the same, but it starts to matter when link functions are involved. Similarly, for classification we take the gradient and hessian with respect to X\beta, not y.hat.
For reference, please check the Bernoulli part of this doc: https://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf.
This also raises a notation question: in the survival document we use f for the pdf, while in the above document f(x_i) denotes X\beta.
Please let me know your thoughts.
I don't think there is an issue because the link function is always the identity with the normal and logistic models. again we should be able to use the formulas in the survival manual.
Prof. @tdhock, I think we need to calculate the gradient with respect to log(y.hat), not y.hat, since log(y.hat) = X\beta. Please let me know.
section 6.8 of survival manual gives derivatives with respect to eta, which is the real-valued prediction.
in your notation you use log(y.hat) for the real valued prediction, so that is the same as eta from the survival manual.
in your PDF please do not use x\beta as that is only valid for linear models
Prof. @tdhock, Thanks. I will use eta for log(y.hat) and take gradient and Hessian based on that. In my document, I was taking gradient and Hessian based on y.hat, not on eta. I will make the correction.
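For the record, the chain rule connecting the two parametrizations (with eta = log(y.hat), so y.hat = e^eta):

```latex
\frac{\partial \ell}{\partial \eta}
  = \hat y \,\frac{\partial \ell}{\partial \hat y},
\qquad
\frac{\partial^2 \ell}{\partial \eta^2}
  = \hat y^2 \,\frac{\partial^2 \ell}{\partial \hat y^2}
  + \hat y \,\frac{\partial \ell}{\partial \hat y}.
```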
Prof. @tdhock, @hcho3, I have created the document and plot for loss, negative gradient, and hessian.
Document - https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Accelerated_Failure_Time.pdf
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/combined_assignment.R
Please let me know your thoughts.
the plot looks good except for the hessian for the interval censored outputs for small predictions -- it seems to be too large. can you please double check? it should look like right censored outputs (i.e. with finite lower limit) on the left side of the plot
Thanks, Prof. @tdhock. I have changed both the code and the plot; I had one sign wrong in the formula. Please recheck the plot (https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png). Do you think the interval hessian is correct for the Normal distribution?
Hi @hcho3 and Prof. @tdhock,
Please find the python implementation of gradient boosting for AFT.
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/gb_aft.ipynb
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False.png
Please let me know your thoughts.
We'll need to check the POC implementation, since the training loss should trend down, not up.
Also, let's plot mean(abs(log(Y) - eta)), since the predicted score eta should try to match log(Y).
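That metric is a one-liner (a sketch; y and eta stand for the label and prediction vectors):

```r
## Mean absolute error between log(label) and the real-valued prediction eta;
## this should trend down as boosting iterations proceed.
mae.log <- function(y, eta) mean(abs(log(y) - eta))
```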
Hi @hcho3 and Prof. @tdhock,
As discussed with @hcho3, I am working on documentation for binomial loss and POC for the same.
Hi @hcho3 and Prof. @tdhock,
Please find the document, code, and plot for the binomial loss below:
Document - https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Binomial_Loss.pdf
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.R
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png
there is still something wrong with the interval hessian for small predicted values on https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png -- it should become constant as the prediction goes to zero, i.e. as log(prediction) goes to -Inf
binomial loss looks reasonable, except for the x axis label.
please (1) use facet_grid and (2) use more grid points and (3) maybe have different columns or colors for different labels
Hi Prof. @tdhock and @hcho3,
Please check the R plots and Python plots again.
R plots:
- AFT - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png
- Binomial loss - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png
Python plots:
- Log loss, Data Type - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_LogLoss_Data_Mixed.png
- Mae, Data Type - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Mixed.png
- Log loss, Data Type - Uncen - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Logloss_Data_Uncensored.png
- Mae, Data Type - Uncen - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Uncensored.png
aft hessian looks good now but binomial hessian has a problem: should be zero as prediction goes to Inf
paper with an example of a real-world problem with only left, right, and interval censored labels: http://proceedings.mlr.press/v28/hocking13.html
Hi, Prof @tdhock and @hcho3,
Please check the updated plot and code for binomial loss.
Plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png
Code - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.R
Doc - https://github.com/avinashbarnwal/GSOC-2019/blob/master/doc/Binomial_Loss.pdf
that looks more reasonable
suggestions: (1) more grid points, (2) no need to use aes(color) because you already put the different functions in different panels, (3) add more columns for different labels, e.g. y=5, n=10; y=0, n=10;, y=2, n=10
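A sketch of suggestions (1)-(3), assuming the binomial loss is written as a function of the logit eta (my parametrization, not necessarily the one in binomial_loss.R):

```r
library(ggplot2)
## Binomial negative log-likelihood (constant term dropped) as a function of
## the logit eta, one column per suggested label: (y, n) = (5,10), (0,10), (2,10).
eta <- seq(-6, 6, by = 0.02)               # (1) more grid points
labels <- data.frame(y = c(5, 0, 2), n = 10)
panels <- do.call(rbind, lapply(seq_len(nrow(labels)), function(i) {
  y <- labels$y[i]; n <- labels$n[i]
  data.frame(eta, y, n, loss = -(y * eta - n * log1p(exp(eta))))
}))
ggplot(panels, aes(eta, loss)) +
  geom_line() +                            # (2) no aes(color) needed
  facet_grid(. ~ y + n, labeller = label_both)  # (3) one column per label
```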
Hi Prof. @tdhock,
Please find the updated binomial loss:
R binomial plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/binomial_loss.png
AFT plot - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/R/loss_grad_hess_aft.png
Updated Python plots:
LogLoss - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_LogLoss_Data_Mixed.png
Mae - Mixed - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Uncensored.png
LogLoss - Uncensored - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Logloss_Data_Uncensored.png
Mae - Uncensored - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/py/Nesterov_False_Loss_Mae_Data_Uncensored.png
C++ functions for the pdf and cdf of the Normal and Logistic distributions:
https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/C%2B%2B
I have tested with x = 0 for each distribution, and the results matched R's built-in functions.
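For reference, the R side of that x = 0 check might look like this (a sketch; the C++ function names are not shown here):

```r
## Reference values from R's built-in distributions, to compare against
## the C++ implementations at x = 0.
x <- 0
c(normal.pdf   = dnorm(x),    # 1/sqrt(2*pi), about 0.3989423
  normal.cdf   = pnorm(x),    # 0.5
  logistic.pdf = dlogis(x),   # 0.25
  logistic.cdf = plogis(x))   # 0.5
```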
Thanks, @hcho3 for changing the hessian formula for interval data.
@avinashbarnwal The AFT loss plots now look good. I'll take a look at the C++ code soon.
Hi Prof. @tdhock and @hcho3
I have been working on implementing the loss, negative gradient, and hessian functions for AFT in C++ and testing them through plots in Python.
For details- Please find the link. https://github.com/avinashbarnwal/GSOC-2019/tree/master/AFT/C%2B%2B
Please let me know your thoughts.
@avinashbarnwal I modified your notebook slightly to plot the uncensored AFT loss with yhat on the X axis. The curve for the normal should not drop off a cliff like that.
In general, let's try to reproduce the R plots using C++ code. That way, we have assurance that C++ code does what R code does.
right, the loss function should be convex (for a fixed scale parameter)
hi there @avinashbarnwal is the R learning algo for AFT losses working yet?
When it is I would suggest modifying https://github.com/tdhock/neuroblastoma-data/blob/master/iregnet.R which I wrote for benchmarking a different learning algorithm for censored outputs, on 33 different labeled data sets.
@tdhock Thanks for the link. The datasets should come in handy.
Prof. @tdhock,
For both AFT and binomial loss, the R learning algorithms (loss, negative gradient and hessian) are working. I am currently writing the C++ functions for the binomial loss. Soon we will have everything in the xgboost package.
ok where is the fork/branch with the AFT models? can I install it?
Here is a new benchmark script https://github.com/tdhock/neuroblastoma-data/blob/master/xgboost.R#L41
The benchmark includes 33 different data sets, each with several train/test splits (designated by fold ID numbers in folds.csv file).
To run the code please fork then clone that repo, and then modify the "xgboost.R" script so xgboost works on these learning problems. (run the script / start R in the neuroblastoma-data directory)
The script also runs baselines penaltyLearning::IntervalRegressionCV (penaltyLearning.scale1) and constant (always predict 0).
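For anyone replicating this, a rough sketch of the fold-based splitting the script performs (the 'fold' column name in folds.csv is an assumption, not confirmed from the repo):

```r
## Sketch: enumerate the 33 cross-validation setups and hold out one fold.
folds.csv.vec <- Sys.glob("data/*/cv/*/folds.csv")
one.folds <- read.csv(folds.csv.vec[1])
test.fold <- 1
is.test <- one.folds$fold == test.fold   # 'fold' column assumed
## train xgboost on the observations with is.test == FALSE,
## then evaluate predictions on the held-out fold.
```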
@avinashbarnwal
> R learning algorithms (loss, negative gradient and hessian) are working
I don't think you have a boosting PoC in R? We do have a boosting PoC in Python, however.
Prof. @tdhock,
We haven't added the AFT loss function to the xgboost package yet. We are testing it first in C++, and then I will add it to the package. Please let me know if you want a proof of concept in R.
@avinashbarnwal For now, can we run Python boosting PoC on the 33 datasets?
no problem if it is not ready yet but you may consider using those data sets to test/debug the algo.
if you don't use that R script then you need to know that
> folds.csv.vec <- Sys.glob("data/*/cv/*/folds.csv")
> folds.csv.vec
[1] "data/ATAC_JV_adipose/cv/equal_labels/folds.csv"
[2] "data/CTCF_TDH_ENCODE/cv/equal_labels/folds.csv"
[3] "data/H3K27ac-H3K4me3_TDHAM_BP/cv/equal_labels/folds.csv"
[4] "data/H3K27ac_TDH_some/cv/equal_labels/folds.csv"
[5] "data/H3K27me3_RL_cancer/cv/equal_labels/folds.csv"
[6] "data/H3K27me3_TDH_some/cv/equal_labels/folds.csv"
[7] "data/H3K36me3_AM_immune/cv/equal_labels/folds.csv"
[8] "data/H3K36me3_TDH_ENCODE/cv/equal_labels/folds.csv"
[9] "data/H3K36me3_TDH_immune/cv/equal_labels/folds.csv"
[10] "data/H3K36me3_TDH_other/cv/equal_labels/folds.csv"
[11] "data/H3K4me1_TDH_BP/cv/equal_labels/folds.csv"
[12] "data/H3K4me3_PGP_immune/cv/equal_labels/folds.csv"
[13] "data/H3K4me3_TDH_ENCODE/cv/equal_labels/folds.csv"
[14] "data/H3K4me3_TDH_immune/cv/equal_labels/folds.csv"
[15] "data/H3K4me3_TDH_other/cv/equal_labels/folds.csv"
[16] "data/H3K4me3_XJ_immune/cv/equal_labels/folds.csv"
[17] "data/H3K9me3_TDH_BP/cv/equal_labels/folds.csv"
[18] "data/detailed/cv/R-3.6.0-chrom/folds.csv"
[19] "data/detailed/cv/R-3.6.0-profileID/folds.csv"
[20] "data/detailed/cv/R-3.6.0-profileSize/folds.csv"
[21] "data/detailed/cv/R-3.6.0-sequenceID/folds.csv"
[22] "data/detailed/cv/chrom/folds.csv"
[23] "data/detailed/cv/profileID/folds.csv"
[24] "data/detailed/cv/profileSize/folds.csv"
[25] "data/detailed/cv/sequenceID/folds.csv"
[26] "data/systematic/cv/R-3.6.0-chrom/folds.csv"
[27] "data/systematic/cv/R-3.6.0-profileID/folds.csv"
[28] "data/systematic/cv/R-3.6.0-profileSize/folds.csv"
[29] "data/systematic/cv/R-3.6.0-sequenceID/folds.csv"
[30] "data/systematic/cv/chrom/folds.csv"
[31] "data/systematic/cv/profileID/folds.csv"
[32] "data/systematic/cv/profileSize/folds.csv"
[33] "data/systematic/cv/sequenceID/folds.csv"
>
Thanks, Prof. @tdhock and @hcho3. I will show the results for the given datasets.
Also if we have time we should consider supporting the Extreme Value distribution (in addition to logistic and normal), since in stats people often like to use Weibull/Exponential models, and they are special cases of the EV dist.
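For a head start on that, the uncensored extreme value loss might be sketched as follows (my sketch following the usual Gumbel parametrization log T = eta + sigma * W; not code from the repo):

```r
## Uncensored extreme value (Gumbel) AFT loss: W = (log(t) - eta) / sigma has
## density exp(w - exp(w)), so the loss is exp(w) - w (plus constants in eta).
sigma <- 1
t.obs <- 1                       # assumed uncensored label
eta <- seq(-5, 5, by = 0.1)
w <- (log(t.obs) - eta) / sigma
ev.loss <- exp(w) - w + log(sigma)  # -log density, dropping the log(t.obs) constant
```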
Hi Prof. @tdhock and @hcho3,
I have added the C++ code for the binomial loss and a notebook for validating it:
C++ - https://github.com/avinashbarnwal/GSOC-2019/tree/master/BinomialLoss/C%2B%2B
Notebook - https://github.com/avinashbarnwal/GSOC-2019/blob/master/BinomialLoss/Python%20Notebook/Visualizing%20distributions.ipynb
I am in the process of testing AFT on the 33 datasets. Link - https://github.com/avinashbarnwal/GSOC-2019/blob/master/AFT/test/data/neuroblastoma-data-master/src/notebook/001_data_massage.ipynb
For the Extreme Value distribution, I will look into it, since it is covered in the survival document you shared. I will update the R, Python, and C++ plots accordingly.
@avinashbarnwal I left a comment about the binomial loss implementation: avinashbarnwal/GSOC-2019#5. We do not want to use n! / ((n-r)! r!) because 1) it involves a lot of redundant computation and 2) it may cause overflow.
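One standard alternative, sketched in R (the C++ analog would use std::lgamma):

```r
## Log binomial coefficient via lgamma: no factorials, so no overflow
## and no redundant repeated products.
log.choose <- function(n, r) lgamma(n + 1) - lgamma(r + 1) - lgamma(n - r + 1)
stopifnot(all.equal(log.choose(10, 3), log(choose(10, 3))))
log.choose(1000, 500)  # finite, whereas factorial(1000) overflows to Inf
```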
21MAY2019 Relevant Information -
- There is no need for the survival object; it is more of a legacy.
- How to handle sigma in AFT? Ans: treat it as a hyperparameter.
- What is the dimension of the predicted value in interval regression? Ans: it is always one real value, not an interval.
Relevant documents:
- https://github.com/tdhock/aft-poster/blob/master/HOCKING-AFT.pdf
- http://members.cbio.mines-paristech.fr/~thocking/survival.pdf
- https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-269
Please Check. @tdhock, @hcho3