avinashbarnwal / AFTXGBoostPaper

AFT XGBOOST

How to Fast - Xgboost Hyperparameter Search #4

Open avinashbarnwal opened 4 years ago

avinashbarnwal commented 4 years ago

Hi @hcho3 and Prof. @tdhock

We have 18,000 combinations for hyperparameter tuning. Please find the code here: https://github.com/avinashbarnwal/aftXgboostPaper/blob/master/src/R/production/xgboost/xgboost_hyper.ipynb

I am looking to optimize this. Please let me know if you have any ideas to make it fast.
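For context, an exhaustive grid grows multiplicatively in the number of values per parameter, which is how 18,000 combinations appear so quickly. A minimal sketch (parameter names and value ranges here are hypothetical, not the notebook's actual grid) of counting the grid and randomly subsampling it:

```python
# Hypothetical grid; the real 18,000-point grid is defined in xgboost_hyper.ipynb.
import itertools
import random

grid = {
    "eta": [0.01, 0.05, 0.1, 0.3],          # 4 values (assumed)
    "max_depth": [2, 4, 6, 8, 10],          # 5 values (assumed)
    "min_child_weight": [0.1, 1.0, 10.0],   # 3 values (assumed)
    "reg_lambda": [0.01, 0.1, 1.0, 10.0, 100.0],  # 5 values (assumed)
}

# Full Cartesian product: 4 * 5 * 3 * 5 = 300 combinations.
all_combos = list(itertools.product(*grid.values()))
print(len(all_combos))

# Random search: evaluate only a fixed budget of sampled points
# instead of the whole grid.
random.seed(0)
budget = 30
sampled = random.sample(all_combos, budget)
```

Random search with a fixed budget is a common cheap alternative to the full grid when each evaluation is expensive.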

tdhock commented 4 years ago

I am not an expert on xgboost hyper-parameters. I thought you already discussed this with @hcho?

avinashbarnwal commented 4 years ago

We discussed it, but I didn't realize it would take this long in R.

hcho commented 4 years ago

Please double check who you are mentioning.


hcho3 commented 4 years ago

@avinashbarnwal Can you be more specific? How long is each combination taking?

avinashbarnwal commented 4 years ago

Hi @hcho3,

It takes ~42 seconds to run one iteration.

Please find the code here for one iteration - https://github.com/avinashbarnwal/aftXgboostPaper/blob/master/src/R/production/xgboost/xgboost_hyper.ipynb
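At roughly 42 seconds per combination, a back-of-the-envelope check shows why the full grid is infeasible on a single machine:

```python
# Back-of-the-envelope cost of the full grid at ~42 s per combination.
n_combos = 18_000
secs_per_combo = 42
total_hours = n_combos * secs_per_combo / 3600
print(total_hours)  # 210 hours, i.e. roughly 9 days single-threaded
```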

avinashbarnwal commented 4 years ago

Hi @hcho3, as we discussed, I am porting the hyper-parameter tuning code from R to Python, since grid search is very slow in R and there is no R support for packages like optuna.

https://github.com/pfnet/optuna

hcho3 commented 4 years ago

I also granted @avinashbarnwal access to a fast machine.

avinashbarnwal commented 4 years ago

Thanks @hcho3.

avinashbarnwal commented 4 years ago

Hi Prof. @tdhock and @hcho3 ,

I have the results for intervalCV, survival regression, and xgboost.

https://github.com/avinashbarnwal/aftXgboostPaper/tree/master/result/ATAC_JV_adipose

Please let me know your thoughts.

tdhock commented 4 years ago

did you double-check the predictions/accuracy of IntervalRegressionCV using my precomputed files, as mentioned in #2 ?

tdhock commented 4 years ago

also did you run it for all data sets or just the ATAC data set?

tdhock commented 4 years ago

also are the 1.csv 2.csv etc files predictions? if so they should have a column for sequenceID so we can compute accuracy metrics. please add one.

avinashbarnwal commented 4 years ago

also did you run it for all data sets or just the ATAC data set?

I have done it only for the ATAC data set.

avinashbarnwal commented 4 years ago

also are the 1.csv 2.csv etc files predictions? if so they should have a column for sequenceID so we can compute accuracy metrics. please add one.

Done. Please check this file: https://github.com/avinashbarnwal/aftXgboostPaper/blob/master/result/ATAC_JV_adipose/intervalCV/%201%20.csv
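For reference, a minimal sketch of writing a predictions file with a sequenceID column (the second column name, IDs, and values here are hypothetical, not taken from the actual result files):

```python
# Write predictions with a sequenceID column using the csv module.
import csv
import io

sequence_ids = ["seq_1", "seq_2", "seq_3"]  # hypothetical IDs
predictions = [2.31, 1.87, 3.02]            # hypothetical predicted values

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["sequenceID", "prediction"])  # header names sequenceID first
for sid, pred in zip(sequence_ids, predictions):
    writer.writerow([sid, pred])

print(buf.getvalue())
```

Keying each row by sequenceID is what lets downstream code join predictions against the labels to compute accuracy metrics.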

avinashbarnwal commented 4 years ago

did you double-check the predictions/accuracy of IntervalRegressionCV using my precomputed files, as mentioned in #2 ?

I don't think these are test folds; rather, they are cross-validation folds. Please check the end of the notebook: https://github.com/avinashbarnwal/aftXgboostPaper/blob/master/src/R/production/penaltyLearning/intervalCV.ipynb

I think we need to evaluate on the test folds, not the cross-validation folds.

tdhock commented 4 years ago

good that you store sequenceIDs in prediction files now.

also, which distribution did you use? I think you should compute prediction files for all distributions, and for all test folds.

for my predictions.csv files there is one row for each sequenceID in the test set. Yours should have the same. It looks like your predictions are for different sequenceIDs; you should double-check your code.
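A quick sanity check along these lines (the IDs below are hypothetical stand-ins for the real test-fold sequenceIDs):

```python
# Predictions should contain exactly one row per sequenceID in the test fold.
test_fold_ids = {"seq_1", "seq_2", "seq_3"}   # hypothetical test-fold IDs
pred_ids = ["seq_1", "seq_2", "seq_3"]        # sequenceID column of predictions

# No duplicate rows per sequence.
assert len(pred_ids) == len(set(pred_ids)), "duplicate sequenceIDs"
# Same set of sequences as the test fold, none missing, none extra.
assert set(pred_ids) == test_fold_ids, "prediction IDs differ from test fold"
```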

avinashbarnwal commented 4 years ago

This is based on the best results on the cross-validation folds. I am now treating the distribution as a hyper-parameter.

I am creating the folds based on the folds data folder. I would like a quick call to sort out this hiccup.