dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Google Summer of Code: new loss functions in XGBoost #4242

Closed tdhock closed 5 years ago

tdhock commented 5 years ago

I was wondering if there are any xgboost maintainers who would be interested in co-mentoring a Google Summer of Code (GSOC) student with me? I am a machine learning researcher who is interested in using xgboost for some standard loss/objective functions which are not included in xgboost. I am also an expert R package developer and admin/mentor for R in GSOC, so we could probably get the student funded under R-GSOC: https://github.com/rstats-gsoc/gsoc2019/wiki/table%20of%20proposed%20coding%20projects

However, I am not an expert on xgboost internals, so it would be great if somebody who is could co-mentor. Any takers?

For (left, right, and interval) censored outputs, AFT (https://en.wikipedia.org/wiki/Accelerated_failure_time_model) losses:

I found some related issues: https://github.com/dmlc/xgboost/issues/749 https://github.com/dmlc/xgboost/issues/513 https://github.com/dmlc/xgboost/issues/326
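
For readers unfamiliar with AFT losses on censored outputs, the idea can be sketched as follows. This is only an illustration under stated assumptions, not code from this thread and not XGBoost's implementation: it assumes a log-normal error distribution, and the function name `aft_normal_nll` and the `sigma` scale parameter are hypothetical. Each label is an interval `[lower, upper]` on the observed time; `lower == upper` means uncensored, `lower = 0` means left-censored, and `upper = inf` means right-censored.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF; math.erf handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aft_normal_nll(pred, lower, upper, sigma=1.0):
    """Negative log-likelihood of one observation under a log-normal AFT model.

    pred  : predicted mean of log(time) (hypothetical model output)
    lower : lower bound on the observed time (0 for left-censored)
    upper : upper bound on the observed time (math.inf for right-censored)
    sigma : assumed scale of the error distribution
    """
    if lower == upper:
        # Uncensored: use the density of the time itself (change of variables).
        z = (math.log(lower) - pred) / sigma
        return -math.log(norm_pdf(z) / (sigma * lower))

    # Censored: probability mass that the model assigns to the interval.
    z_hi = math.inf if math.isinf(upper) else (math.log(upper) - pred) / sigma
    z_lo = -math.inf if lower <= 0 else (math.log(lower) - pred) / sigma
    prob = norm_cdf(z_hi) - norm_cdf(z_lo)
    return -math.log(max(prob, 1e-12))  # clip to avoid log(0)


# Example: one label of each kind, all with prediction 0 (i.e. median time 1).
uncensored = aft_normal_nll(0.0, 1.0, 1.0)
right_censored = aft_normal_nll(0.0, 1.0, math.inf)
left_censored = aft_normal_nll(0.0, 0.0, 1.0)
interval_censored = aft_normal_nll(0.0, 0.5, 2.0)
```

To use a loss like this as an xgboost objective, one would additionally need its first and second derivatives with respect to `pred`, since xgboost objectives supply a gradient and a Hessian per example.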

RAMitchell commented 5 years ago

Don't have time myself but great idea.

hcho3 commented 5 years ago

So you don’t have to be a Google employee to be a mentor in GSOC? I am interested

tdhock commented 5 years ago

no you don't have to be a google employee!

It would be great to co-mentor with you @hcho3

I have started writing a wiki page for the project here https://github.com/rstats-gsoc/gsoc2019/wiki/xgboost-loss-functions

can you please add some info to the project and tests sections? The tests should be a range of tasks, from simple to complex, that the student can do prior to GSOC in order to demonstrate to us mentors that they would be capable of doing the project.

hcho3 commented 5 years ago

@tdhock Can candidates code in C++? I can add some tests related to C++ core and XGBoost internals

tdhock commented 5 years ago

yes that would be great if you could add tests related to C++ core and xgboost internals

we probably should not accept any student who has not demonstrated C++ coding skills (that is the point of writing these tests)

hcho3 commented 5 years ago

@tdhock I'll get to writing tests soon. What would be my responsibilities as a co-mentor? I'm trying to gauge my time commitment. Also, are mentees all hosted on the Google campus?

tdhock commented 5 years ago

no, students are not hosted on the Google campus; everything is online

mentor responsibility is basically for one of us to be available for a 1-hour Skype call each week during the summer, and to answer student questions via email https://developers.google.com/open-source/gsoc/help/responsibilities

hcho3 commented 5 years ago

@tdhock Got it. Thanks for the clarification. I'd be more than happy to help.

hcho3 commented 5 years ago

@tdhock What's the deadline for the tests? When do you want them by?

tdhock commented 5 years ago

asap! students need to do the tests in the next week or two, then work on an application to submit to Google, due Apr 7

hcho3 commented 5 years ago

Got it. I'll try to post tests as early as possible. Expect them no later than in ~three~ two days from now.

tdhock commented 5 years ago

I added a test about using the functionality in the current xgboost package, but it would be great if you could add some tests about C++ coding / xgboost internals

thvasilo commented 5 years ago

Linking back to a related topic on the forum for left-truncated data: https://discuss.xgboost.ai/t/support-for-left-truncated-data-time-dependent-covariates-for-cox-regression/651

hcho3 commented 5 years ago

@tdhock Almost there, it should be done by tomorrow

hcho3 commented 5 years ago

@tdhock I've put up five tests (three Easy, one Medium, one Hard). Can you review?

tdhock commented 5 years ago

the first five easy questions are a bit theoretical (GSOC tests usually tend to be more practical coding exercises) but for this project I think that is OK, as the student should understand the theory.

medium and hard tests look great.

thanks for your help and let's hope we find a good student.

hcho3 commented 5 years ago

@tdhock Maybe we should move the binary classification question to "Medium." What do you think?

tdhock commented 5 years ago

either way is fine with me

hcho3 commented 5 years ago

Done. Let's hope for the best.

hcho3 commented 5 years ago

Closing this in favor of #4491