selection criteria of regression datasets?

gsverhoeven commented 2 years ago

Dear Philipp Probst,

Thanks for creating this nice package! I found your regression results comparing tuned ranger to default ranger, which is exactly what I needed.

However, out of curiosity, how did you select the 29/31 datasets that you used for this comparison? I searched OpenML, but it seems that a curated suite of datasets for regression does not (yet) exist.

Regards Gertjan

PhilippPro commented 2 years ago

Dear Gertjan,

The selection was done some years ago. We have an unfinished paper in which we describe the selection: https://github.com/PhilippPro/OpenML-bench

There you can find a description and also the R code for the selection of the datasets.

For downloading you can use the following code:

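# Set both JVM options in one call; a second options() call would overwrite the first.
# -XX:+UseG1GC should avoid Java GC overhead, -Xmx16000m sets the maximum heap size.
options(java.parameters = c("-XX:+UseG1GC", "-Xmx16000m"))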

library(OpenML)

tasks = listOMLTasks(tag = "OpenML-Reg19")
ds = listOMLDataSets(tag = "OpenML-Reg19")

For applying learners with mlr on these datasets you can use this code:

library(mlr)
lrns = list(
  makeLearner("regr.glmnet"), makeLearner("regr.rpart"), makeLearner("regr.kknn")
) 
measures = list(mse, mae, medse, medae, rsq, spearmanrho, kendalltau, timetrain)

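bmr = list()
# Loop over all tasks except number 28, which is left out
for(i in seq_len(nrow(tasks))[-28]) {
  print(i)
  set.seed(123 + i)  # reproducible resampling splits per task
  # Download the OpenML task and convert it to an mlr task
  task = getOMLTask(tasks$task.id[i])
  task = convertOMLTaskToMlr(task)$mlr.task
  # 2 x 5-fold repeated cross-validation, same splits for all learners
  rdesc = makeResampleDesc("RepCV", reps = 2, folds = 5)
  rin = makeResampleInstance(rdesc, task)
  bmr[[i]] = benchmark(lrns, task, rin, measures = measures, models = FALSE)
}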

All of this is documented in the repository (paper section) linked above.
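The loop above leaves a list with one benchmark result per task and an empty slot for the skipped task. A minimal sketch for stacking the aggregated measures into one data frame could look like this; `collect_results` is a hypothetical helper name and is not part of the repository:

```r
library(mlr)

# Hypothetical helper (not from the repository): stack the aggregated
# performances of all finished benchmarks into one data frame.
collect_results = function(bmr_list) {
  # Skipped iterations (index 28 in the loop above) are NULL; drop them
  done = Filter(Negate(is.null), bmr_list)
  # One row per task/learner pair, one column per aggregated measure
  do.call(rbind, lapply(done, getBMRAggrPerformances, as.df = TRUE))
}

# Usage, once the loop has finished:
# res = collect_results(bmr)
# head(res)
```

With `as.df = TRUE`, `getBMRAggrPerformances` returns a data frame, so the per-task results can be combined with `rbind`.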

I hope this helps. If you know of more datasets that could be used, let me know; 30 datasets is still not very many.

Best regards, Philipp

gsverhoeven commented 2 years ago

Hi Philipp, this is great and very helpful, thanks.

I found something weird with tecator and created an issue over at OpenML (https://github.com/openml/openml-data/issues/44).

Also: I have made an interesting observation from your regression results comparing tuneRanger with default ranger on R-squared. I am currently writing a blog post on it; when I have a draft finished I'll let you know!

regards Gertjan

PhilippPro commented 1 year ago

Hi Gertjan,

Is the blog post ready? I am curious about what you found out!

Best regards, Philipp

gsverhoeven commented 1 year ago

Hi Philipp, I found out that all three datasets that showed large differences in R-squared between tuned and default ranger contained only a few relevant (informative) predictors in the presence of many irrelevant (noise) variables. This was part of the blog post we discussed earlier this year (which was picked up on Twitter during the summer and got some exposure that way :-)). The link to the post is https://gsverhoeven.github.io/post/random-forest-rfe_vs_tuning/
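To try this kind of setting out, here is a rough sketch on simulated data. It is not taken from the blog post or from the benchmark; the sample size, the number of noise variables and the `mtry` value of 50 are arbitrary assumptions. It fits ranger once with the default `mtry` and once with a larger one and reports the out-of-bag R-squared of both fits:

```r
library(ranger)

set.seed(1)
n = 500
p_noise = 95

# Simulated data (assumption): 5 informative predictors plus 95 noise predictors
x = matrix(rnorm(n * (5 + p_noise)), nrow = n)
colnames(x) = paste0("x", seq_len(ncol(x)))
y = rowSums(x[, 1:5]) + rnorm(n)  # only x1..x5 carry signal
d = data.frame(y = y, x)

# Default mtry in ranger is floor(sqrt(p)), here 10 of 100 predictors
fit_default = ranger(y ~ ., data = d, num.trees = 500)

# Larger mtry, so informative predictors are more often among the split candidates
fit_large = ranger(y ~ ., data = d, num.trees = 500, mtry = 50)

# Out-of-bag R-squared of both fits
c(default = fit_default$r.squared, large_mtry = fit_large$r.squared)
```

The idea is that with a small `mtry` many splits only have noise variables to choose from, which is where tuning could make a difference.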

PhilippPro / tuneRanger

selection criteria of regression datasets? #10