Closed gsverhoeven closed 1 year ago
Dear Gertjan,
the selection was done some years ago. We have an unfinished paper in which we describe the selection: https://github.com/PhilippPro/OpenML-bench
There you can see a description and also R-Code for the selection of the datasets.
For downloading you can use the following code:
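# Pass both JVM flags in a single call; a second options(java.parameters = ...) call would overwrite the first.
# -XX:+UseG1GC should avoid Java GC overhead, -Xmx16000m sets the maximum heap size.
options(java.parameters = c("-XX:+UseG1GC", "-Xmx16000m"))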
library(OpenML)
tasks = listOMLTasks(tag = "OpenML-Reg19")
ds = listOMLDataSets(tag = "OpenML-Reg19")
For applying learners with mlr on these datasets you can use this code:
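library(mlr)
# Learners to compare on every task
lrns = list(
  makeLearner("regr.glmnet"), makeLearner("regr.rpart"), makeLearner("regr.kknn")
)
# Performance measures reported for each learner
measures = list(mse, mae, medse, medae, rsq, spearmanrho, kendalltau, timetrain)
bmr = list()
# Loop over all tasks except number 28, which is skipped here
for (i in seq_len(nrow(tasks))[-28]) {
  print(i)
  set.seed(123 + i)
  # Download the OpenML task and convert it to an mlr task
  task = getOMLTask(tasks$task.id[i])
  task = convertOMLTaskToMlr(task)$mlr.task
  # 2 x 5-fold repeated cross-validation, with the same splits for all learners
  rdesc = makeResampleDesc("RepCV", reps = 2, folds = 5)
  rin = makeResampleInstance(rdesc, task)
  bmr[[i]] = benchmark(lrns, task, rin, measures = measures, models = FALSE)
}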
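One detail of the loop above: because index 28 is skipped, `bmr[[28]]` stays `NULL` while the list still has one slot per task. A minimal base-R sketch of how the empty slot could be dropped before combining results is shown here. The names `bmr_demo`, `bmr_filled` and `all_scores`, as well as the small data frames, are hypothetical stand-ins and not real mlr `BenchmarkResult` objects.

```r
# Stand-in for the `bmr` list: slots 1..5 with slot 3 skipped.
# Each hypothetical entry is a small data frame of scores,
# not a real mlr BenchmarkResult.
bmr_demo <- list()
for (i in seq_len(5)[-3]) {
  bmr_demo[[i]] <- data.frame(
    task = i,
    learner = c("a", "b"),
    mse = c(i, i + 0.5)
  )
}

# The skipped slot still exists, but it is empty.
length(bmr_demo)
is.null(bmr_demo[[3]])

# Drop empty slots, then stack the per-task results into one data frame.
bmr_filled <- Filter(Negate(is.null), bmr_demo)
all_scores <- do.call(rbind, bmr_filled)
```

With real benchmark results, the same filtering step would presumably come first; mlr's `mergeBenchmarkResults()` is one option for combining the remaining objects afterwards.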
This all is documented in the repository (paper section) of the link I gave you above.
I hope this helps. If you know of more datasets that could be used, let me know; 30 datasets is still not very many.
Best regards, Philipp
Hi Philipp, this is great and very helpful, thanks.
I found something weird with tecator, so I created an issue over at OpenML (https://github.com/openml/openml-data/issues/44).
Also: I made an interesting observation in your regression results comparing tuneRanger with default ranger on R-squared. I am currently writing a blog post about it; when I have a draft finished, I'll let you know!
regards Gertjan
Hi Gertjan,
is the blog post ready? I am curious about what you found out!
Best regards, Philipp
Hi Philipp, I found out that all three datasets that showed large differences in R-squared between tuned and default ranger contained only a few relevant (informative) predictors among many irrelevant (noise) variables. This is covered in the blog post we discussed earlier this year (it was picked up on Twitter during the summer and got some exposure that way :-)). The link to the post is https://gsverhoeven.github.io/post/random-forest-rfe_vs_tuning/
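One plausible reason why such datasets are sensitive to tuning can be illustrated with a small calculation. This is a sketch under assumptions and is not taken from the blog post. It assumes that the relevant hyperparameter is `mtry`, that the `mtry` candidate variables at each split are sampled without replacement, and that ranger's default is `floor(sqrt(p))`. The function name `p_informative` and the values of `p` and `k` are hypothetical.

```r
# Probability that the mtry candidate variables drawn for one split
# include at least one informative predictor, when k of p predictors
# are informative and candidates are sampled without replacement.
p_informative <- function(p, k, mtry) {
  1 - choose(p - k, mtry) / choose(p, mtry)
}

p <- 100  # hypothetical total number of predictors
k <- 3    # hypothetical number of informative predictors

mtry_default <- floor(sqrt(p))  # assumed default: floor(sqrt(p))

prob_default <- p_informative(p, k, mtry_default)  # small mtry
prob_large   <- p_informative(p, k, 50)            # larger, e.g. tuned, mtry
prob_all     <- p_informative(p, k, p)             # all predictors are candidates

print(c(default = prob_default, large = prob_large, all = prob_all))
```

Under these assumptions, a small `mtry` leaves most splits without any informative candidate, while a larger `mtry` makes an informative candidate much more likely. That is consistent with tuning mattering most when noise variables dominate.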
Dear Philipp Probst,
Thanks for creating this nice package! I found your regression results comparing tuned ranger to default ranger, which is exactly what I needed.
However, out of curiosity, how did you select the 29/31 datasets that you used for this comparison? I searched OpenML, but it seems that a curated suite of regression datasets does not (yet) exist.
Regards Gertjan