Closed daniel0710goldberg closed 4 years ago
Thank you for catching this error! We need to discuss whether we should keep this train/test selector in the dataset. If we keep the selector, the metafeatures need to annotate it, and pmlb should ignore the column based on that annotation.
Besides that, I think it should still be a classification task rather than a regression task because:
I think we should remove the train/test selector column completely to avoid serious downstream errors. Plus, this column (despite being "not entirely random") does not add meaningful information to the data.
In this survey (pdf), the authors discuss the papers listed on the UCI page that cite the dataset. Many studies used it incorrectly. A few other studies (Turney, Tang et al.) dichotomize the 6th variable (number of alcoholic drinks) using 3 as the numeric threshold (x6 < 3 vs x6 >= 3). If we want to keep the task as classification, I think we should use this threshold and note in the metadata description of `target` how it's dichotomized.
(Note that the original study used different threshold values in different experiments.)
Alternatively, we can have both datasets, `bupa_class` and `bupa_reg`, for the two different tasks.
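As a rough illustration of the two-dataset idea, the sketch below derives a regression variant and a classification variant from the same rows. The row values are invented, the field names (`mcv`, `drinks`, `selector`) and helper functions are hypothetical, and only the threshold of 3 comes from the discussion above.

```python
# Hypothetical rows; the values are invented for illustration only.
rows = [
    {"mcv": 85, "drinks": 0.5, "selector": 1},
    {"mcv": 90, "drinks": 3.0, "selector": 2},
    {"mcv": 88, "drinks": 6.0, "selector": 1},
]

THRESHOLD = 3  # x6 < 3 vs x6 >= 3, as in the studies mentioned above


def drop_selector(row):
    """Return a copy of the row without the train/test selector column."""
    return {key: value for key, value in row.items() if key != "selector"}


def to_regression(row):
    """Use the number of drinks as a numeric target."""
    out = drop_selector(row)
    out["target"] = out.pop("drinks")
    return out


def to_classification(row, threshold=THRESHOLD):
    """Dichotomize the number of drinks: 1 if drinks >= threshold, else 0."""
    out = to_regression(row)
    out["target"] = int(out["target"] >= threshold)
    return out


bupa_reg = [to_regression(r) for r in rows]
bupa_class = [to_classification(r) for r in rows]
```

Both variants drop the selector column, so neither can mistake it for a feature or a label.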
I fixed this issue with the commits above.
The current target is a train/test selector, not a dependent variable. According to OpenML, most analyses that were done properly on this dataset use the 6th feature, "Drinks", as the target.
The PMLB dataset should be changed to reflect this: treat the 6th feature, currently "Drinks", as the new `target`, and remove the 7th feature.
P.S. The metadata.yaml file currently reflects this change, minus a TODO that needs to be removed.
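A minimal sketch of the proposed change, assuming the raw file is headerless CSV with seven fields per row, where the 6th is the number of drinks and the 7th is the selector. The column names, the `load_bupa` helper, and the sample values are all hypothetical and are not PMLB's actual loading code.

```python
import csv
import io

# Assumed column order of the raw file (no header row).
COLUMNS = ["mcv", "alkphos", "sgpt", "sgot", "gammagt", "drinks", "selector"]


def load_bupa(text):
    """Parse raw CSV text into (features, target), discarding the selector."""
    features, target = [], []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # skip blank lines
            continue
        record = dict(zip(COLUMNS, (float(x) for x in fields)))
        record.pop("selector")               # 7th field: train/test split marker
        target.append(record.pop("drinks"))  # 6th field becomes the target
        features.append(record)
    return features, target


# Invented sample rows, for illustration only.
sample = "80,70,30,25,40,2.0,1\n90,60,20,22,35,4.5,2\n"
X, y = load_bupa(sample)
```

After this step `X` holds only the five remaining measurements per row and `y` holds the number of drinks.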