EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License
805 stars · 135 forks

Wrong target! #54

Closed daniel0710goldberg closed 4 years ago

daniel0710goldberg commented 4 years ago

The current target is a train/test selector, not a dependent variable. According to OpenML, most analyses properly done on this dataset use the 6th feature, "Drinks", as the target.

The PMLB dataset should be changed to reflect this: treat the 6th feature, currently "Drinks", as the new target, and remove the 7th feature (the train/test selector).

P.S. The metadata.yaml file already reflects this change, except for a TODO that needs to be removed.
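
A minimal sketch of the proposed restructuring, assuming the raw file uses the standard UCI liver-disorders column names (mcv, alkphos, sgpt, sgot, gammagt, drinks, selector; file names and headers here are assumptions, not the repo's actual build script):

```python
import pandas as pd

# Assumed column names from the UCI liver-disorders file; adjust to the
# actual raw PMLB source file if the headers differ.
cols = ["mcv", "alkphos", "sgpt", "sgot", "gammagt", "drinks", "selector"]
df = pd.read_csv("bupa.data", names=cols)

# Drop the 7th column (train/test selector) and treat "drinks" as the target,
# using PMLB's convention of naming the dependent variable "target".
df = df.drop(columns=["selector"]).rename(columns={"drinks": "target"})
df.to_csv("bupa.tsv", sep="\t", index=False)
```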

weixuanfu commented 4 years ago

Thank you for catching this error! We need to discuss whether we should keep the train/test selector in the dataset or not. If we keep the selector, then the metadata needs to annotate it and pmlb should ignore the column based on that annotation.

Besides that, I think it should still be a classification task rather than a regression task because:

  1. drinks has only 16 unique values across 345 samples (see the quick check below)
  2. It was treated as a classification benchmark from the beginning (from the important note on OpenML: researchers who wish to use this dataset as a classification benchmark should follow the method used in experiments by the donor (Forsyth & Rada, 1986, Machine learning: applications in expert systems and information retrieval))
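
A quick check of point 1 using pmlb's fetch_data (the dataset name "bupa" and the column names are assumptions here):

```python
from pmlb import fetch_data

# The column holding the drink counts may be named "drinks" (pre-fix)
# or "target" (post-fix); fall back accordingly.
df = fetch_data("bupa")
col = "drinks" if "drinks" in df.columns else "target"
print(df[col].nunique(), "unique values across", len(df), "rows")
```
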
trangdata commented 4 years ago

I think we should remove the train/test selector column completely to avoid serious downstream errors. Plus, this column (despite being "not entirely random") does not add meaningful information to the data.

In this survey (pdf), they discussed papers listed on the UCI page that cited the dataset; many studies used it incorrectly. A few other studies (Turney, Tang et al.) dichotomize the 6th variable (number of alcoholic drinks) using 3 as the numeric threshold (x6 < 3 vs x6 >= 3). If we want to keep the task as classification, I think we should use this threshold and note in the metadata description of the target how it is dichotomized.

(Note that the original study used different threshold values in different experiments.)
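
A sketch of that dichotomization, using the threshold of 3 drinks described above (file and column names are hypothetical, not the repo's actual pipeline):

```python
import pandas as pd

# Assumes the raw UCI file with the conventional column names.
cols = ["mcv", "alkphos", "sgpt", "sgot", "gammagt", "drinks", "selector"]
df = pd.read_csv("bupa.data", names=cols)

# x6 < 3 -> class 0, x6 >= 3 -> class 1, matching the split used by
# Turney and Tang et al.; drop the drink count and the selector afterwards.
df["target"] = (df["drinks"] >= 3).astype(int)
df = df.drop(columns=["drinks", "selector"])
```
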

Alternatively, we could provide two datasets, bupa_class and bupa_reg, for the two different tasks.

weixuanfu commented 4 years ago

I fixed this issue with the commits above.