EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

Missing feature names in Wisconsin dataset #20

Open trangdata opened 4 years ago

trangdata commented 4 years ago

Currently, the features in the Wisconsin Prognostic Breast Cancer dataset do not have names.

The corresponding dataset (I think) on OpenML, or even Kaggle, seems to have this information. It would be helpful for these feature names to be added.

weixuanfu commented 4 years ago

@lacava Any idea? Should we update this dataset based on OpenML?

trangdata commented 4 years ago

Similar issue for the tic-tac-toe dataset. OpenML ref: https://www.openml.org/d/50

lacava commented 4 years ago

> @lacava Any idea? Should we update this dataset based on OpenML?

sure, we just need to make sure they match.
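One way to check that two copies match before borrowing feature names is to compare the values while ignoring the headers. The following is only a minimal sketch: the helper names (`read_table`, `same_values`, `adopt_names`) are hypothetical, and it assumes both files are comma-delimited, fully numeric, and have their columns in the same order. Datasets with categorical values or missing entries would need extra handling.

```python
import csv
import io

def read_table(text, delimiter=","):
    """Parse delimited text into (header, rows); values are read as floats."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [tuple(float(value) for value in row) for row in reader if row]
    return header, rows

def same_values(rows_a, rows_b):
    """True if both tables contain the same rows, ignoring row order."""
    return sorted(rows_a) == sorted(rows_b)

def adopt_names(unnamed_text, named_text):
    """Return the named header if the values match, otherwise None."""
    _, unnamed_rows = read_table(unnamed_text)
    named_header, named_rows = read_table(named_text)
    if same_values(unnamed_rows, named_rows):
        return named_header
    return None

# Toy data, made up for illustration (not taken from either repository).
unnamed = "a0,a1,target\n1,2,0\n3,4,1\n"
named = "feature_x,feature_y,class\n3.0,4.0,1\n1.0,2.0,0\n"
print(adopt_names(unnamed, named))
```

Returning `None` on a mismatch keeps the names from being copied over when the two copies differ in any value.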

> It would be helpful for these feature names to be added.

agreed! if you have bandwidth to submit a PR please do

trangdata commented 4 years ago

I think it's difficult for outsiders to help because we're not sure where the current datasets came from. I think in general it would also be helpful to add details/metadata for these datasets, e.g. source, meaning of features/classes, as asked here and wished here.

lacava commented 4 years ago

> I think it's difficult for outsiders to help because we're not sure where the current datasets came from.

Unfortunately, we are all in that situation with this project. Fortunately, the source of most of these datasets is pretty obvious. If everyone tackled a few datasets and verified their origin (e.g. through a checksum, as done here), we could quickly have origin information attached to most of the datasets. The only realistic way I see it happening is if everyone does a few and submits PRs.
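A checksum comparison of this kind could be sketched as below. This is an illustration under assumptions, not the project's actual procedure: `sha256_of` and `same_origin` are hypothetical helper names, and the choice of SHA-256 is arbitrary. The check only confirms byte-identical files, so a copy that was re-encoded, re-delimited, or recompressed would produce a different digest even if the values are the same.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_origin(local_path, source_path):
    """True if the two files are byte-for-byte identical."""
    return sha256_of(local_path) == sha256_of(source_path)
```

Recording the digest next to the source URL in each dataset's metadata would let later contributors repeat the check without guessing where the file came from.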

> I think in general it would also be helpful to add details/metadata for these datasets, e.g. source, meaning of features/classes, as asked here and wished here.

Agreed; that's discussed in issue #13. Since PR #11, metadata properties for the datasets have been extracted for the readme files.