EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License
805 stars 135 forks

Metadata datatypes update #48

Closed lacava closed 4 years ago

lacava commented 4 years ago

Based on our discussions, the generated data types are now encoded as categorical or continuous. In addition, binary data types are captured for endpoints and summary_stats counts. This PR also adds options to write_metadata that keep it from overwriting metadata.yaml files whose header has been changed, which indicates that they have been customized.
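To make the overwrite guard concrete, here is a minimal sketch of the idea. The header string `GENERATED_HEADER`, the function names `is_customized` and `write_metadata_safely`, and the `overwrite` flag are all hypothetical illustrations; they are not taken from pmlb/write_metadata.py, whose actual options may differ.

```python
import os
import tempfile

# Hypothetical marker line; the real header written by pmlb may differ.
GENERATED_HEADER = "# Auto-generated metadata. Change this header to mark the file as customized."


def is_customized(path):
    """Return True if the file exists and its first line is not the generated header."""
    if not os.path.exists(path):
        return False
    with open(path, encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    return first_line != GENERATED_HEADER


def write_metadata_safely(path, body, overwrite=False):
    """Write header + body unless the existing file looks customized.

    Returns True if the file was written, False if it was skipped.
    """
    if is_customized(path) and not overwrite:
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(GENERATED_HEADER + "\n" + body)
    return True


def demo():
    """Exercise the guard in a temporary directory and return the three outcomes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "metadata.yaml")
        first = write_metadata_safely(path, "dataset: demo\n")    # new file
        second = write_metadata_safely(path, "dataset: demo2\n")  # header unchanged
        # Simulate a contributor editing the header by hand.
        with open(path, "w", encoding="utf-8") as f:
            f.write("# Reviewed by a human\ndataset: demo\n")
        third = write_metadata_safely(path, "dataset: demo3\n")   # customized: skipped
        return first, second, third
```

In this sketch a regenerated file is replaced freely, while a file whose first line was edited is left alone unless the caller passes `overwrite=True`.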

jwehrmann commented 4 years ago

Hey, sorry to bother you guys. I know this is a work in progress, but I assume the categorical features were detected by some automatic script, correct? I am afraid it is not possible to do this kind of inference based solely on the data itself. For instance, the MNIST features have been detected as categorical, though we know that they are intensities rather than categories. (In addition, I could not find discrete/integer features in the metadata files in the current version of this branch.) Maybe a better approach would be to form a task force or something like that to manually validate the datasets. I would not mind manually checking a bunch of datasets.
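To illustrate the concern, here is a small sketch of a purely data-driven rule. The function `infer_feature_type` and its `max_categories` threshold are hypothetical and are not the rule used by the PMLB script; only the labels (binary, categorical, continuous) follow the PR description. Under this assumed rule, an intensity column and a genuinely categorical column get the same label.

```python
def infer_feature_type(values, max_categories=10):
    """Naive heuristic: classify a column from its values alone.

    Hypothetical rule, not the one in pmlb: a column with exactly two
    distinct values is binary; an integer-valued column with few distinct
    values is categorical; everything else is continuous.
    """
    distinct = set(values)
    all_integers = all(float(v).is_integer() for v in distinct)
    if len(distinct) == 2:
        return "binary"
    if all_integers and len(distinct) <= max_categories:
        return "categorical"
    return "continuous"


# An intensity column on a 0-255 scale in which only a few distinct
# values happen to occur (for example, a mostly dark image region).
pixel_column = [0, 0, 0, 64, 128, 255, 0, 0, 191, 0]

# A genuinely categorical column encoded as integers (for example, a color code).
color_code_column = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]

pixel_label = infer_feature_type(pixel_column)
color_label = infer_feature_type(color_code_column)
print(pixel_label, color_label)
```

The rule cannot tell the two columns apart because the distinction lies in what the numbers mean, not in the numbers themselves, which is why manual review of the metadata is being suggested.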

lacava commented 4 years ago

> Hey, sorry to bother you guys. I know this is a work in progress, but I assume the categorical features were detected by some automatic script, correct? I am afraid it is not possible to do this kind of inference based solely on the data itself. For instance, the MNIST features have been detected as categorical, though we know that they are intensities rather than categories. (In addition, I could not find discrete/integer features in the metadata files in the current version of this branch.) Maybe a better approach would be to form a task force or something like that to manually validate the datasets.

Yes, we realize the process is not entirely manual. The script (pmlb/write_metadata.py) is just a rough starting point. We are hoping that contributors like yourself will go in and update the metadata files, as in these PRs: #41, #44, etc. You can see some guidelines we put together in this section of our Contributing Guide.

> I would not mind manually checking a bunch of datasets.

Awesome! Thanks so much! If you want, you can send a PR to our PMLB2.0 branch on some datasets like MNIST and we'll get it checked and merged.