EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License
805 stars 135 forks

fix issue #19, update promoters metadata #78

Closed lacava closed 4 years ago

lacava commented 4 years ago

This (unfinished!) PR removes the promoters dataset because it is a duplicate of molecular_biology_promoters (issue #19). I also began adding metadata to molecular_biology_promoters, but I have not yet gotten the source data to match our version exactly. It also appears that the class labels are reversed, which we might want to fix.

Here is a Colab notebook where I'm working on source verification.
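For readers without access to the notebook, here is a minimal sketch of what such a source check could look like. It is not the notebook's code. It assumes both versions are available as CSV text with the same column names and a label column called `target`, and that no two rows share identical feature values; the names `load_rows` and `compare` are made up for illustration.

```python
import csv
import io


def load_rows(csv_text, target="target"):
    """Parse CSV text into sorted (features, label) pairs so row order is ignored."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Order features by name so column order does not affect the comparison.
    features = sorted(name for name in (reader.fieldnames or []) if name != target)
    rows = [(tuple(row[name] for name in features), row[target]) for row in reader]
    return features, sorted(rows)


def compare(source_text, ours_text, target="target"):
    """Return (status, label_mapping) describing how two CSV versions differ."""
    source_cols, source = load_rows(source_text, target)
    our_cols, ours = load_rows(ours_text, target)
    if source_cols != our_cols:
        return "columns differ", {}
    if source == ours:
        return "identical", {}
    if [feats for feats, _ in source] != [feats for feats, _ in ours]:
        return "features differ", {}
    # Features match row for row, so check for a consistent relabeling
    # (for example, 0 and 1 swapped). Assumes feature rows are unique.
    mapping = {}
    for (_, src_label), (_, our_label) in zip(source, ours):
        if mapping.setdefault(src_label, our_label) != our_label:
            return "labels differ", {}
    if len(set(mapping.values())) == len(mapping):
        return "relabeled", mapping
    return "labels differ", {}


# Toy data, not the real promoters dataset.
source_csv = "a,b,target\n1,2,0\n3,4,1\n"
our_csv = "a,b,target\n3,4,0\n1,2,1\n"
print(compare(source_csv, our_csv))
```

On this toy input the check reports a relabeling with 0 and 1 swapped, which is the kind of result one would expect if the class labels really are reversed.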

trangdata commented 4 years ago

The UCI link gives the pre-processed data. I think it's more straightforward if we just cite this link from OpenML.

I can't save my changes on your colab notebook, so I made a copy to verify they're the same here.

We also need to remove the instance column because it's a row identifier, as mentioned in #19 (see profiling report). I can help with that.
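A rough sketch of that cleanup, assuming the data is available as CSV text and the identifier column is literally named `instance`. The helper names `is_row_identifier` and `drop_column` are hypothetical and not part of pmlb.

```python
import csv
import io


def is_row_identifier(csv_text, column="instance"):
    """True if every row has a distinct value in `column` (a hint that it is an ID)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = [row[column] for row in reader]
    return len(values) > 0 and len(set(values)) == len(values)


def drop_column(csv_text, column="instance"):
    """Return the CSV text with `column` removed; raise KeyError if it is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if column not in fieldnames:
        raise KeyError(f"column {column!r} not found in {fieldnames}")
    kept = [name for name in fieldnames if name != column]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=kept, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)  # extrasaction="ignore" skips the dropped column
    return out.getvalue()


# Toy rows, not the real dataset.
raw = "instance,p1,p2,target\nrow_a,a,c,1\nrow_b,t,g,0\n"
if is_row_identifier(raw):
    print(drop_column(raw))
```

Unique values alone do not prove a column is an identifier, so the uniqueness check is only a sanity test before dropping it.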

weixuanfu commented 4 years ago

I found the instance feature in the metadata but cannot find it in the dataset.
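A quick way to spot this kind of mismatch is to compare the feature names listed in the metadata with the dataset's column names. The sketch below takes both as plain lists and assumes the label column is named `target`; `metadata_mismatch` is a hypothetical helper, and reading the names from the actual metadata file is left out.

```python
def metadata_mismatch(metadata_features, dataset_columns, target="target"):
    """Report feature names that appear on only one side."""
    meta = set(metadata_features)
    data = set(dataset_columns) - {target}  # the label is not a feature
    return {
        "only_in_metadata": sorted(meta - data),
        "only_in_dataset": sorted(data - meta),
    }


# Toy names for illustration only.
report = metadata_mismatch(
    metadata_features=["instance", "p1", "p2"],
    dataset_columns=["p1", "p2", "target"],
)
print(report)
```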