EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

Duplicate datasets. #167

Open alexzwanenburg opened 2 years ago

alexzwanenburg commented 2 years ago

While trying to identify which datasets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.
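For illustration, a duplicate or subset check of this kind could be sketched as follows. This is only an assumption about one possible approach, not necessarily how the investigation above was done. The function names and the toy data are hypothetical, and the sketch assumes both datasets share the same column order and that values can be compared exactly.

```python
from collections import Counter


def row_counts(rows):
    """Count each row, so row order does not matter but repeated rows do."""
    return Counter(tuple(row) for row in rows)


def compare_datasets(rows_a, rows_b):
    """Classify how dataset A relates to dataset B.

    Returns "duplicate", "subset" (A is contained in B),
    "superset" (B is contained in A), or "distinct".
    """
    a, b = row_counts(rows_a), row_counts(rows_b)
    if a == b:
        return "duplicate"
    # Every row of A occurs in B at least as often as it does in A.
    if all(b[row] >= n for row, n in a.items()):
        return "subset"
    # Every row of B occurs in A at least as often as it does in B.
    if all(a[row] >= n for row, n in b.items()):
        return "superset"
    return "distinct"


# Hypothetical toy data, not taken from pmlb.
full = [(1, 0.5, "a"), (2, 1.5, "b"), (3, 2.5, "a")]
shuffled = [full[2], full[0], full[1]]
partial = full[:2]
other = [(9, 9.9, "z")]

print(compare_datasets(full, shuffled))
print(compare_datasets(partial, full))
print(compare_datasets(full, other))
```

Comparing row multisets rather than raw files means a reshuffled copy still counts as a duplicate. Renamed or reordered columns, or recoded labels, would need extra normalization before this comparison.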

My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues:

trangdata commented 2 years ago

Thank you so much for your detailed investigation of the dataset collection @alexzwanenburg! Would you have the bandwidth to make a PR to address (even part of) the duplications?

alexzwanenburg commented 2 years ago

Yes, I can create a PR to address this issue, though it may take a few weeks to address everything fully.

I have two questions:

lacava commented 2 years ago

All those suggestions look good to me.

lacava commented 1 year ago

Hi @alexzwanenburg, thanks again for your work spearheading this. Do you still plan to make a PR for these changes? 🙏

alexzwanenburg commented 1 year ago

Yes, but I still need to update the four final datasets. I can create a PR for the work I have already done.

lacava commented 1 year ago

Ping on this, @alexzwanenburg. If you create a PR, hopefully we can pick up where you left off.

gkronber commented 1 year ago

@alexzwanenburg I'm ready to help finish this PR. Is your fork up to date with the changes you documented in this issue?

alexzwanenburg commented 11 months ago

I made a PR. I haven't addressed the last four datasets.