Duplicate datasets.

alexzwanenburg commented 2 years ago

While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.

cmc and contraceptive are the same. The original can be found on the UCI ML repository.
- [x] Parse data from the original into the expected format.
- [x] Deprecate cmc and contraceptive datasets.
195_auto_price and 207_autoPrice. The symboling feature underwent a shift between both datasets. Note: the underlying dataset seems to be the same as the one used for auto. The difference between the datasets is the target, which is price for 195_auto_price and 207_autoPrice, and symboling for auto, as well as how missing values were removed. The original dataset may be found on the UCI ML repository.
- [x] Parse data from the original into the expected format with price as target.
- [x] Parse data from the original into the expected format with symboling as target.
- [x] Ensure that Description of each new dataset references the other.
- [x] Deprecate 195_auto_price, 207_autoPrice and auto datasets.
glass and prnn_fglass. The target class levels seem to be switched between datasets. The original can be found on the UCI ML repository.
- [x] Parse data from the original into the expected format.
- [x] Deprecate glass and prnn_fglass datasets.
heart_c, cleve, cleveland_nominal and cleveland. The cleve and heart_c data sets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal data set contains only a feature subset. The original can be found on the UCI ML repository.
heart_statlog is a subset of the cleve data set.
heart_h and hungarian appear to be the same.
- [x] Parse Cleveland data from the original into the expected format.
- [x] Parse Hungarian data from the original into the expected format.
- [x] Parse Switzerland data (currently missing) from the original into the expected format.
- [x] Parse VA Long beach data (currently missing) from the original into the expected format.
- [x] Deprecate heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.
colic and horse_colic appear to be the same. The original can be found on the UCI ML repository. This issue was also mentioned in #75.
- [x] Parse data from the original into the expected format.
- [x] Deprecate colic and horse_colic datasets.
vote and house_votes_84 are identical.
- [x] Identify original source.
- [x] Parse data from the original into the expected format.
- [x] Deprecate vote and house_votes_84 datasets.
breast_cancer_wisconsin and wdbc are the same. The original can be found on the UCI ML repository.
- [x] Parse data from the original into the expected format.
- [x] Deprecate breast_cancer_wisconsin and wdbc datasets.
australian, buggyCrx, credit_a and crx are identical or based on the same data.
- [x] Identify original source.
- [x] Parse data from the original into the expected format.
- [x] Deprecate australian, buggyCrx, credit_a and crx datasets.
breast_w and breast are based on the same data. The breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
- [x] Parse data from the original into the expected format.
- [x] Deprecate breast_w and breast datasets.
diabetes and pima appear to be identical.
- [x] Identify original source. This dataset appears to have been hosted at the UCI ML repository. However, the original owner seems to have withdrawn permission to use this dataset.
- [x] ~~Parse data from the original into the expected format.~~
- [x] Deprecate diabetes and pima datasets.
credit_g and german appear to be identical.
- [x] Identify original source. The original can be found on the UCI ML repository.
- [x] Parse data from the original into the expected format.
- [x] Deprecate credit_g and german datasets.
solar_flare_2 and flare derive from the same data, but differ in the way the target is formulated. solar_flare_2 also contains two additional features.
- [x] Identify original source. The original can be found on the UCI ML repository. There are three targets, of which one is useful for ML prediction. The additional features in solar_flare_2 are in fact the other two targets.
- [x] Parse data from the original into the expected format.
- [x] Deprecate solar_flare_2 and flare datasets.
car and car_evaluation are based on the same dataset. In the car_evaluation dataset several categorical (ordinal) features from car are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in #84.
- [x] Parse data from the original into the expected format.
- [x] Deprecate car and car_evaluation datasets.
chess and kr_vs_kp are identical. The original can be found on the UCI ML repository.
- [ ] Parse data from the original into the expected format.
- [ ] Deprecate chess and kr_vs_kp datasets.
satimage and 294_satellite_image are the same, with the exception that 294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
- [ ] Parse data from the original into the expected format.
- [ ] Deprecate satimage and 294_satellite_image datasets.
197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act are based on the same dataset, with the difference being that 227_cpu_small and 562_cpu_small have fewer features.
- [ ] Identify original source.
- [ ] Parse data from the original into the expected format.
- [ ] Deprecate 197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.
poker and 1595_poker are identical except for the target specification. The original can be found on the UCI ML repository, and suggests the target is ordinal.
- [ ] Parse data from the original into the expected format.
- [ ] Deprecate poker and 1595_poker datasets.
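
The kinds of relationships listed above (identical data, row subsets, switched class levels) could be checked with something like the following. This is only a minimal sketch, not the method actually used for this issue: the function names `fingerprint`, `is_row_subset` and `is_relabelling` are hypothetical, and the rows are assumed to be already loaded as plain tuples with matching column order and value types.

```python
import hashlib
from collections import Counter


def fingerprint(rows):
    """Order-independent hash of a dataset given as an iterable of rows."""
    canonical = sorted(repr(tuple(row)) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def is_row_subset(small, large):
    """True if every row of `small` occurs in `large` at least as often."""
    small_counts = Counter(tuple(row) for row in small)
    large_counts = Counter(tuple(row) for row in large)
    return all(large_counts[row] >= n for row, n in small_counts.items())


def is_relabelling(labels_a, labels_b):
    """True if two label sequences differ only by a one-to-one renaming."""
    if len(labels_a) != len(labels_b):
        return False
    forward, backward = {}, {}
    for a, b in zip(labels_a, labels_b):
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# Toy rows, made up for illustration (not taken from PMLB).
a = [(1, 0, "x"), (2, 1, "y"), (3, 0, "x")]
b = [(3, 0, "x"), (1, 0, "x"), (2, 1, "y")]  # same rows, different order

print(fingerprint(a) == fingerprint(b))       # True
print(is_row_subset(a[:2], b))                # True
print(is_relabelling([0, 1, 0], [1, 0, 1]))   # True
```

Note that this compares values exactly, so `1` and `1.0` count as different; datasets that differ in encoding, feature order or missing-value handling would need to be normalised first.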

My proposal is to remove duplicates, using an original dataset where this can be found. This might also address the following issues:

- #20
- #75
- #84
- #159

trangdata commented 2 years ago

Thank you so much for your detailed investigation of the dataset collection @alexzwanenburg! Would you have the bandwidth to make a PR to address (even part of) the duplications?

alexzwanenburg commented 2 years ago

Yes, I can create a PR to address this issue, though it may take a few weeks to address everything fully.

I have two questions:

1. What should I do with the duplicate datasets? Issue #119 was not fully addressed. I propose adding a deprecated tag to the metadata yaml file that, if present, refers to the new dataset. I would expect the following behaviour:
   - [ ] If the deprecated tag is present and not empty (that is, not ~ or null), the dataset will no longer be visible on the PMLB GitHub Pages.
   - [ ] If the deprecated tag is present and not empty (that is, not ~ or null), fetching the dataset will produce a warning.
   - [ ] Deprecated datasets are fully removed with the next major release (v2.0.0).
2. Some datasets have a known license. Can I add a license tag to the metadata to document this, e.g. license: CC-BY-4.0?
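
To make the warning behaviour of the proposed deprecated tag concrete, here is a rough sketch. It is an assumption about how this could work, not existing PMLB code: `resolve_deprecation` is a hypothetical helper, the metadata is assumed to be already parsed into a dict (YAML `~` and `null` both load as `None`), and the replacement name in the example is made up.

```python
import warnings


def resolve_deprecation(name, metadata):
    """Warn if `metadata` marks dataset `name` as deprecated.

    `metadata` is the parsed metadata mapping (hypothetical structure).
    A missing `deprecated` key or a null value means the dataset is current.
    Returns the replacement dataset name, or None.
    """
    replacement = metadata.get("deprecated")
    if replacement is None:
        return None
    warnings.warn(
        f"Dataset '{name}' is deprecated; use '{replacement}' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return replacement


# Hypothetical metadata entries; the replacement name is invented.
old_meta = {"dataset": "cmc", "deprecated": "replacement_dataset"}
new_meta = {"dataset": "replacement_dataset", "deprecated": None,
            "license": "CC-BY-4.0"}

print(resolve_deprecation("cmc", old_meta))
print(resolve_deprecation("replacement_dataset", new_meta))
```

A fetch function could call such a helper before loading the data, and the site generator could skip any dataset for which it returns a non-null value.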

lacava commented 2 years ago

all those suggestions look good to me.

lacava commented 1 year ago

Hi @alexzwanenburg , thanks again for your work spearheading this. Do you still plan to make a PR for these changes? 🙏

alexzwanenburg commented 1 year ago

Yes, but I still need to update the four final datasets. I can create a PR for the work I have already done.

lacava commented 1 year ago

ping on this @alexzwanenburg , hopefully we could pick up where you left off if you create a PR

gkronber commented 1 year ago

@alexzwanenburg I'm ready to help finish this PR. Is your fork up-to-date with your changes documented in this issue?

alexzwanenburg commented 11 months ago

I made a PR. I haven't addressed the last four datasets.

EpistasisLab / pmlb

Duplicate datasets. #167
