Open alexzwanenburg opened 2 years ago
Thank you so much for your detailed investigation of the dataset collection @alexzwanenburg! Would you have the bandwidth to make a PR to address (even part of) the duplications?
Yes, I can create a PR to address this issue, though it may take a few weeks to address everything fully.
I have two questions:
1. Should a deprecated tag be added to the metadata yaml file that, if present, refers to the new dataset? I would expect this to have the following behaviour:
   - If the deprecated tag is present, and not empty (~ or null), the dataset will no longer be visible on the PMLB GitHub Pages.
   - If the deprecated tag is present, and not empty (~ or null), fetching the dataset will produce a warning.
2. Should a license tag be added to the metadata to document the license, e.g. license: CC-BY-4.0?

All those suggestions look good to me.
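For illustration, the fetch-time warning proposed above could be sketched roughly as follows. This is only a sketch, not PMLB's actual implementation: the function name `check_deprecated` and the dataset names in the usage note are hypothetical, and the metadata is assumed to have already been parsed from yaml into a dict (where `~` and `null` both become `None`).

```python
import warnings


def check_deprecated(name, metadata):
    """Return the replacement dataset name, or None if `name` is not deprecated.

    `metadata` is assumed to be the dataset's metadata yaml parsed into a dict.
    """
    replacement = metadata.get("deprecated")
    # A missing tag, `~`/`null` (parsed as None) or an empty string all mean
    # "not deprecated" in this sketch.
    if replacement is None or replacement == "":
        return None
    warnings.warn(
        f"Dataset '{name}' is deprecated; use '{replacement}' instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return replacement
```

Calling `check_deprecated("old_dataset", {"deprecated": "new_dataset"})` would emit the warning and return `"new_dataset"`; the same check could be used to hide the dataset from the generated pages.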
Hi @alexzwanenburg, thanks again for your work spearheading this. Do you still plan to make a PR for these changes? 🙏
Yes, but I still need to update the four final datasets. I can create a PR for the work I have already done.
Ping on this @alexzwanenburg. Hopefully we can pick up where you left off if you create a PR.
@alexzwanenburg I'm ready to help finish this PR. Is your fork up-to-date with your changes documented in this issue?
I made a PR. I haven't addressed the last four datasets.
While trying to identify which data sets from the modeldata R package are already present in pmlb, I found that quite a few datasets are duplicates or simple subsets of other datasets.
cmc and contraceptive datasets.

symboling feature underwent a shift between both datasets. Note: the underlying dataset seems to be the same as the one used for auto. The difference between the datasets is the target, which is price for 195_auto_price and 207_autoPrice, and symboling for auto, as well as how missing values were removed. The original dataset may be found on the UCI ML repository.
Description of each new dataset references the other.
195_auto_price, 207_autoPrice and auto datasets.

glass and prnn_fglass datasets.

cleve and heart_c data sets have a binarized target (vs. ordinal in the other two datasets); the cleveland_nominal data set contains only a feature subset. The original can be found on the UCI ML repository.
cleve data set.
heart_c, cleve, cleveland_nominal, cleveland, heart_statlog, heart_h and hungarian datasets.

colic and horse_colic datasets.

vote and house_votes_84 datasets.

breast_cancer_wisconsin and wdbc datasets.

australian, buggyCrx, credit_a and crx datasets.

breast dataset has a Sample code number feature that is not present in breast_w. The original can be found on the UCI ML repository.
breast_w and breast datasets.
Parse data from the original into the expected format.

diabetes and pima datasets.

credit_g and german datasets.

solar_flare_2 also contains two additional features.
solar_flare_2 are in fact the other two targets.
solar_flare_2 and flare datasets.

car_evaluation dataset several categorical (ordinal) features from car are one-hot-encoded. The original can be found on the UCI ML repository. This issue was also mentioned in #84.
car and car_evaluation datasets.

chess and kr_vs_kp datasets.

294_satellite_image incorrectly specifies a regression problem. The original can be found on the UCI ML repository, and has multiple (6) classes as target.
satimage and 294_satellite_image datasets.

227_cpu_small and 562_cpu_small have fewer features.
197_cpu_act, 227_cpu_small, 562_cpu_small and 573_cpu_act datasets.

poker and 1595_poker datasets.

My proposal is to remove duplicates, using the original dataset where it can be found. This might also address the following issues:
#20
#75
#84
#159
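As a rough illustration of how exact duplicates and row subsets like those listed above might be flagged, here is a minimal sketch. It is not the procedure used for this investigation: `compare_rows` is a hypothetical helper, and it assumes both datasets have already been loaded as lists of rows with the same column order and the same encoding. Datasets that differ in encoding (e.g. one-hot-encoded features) or in their target would still need manual comparison.

```python
from collections import Counter


def compare_rows(rows_a, rows_b):
    """Classify two row collections as duplicate, subset, or different.

    Rows are compared as whole tuples, ignoring row order but respecting
    how many times each row occurs.
    """
    count_a = Counter(map(tuple, rows_a))
    count_b = Counter(map(tuple, rows_b))
    if count_a == count_b:
        return "duplicate"
    # Counter subtraction keeps only positive counts, so an empty result
    # means every row of the left side also occurs in the right side.
    if not (count_a - count_b):
        return "a_subset_of_b"
    if not (count_b - count_a):
        return "b_subset_of_a"
    return "different"
```

For example, `compare_rows([[1, 2]], [[1, 2], [3, 4]])` reports that the first collection is a subset of the second.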