EpistasisLab / pmlb

PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms.
https://epistasislab.github.io/pmlb/
MIT License

Questions about parity5_plus_5 #179

Open amueller opened 1 year ago

amueller commented 1 year ago

Would it be possible to get a description of the parity5_plus_5 dataset? There are several things about it that confuse me. First, there are some duplicate rows, which seems odd: the rows count from 0 to 1023 in binary, yet the dataset contains 1124 rows, meaning 100 of them are duplicates.

Also, I'm not sure I understand the name of the dataset. The equation for the class label seems to be

```python
data['class'] == data[['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']].sum(axis=1) % 2
```

but I'm not sure what the intuition behind this is or how it relates to the name. I assume there's some simple binary formula behind this, but I don't immediately see it. Or is it just referring to the fact that the other five bits don't influence the outcome?
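For concreteness, here is a minimal sketch of the two checks described above. It assumes the dataset can be fetched via `pmlb.fetch_data` under the name `'parity5_plus_5'` and that the columns are named as in the snippet above (`Bit_0`..`Bit_9` plus `class`); newer PMLB releases may name the target column `target` instead.

```python
# Sketch of the checks described above (assumed names: dataset 'parity5_plus_5',
# feature columns 'Bit_0'..'Bit_9', target column 'class').
from pmlb import fetch_data

data = fetch_data('parity5_plus_5')

# Duplicate rows: 1124 rows in total, but the bit patterns only cover 0..1023,
# so about 100 rows should be exact duplicates.
print(len(data), len(data.drop_duplicates()))

# The class label appears to be the parity of five specific bits.
relevant = ['Bit_2', 'Bit_3', 'Bit_4', 'Bit_6', 'Bit_8']
print((data['class'] == data[relevant].sum(axis=1) % 2).all())
```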

lacava commented 1 year ago

@ryanurbs do you happen to know the equation for this dataset?

amueller commented 1 year ago

I think the explanation is simply that the parity of a subset of 5 bits is computed and the other bits are ignored. But I'm still confused by the duplicated rows.

ryanurbs commented 1 year ago

@lacava @amueller I'm looking into getting a definitive answer to your question. We received this dataset from a colleague.

ryanurbs commented 1 year ago

@lacava @amueller I found a published description of the parity5+5 problem here: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/liu-3.pdf

You are indeed correct that only 5 of the features are relevant (Bits 2, 3, 4, 6, and 8); the other 5 are randomly generated. The underlying predictive pattern is that if there is an even number of zeros across those relevant features, the outcome is 1, and otherwise it is 0.

I'm not sure why there are extra redundant rows in this dataset; there should be 1024 unique rows, as described in the paper above. I'm not certain of the exact origins of this particular copy, so it might not be possible to track down where the extra rows came from, but you could simply remove the redundant rows, depending on the experiment you want to run.

The name parity5+5 comes from the fact that this dataset is essentially the original parity5 problem with 5 irrelevant features added to it.

amueller commented 1 year ago

@ryanurbs thank you for the explanation. It's interesting that the published version only has 1024 rows, so the extra rows might be the result of some processing mix-up along the way. Feel free to close. I was asking on behalf of openml.org, where we may drop the duplicate rows in a new version of the dataset.