autogluon / tabrepo

Apache License 2.0

Discrepancy in Datasets Used in Paper and Repo #62

Closed Voulgaris-Sot closed 5 months ago

Voulgaris-Sot commented 6 months ago

Description: I recently reviewed the TabRepo paper associated with this repository and found that it states "we use 105 binary classification datasets, 67 multi-class classification datasets and 28 regression datasets", for a total of 200 datasets. This is also reflected in Tables 4, 5 and 6 of the paper. However, upon examining the code in this repository, I noticed that the actual distribution for the D244_F3_C1530_200 context - which was used for the results in the paper - is 105 binary classification datasets, 68 multi-class classification datasets and 27 regression datasets, again 200 datasets in total. The difference is that the paper counts the regression dataset Buzzinsocialmedia_Twitter instead of the multi-class dataset volkert.

How to check: The D244_F3_C1530_200 context is defined in context_2023_11_14.py and contains the last 200 elements of the datasets list defined in context_2023_08_21.py. The list is sorted from largest to smallest dataset and contains 211 datasets in total. The paper states that "we filter out the 11 largest datasets for practical usability purposes of TabRepo", which is how we arrive at the 200 datasets used in the analysis. However, if we trust the list's ordering, the Buzzinsocialmedia_Twitter dataset is larger than volkert and should not be included in the final 200 datasets. Moreover, when you download the D244_F3_C1530_200 context, the volkert dataset is present and Buzzinsocialmedia_Twitter is not, which contradicts the paper's claim.
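To make the slicing concrete, here is a toy sketch of the filtering logic. The positions are my reading of the list in context_2023_08_21.py, where Buzzinsocialmedia_Twitter is the 7th active entry (index 6) and volkert the 12th (index 11):

```python
# Toy stand-in for the 211-entry list, sorted largest (index 0) to smallest.
toy = [f"ds{i:03d}" for i in range(211)]
kept = toy[-200:]  # drops the 11 largest, i.e. indices 0..10

assert len(kept) == 200
assert "ds006" not in kept  # position of Buzzinsocialmedia_Twitter: dropped
assert "ds011" in kept      # position of volkert: kept
```

So if the list ordering is correct, Buzzinsocialmedia_Twitter falls inside the 11 filtered-out datasets while volkert survives the cut.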

Summary: The paper mentions that the regression dataset Buzzinsocialmedia_Twitter was used. Instead, the code suggests that the multi-class dataset volkert was used in the analysis.

Thank you for your attention and let me know if my understanding is correct.

Below is a basic reproduction of the issue:

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/autogluon/tabrepo/main/data/metadata/task_metadata_244.csv")

# Fix the naming discrepancy between the metadata and the dataset list:
# the metadata names contain dots, while the dataset list uses underscores.
# regex=False makes str.replace treat '.' as a literal dot rather than a regex
df['name'] = df['name'].str.replace('.', '_', regex=False)

# Copied from tabrepo/tabrepo/contexts/context_2023_08_21.py

# 244 datasets sorted largest to smallest
datasets = [
    # Commented out due to excessive size
    # "dionis",  # Cumulative Size: 4.5 TB (244 datasets)
    # "KDDCup99",
    # "Airlines_DepDelay_10M",
    # "Kuzushiji-49",
    # "pokerhand",
    # "sf-police-incidents",
    # "helena",  # Cumulative Size: 1.0 TB (238 datasets)
    # "covertype",
    # "Devnagari-Script",
    # "Higgs",
    # "walking-activity",
    # "spoken-arabic-digit",
    # "GTSRB-HOG01",
    # "GTSRB-HOG02",
    # "GTSRB-HOG03",
    # "GTSRB-HueHist",

    "porto-seguro",  # Cumulative Size: 455 GB (228 datasets)  # 211 datasets (succeeded)
    "airlines",
    "ldpa",
    "albert",
    "tamilnadu-electricity",
    "fars",
    "Buzzinsocialmedia_Twitter",
    "nyc-taxi-green-dec-2016",
    "Fashion-MNIST",
    "Kuzushiji-MNIST",
    "mnist_784",
    # "CIFAR_10",  # Failed
    "volkert",  # 200 datasets (succeeded)
    "Yolanda",
    "letter",
    "kr-vs-k",
    "kropt",
    "MiniBooNE",
    "shuttle",
    "jannis",
    "numerai28_6",
    "Diabetes130US",
    "Run_or_walk_information",
    "APSFailure",
    "kick",
    "Allstate_Claims_Severity",
    "Traffic_violations",
    "black_friday",
    "connect-4",  # Cumulative Size: 107 GB (200 datasets)
    "isolet",
    "adult",
    "okcupid-stem",
    "electricity",
    "bank-marketing",
    # "KDDCup09-Upselling",  # Failed
    # "one-hundred-plants-margin",  # Failed
    # "KDDCup09_appetency",  # Failed
    "jungle_chess_2pcs_raw_endgame_complete",
    "2dplanes",
    "fried",
    "Click_prediction_small",  # 175 datasets (succeeded)
    "nomao",
    "Amazon_employee_access",
    "pendigits",
    "microaggregation2",
    "artificial-characters",
    "robert",
    "houses",
    "Indian_pines",
    "diamonds",
    # "guillermo",  # Failed
    # "riccardo",  # Failed
    # "MagicTelescope",  # Failed
    # "nursery",  # Failed  # Cumulative Size: 50 GB (175 datasets)
    "har",
    "texture",
    "fabert",
    "optdigits",
    "mozilla4",
    "volcanoes-b2",
    "eeg-eye-state",
    "volcanoes-b1",
    "OnlineNewsPopularity",
    "volcanoes-b6",
    "dilbert",
    "volcanoes-b5",
    "GesturePhaseSegmentationProcessed",
    "ailerons",
    "volcanoes-d1",
    "volcanoes-d4",
    "mammography",
    "PhishingWebsites",
    "satimage",
    "jm1",
    "first-order-theorem-proving",
    "kdd_internet_usage",
    "eye_movements",
    "wine-quality-white",
    "delta_elevators",
    "mc1",
    "led24",
    "visualizing_soil",
    "house_16H",
    "SpeedDating",
    "bank32nh",
    "bank8FM",
    "cpu_act",
    "cpu_small",
    "kin8nm",
    "puma32H",
    "puma8NH",
    "collins",
    "house_sales",
    "page-blocks",
    "ringnorm",
    "twonorm",
    "delta_ailerons",
    "wind",
    "wall-robot-navigation",
    "elevators",
    "cardiotocography",
    "philippine",
    "pc2",
    "mfeat-factors",
    # "christine",  # Failed
    "phoneme",
    "sylvine",
    "Satellite",
    "pol",
    "churn",
    "wilt",
    "spambase",
    "segment",
    "waveform-5000",
    # "hypothyroid",  # Failed
    "semeion",
    "hiva_agnostic",
    "ada",
    # "yeast",  # Failed
    "Brazilian_houses",
    "steel-plates-fault",
    "pollen",
    "Bioresponse",  # 100 datasets (succeeded)
    "soybean",
    "Internet-Advertisements",
    "topo_2_1",
    "yprop_4_1",
    "UMIST_Faces_Cropped",
    "madeline",  # Cumulative Size: 8.7 GB (100 datasets)
    "micro-mass",
    "gina",
    "jasmine",
    "splice",
    "dna",
    "wine-quality-red",
    "cnae-9",
    "colleges",
    "madelon",
    "ozone-level-8hr",
    "MiceProtein",
    "volcanoes-a2",
    "volcanoes-a3",
    "Titanic",
    "wine_quality",
    "volcanoes-a4",
    "kc1",
    # "eating",  # Failed
    "car",
    # "QSAR-TID-10980",  # Failed
    # "QSAR-TID-11",  # Failed
    "pbcseq",
    "volcanoes-e1",
    "autoUniv-au6-750",
    # "Santander_transaction_value",  # Failed
    "SAT11-HAND-runtime-regression",
    "GAMETES_Epistasis_2-Way_20atts_0_1H_EDM-1_1",
    "GAMETES_Epistasis_2-Way_1000atts_0_4H_EDM-1_EDM-1_1",
    "GAMETES_Epistasis_2-Way_20atts_0_4H_EDM-1_1",
    "GAMETES_Epistasis_3-Way_20atts_0_2H_EDM-1_1",
    "GAMETES_Heterogeneity_20atts_1600_Het_0_4_0_2_50_EDM-2_001",
    "GAMETES_Heterogeneity_20atts_1600_Het_0_4_0_2_75_EDM-2_001",
    "autoUniv-au7-1100",
    "pc3",
    "Mercedes_Benz_Greener_Manufacturing",
    "OVA_Prostate",
    "OVA_Endometrium",
    "OVA_Kidney",
    "OVA_Lung",
    "OVA_Ovary",
    "pc4",
    # "OVA_Breast",  # Failed
    "OVA_Colon",
    "abalone",
    "LED-display-domain-7digit",
    "analcatdata_dmft",
    "cmc",
    "colleges_usnews",
    # "anneal",  # Failed
    "baseball",
    "hill-valley",
    "space_ga",
    "parity5_plus_5",
    "pc1",
    "eucalyptus",
    "qsar-biodeg",
    "synthetic_control",
    "fri_c0_1000_5",
    "fri_c1_1000_50",
    "fri_c2_1000_25",
    "fri_c3_1000_10",
    "fri_c3_1000_25",
    "autoUniv-au1-1000",
    "credit-g",
    "vehicle",
    "analcatdata_authorship",
    "tokyo1",
    "quake",
    "kdd_el_nino-small",
    "diabetes",  # Cumulative Size: 1.0 GB (30 datasets)
    "blood-transfusion-service-center",
    "us_crime",
    "Australian",
    "autoUniv-au7-700",
    "ilpd",
    "balance-scale",
    "arsenic-female-bladder",
    "climate-model-simulation-crashes",
    "cylinder-bands",
    "meta",
    "house_prices_nominal",
    "kc2",
    "rmftsa_ladata",
    "boston_corrected",
    "fri_c0_500_5",
    "fri_c2_500_50",
    "fri_c3_500_10",
    "fri_c4_500_100",
    "no2",
    "pm10",
    "dresses-sales",
    "fri_c3_500_50",
    "Moneyball",
    "socmob",
    "MIP-2016-regression",
    "sensory",
    "boston",
    "arcene",
    "tecator",
]

# Filter out the 11 largest datasets, keeping the remaining 200
datasets_200 = datasets[-200:]
df = df[df.name.isin(datasets_200)]
print(df.shape)  # sanity check: all 200 datasets should be present

binary = df[(df.NumberOfClasses==2.0) & (df.task_type == "Supervised Classification")]
multiclass = df[(df.NumberOfClasses>2.0) & (df.task_type == "Supervised Classification")]
regression = df[df.task_type == "Supervised Regression"]

print(f"Number of Binary datasets: {len(binary)}")
print(f"Number of Multiclass datasets: {len(multiclass)}")
print(f"Number of Regression datasets: {len(regression)}")
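One pitfall worth noting in the reproduction above: in older pandas versions, `Series.str.replace` treats its pattern as a regular expression by default, so a bare `'.'` matches every character and mangles the names. Passing `regex=False` makes the replacement literal. A minimal sketch using two names from the metadata:

```python
import pandas as pd

names = pd.Series(["Buzzinsocialmedia.Twitter", "numerai28.6"])

# regex=False replaces literal dots only; with regex=True every
# character would match '.' and be replaced by '_'.
fixed = names.str.replace(".", "_", regex=False)
print(fixed.tolist())  # ['Buzzinsocialmedia_Twitter', 'numerai28_6']
```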
Innixma commented 5 months ago

Hi @Voulgaris-Sot! Thanks for the deep dive!

Indeed, it may be that there is a slight difference between the datasets that were used and those listed in the paper. We're unsure how this happened, but we plan to investigate and update to the corrected list for the camera-ready version.

Innixma commented 5 months ago

Running dataset_analysis.py, the script that generates the tables, actually produces the corrected version you mention. It seems the tables in the paper came from a slightly older version of the script and should be updated.

I've gone and made the following paper edits:

In total, we use 105 binary classification datasets, 68 multi-class classification datasets and 27 regression datasets.

And updated tables 5 & 6:

tabrepo_corrected_table_5_6

Note that table 3 was already correct and based on the corrected datasets, but we missed updating tables 5 and 6.

Thanks again for spotting and reporting this @Voulgaris-Sot! We will update the arxiv with this fix alongside the camera ready version release for AutoML 2024.