Closed Voulgaris-Sot closed 5 months ago
Hi @Voulgaris-Sot! Thanks for the deep dive!
Indeed, it looks like there is a slight difference between the datasets that were used and those listed in the paper. We're unsure how this happened, but we plan to investigate and update to the corrected list for the camera-ready version.
Running the `dataset_analysis.py` file that generates the tables actually produces the corrected version you mention. It seems the tables in the paper were generated with a slightly older version of the script and should be updated.
I've gone and made the following paper edits:
In total, we use 105 binary classification datasets, 68 multi-class classification datasets and 27 regression datasets.
And updated tables 5 & 6:
Note that table 3 was already correct and based on the corrected datasets, but we missed updating tables 5 and 6.
Thanks again for spotting and reporting this @Voulgaris-Sot! We will update the arXiv version with this fix alongside the camera-ready release for AutoML 2024.
Description: I recently reviewed the TabRepo paper associated with this repository and found that it mentions "we use 105 binary classification datasets, 67 multi-class classification datasets and 28 regression datasets" for a total of 200 datasets. This is also reflected in Tables 4, 5 and 6 of the paper. However, upon examining the code in this repository, I noticed that the actual distribution for the D244_F3_C1530_200 context (which was used for the results in the paper) is 105 binary classification datasets, 68 multi-class classification datasets and 27 regression datasets, again for a total of 200 datasets. The difference is that the paper mentions the regression dataset `Buzzinsocialmedia_Twitter` instead of the multi-class dataset `volkert`.

How to check: The D244_F3_C1530_200 context is defined in `context_2023_11_14.py` and contains the last 200 elements of the `datasets` list defined in `context_2023_08_21.py`. The list is sorted from largest to smallest dataset and contains 211 datasets in total. The paper mentions that "we filter out the 11 largest datasets for practical usability purposes of TabRepo", which is how we arrive at the 200 datasets used in the analysis. However, if we trust the list to be sorted, then the `Buzzinsocialmedia_Twitter` dataset is bigger than the `volkert` dataset and should not be included in the final 200 datasets. Moreover, when you download the D244_F3_C1530_200 context, the `volkert` dataset is present and the `Buzzinsocialmedia_Twitter` dataset is not, which contradicts the claims of the paper.

Summary: The paper mentions that the regression dataset `Buzzinsocialmedia_Twitter` was used. Instead, the code suggests that the multi-class dataset `volkert` was used in the analysis.

Thank you for your attention and let me know if my understanding is correct.
Below is a basic reproduction of the issue:
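As a minimal, self-contained sketch of the check (the placeholder list below stands in for the real `datasets` list in `context_2023_08_21.py`; only the two dataset names, the 211-item total, and the last-200 slice mirror the actual code):

```python
# Hypothetical sketch, not the actual TabRepo code: illustrates how slicing
# the last 200 entries of a size-sorted 211-item list decides which datasets
# make it into the D244_F3_C1530_200 context.

# Placeholder for the `datasets` list: sorted largest to smallest, 211 total.
# We assume Buzzinsocialmedia_Twitter sits among the 11 largest (position 11
# here is illustrative) and volkert just below the cutoff.
datasets = [f"larger_dataset_{i}" for i in range(10)]   # 10 placeholder large datasets
datasets += ["Buzzinsocialmedia_Twitter"]               # 11th largest -> should be filtered
datasets += ["volkert"]                                 # 12th largest -> should be kept
datasets += [f"smaller_dataset_{i}" for i in range(199)]  # remaining smaller datasets
assert len(datasets) == 211

# The context keeps the last 200 elements, i.e. drops the 11 largest.
context_200 = datasets[-200:]
assert len(context_200) == 200

# If the list is truly size-sorted, only volkert survives the cut:
assert "Buzzinsocialmedia_Twitter" not in context_200
assert "volkert" in context_200
```

This matches what is observed when downloading the context: `volkert` is present and `Buzzinsocialmedia_Twitter` is not.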