folivetti opened 1 year ago
Like we discussed in our meeting, I'll make a first pass on the current results and identify potential datasets to remove from the benchmark and some possibilities to select and categorize them.
@lacava @foolnotion @MilesCranmer Some initial stats about PMLB! I've checked each dataset to see whether the target column was discrete or continuous and whether it had only positive values:
Count by domain:

```
domain
R         74
R+        15
Z          1
Z+        30
{0, 1}     2
Name: dataset, dtype: int64
```

Count by type:

```
type
continuous    89
discrete      33
Name: dataset, dtype: int64
```
`titanic` and `banana` are actually classification problems. All datasets with a domain other than R may come from other distributions (Poisson, gamma, etc.).
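For reference, a minimal sketch of how the target-column check could look. This is my reconstruction, not the original script; in particular, I treat non-negative values as belonging to the "+" domains:

```python
import numpy as np

def target_domain(y):
    """Categorize a target column as R, R+, Z, Z+, or {0, 1}.

    Assumption: non-negative values count as the "+" domains.
    """
    y = np.asarray(y, dtype=float)
    is_integer = np.all(np.mod(y, 1) == 0)   # every value is a whole number
    nonneg = np.all(y >= 0)
    if is_integer and set(np.unique(y)) <= {0.0, 1.0}:
        return "{0, 1}"                      # binary target -> classification
    if is_integer:
        return "Z+" if nonneg else "Z"
    return "R+" if nonneg else "R"
```

Applying this to each PMLB target column and counting per category would reproduce a table like the one above.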
I've also checked the uniqueness of the Z+ datasets:
Uniqueness ratios below 100% for the Z+ data:

```
     dataset                  uniqueness_ratio  unique_values
0    1027_ESL                         0.018443              9
1    1028_SWD                         0.004000              4
2    1029_LEV                         0.005000              5
3    1030_ERA                         0.009000              9
4    1089_USCrime                     0.893617             42
12   1595_poker                       0.000010             10
14   195_auto_price                   0.911950            145
15   197_cpu_act                      0.006836             56
16   201_pol                          0.000733             11
17   207_autoPrice                    0.911950            145
20   218_house_8L                     0.089756           2045
22   227_cpu_small                    0.006836             56
25   230_machine_cpu                  0.555024            116
26   294_satellite_image              0.000932              6
29   485_analcatdata_vehicle          0.979167             47
32   519_vinnie                       0.042105             16
34   523_analcatdata_neavote          0.080000              8
37   537_houses                       0.186143           3842
40   556_analcatdata_apnea2           0.374737            178
41   557_analcatdata_apnea1           0.345263            164
43   561_cpu                          0.497608            104
44   562_cpu_small                    0.006836             56
46   573_cpu_act                      0.006836             56
47   574_house_16H                    0.089756           2045
112  665_sleuth_case2002              0.129252             19
115  687_sleuth_ex1605                0.661290             41
116  690_visualizing_galaxy           0.643963            208
119  712_chscase_geyser1              0.225225             50
```
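The uniqueness ratio above is just the number of distinct target values divided by the number of samples. A sketch of how such a table could be built (the column names mirror the table; the function name and input format are assumptions):

```python
import numpy as np
import pandas as pd

def zplus_uniqueness(targets):
    """targets: mapping from dataset name to a 1-D array of target values.

    Returns the datasets whose targets repeat (uniqueness ratio < 100%).
    """
    rows = []
    for name, y in targets.items():
        n_unique = len(np.unique(np.asarray(y)))
        rows.append({"dataset": name,
                     "uniqueness_ratio": n_unique / len(y),
                     "unique_values": n_unique})
    df = pd.DataFrame(rows)
    return df[df["uniqueness_ratio"] < 1.0].reset_index(drop=True)
```

A low ratio with few unique values (e.g. `1028_SWD` with 4) is what suggests an ordinal/multiclass target rather than a truly continuous one.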
Some of them look like multiclass problems. Plotting the median of medians of R^2 with error bars, we get:
Indeed, most algorithms perform quite well for the R datasets and not so well for the other problems, possibly because they only fit least squares, disregarding other possibilities. My suggestion is to remove those from SRBench 3.0 and reintroduce them in SRBench 4.0 under a different track (non-Gaussian distributions).
In another analysis, I picked the top-10 algorithms w.r.t. r2_test and removed from the list those datasets that:
With this procedure we end up with 85 datasets. If we only keep those with domain R, we end up with 58 datasets (all of them Friedman :-) ).
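The removal criteria themselves are not listed above, so only the first step (picking the top-10 algorithms by r2_test) is sketched here; the long-form schema of the results table is an assumption:

```python
import pandas as pd

def top_algorithms(results, n=10):
    """results: long-form DataFrame with (assumed) columns
    ['algorithm', 'dataset', 'r2_test'].

    Rank algorithms by median test R^2 across datasets and keep the top n.
    """
    medians = results.groupby("algorithm")["r2_test"].median()
    return medians.sort_values(ascending=False).head(n).index.tolist()
```

The dataset-filtering step would then look only at the rows of `results` belonging to these algorithms.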
Some remarks on the Z+ uniqueness table above:
Datasets with `_visualizing_` in the identifier are from William S. Cleveland: *Visualizing Data*.

- `690_visualizing_galaxy`: target is "velocity relative to the earth".

Datasets with `_analcatdata_` in the identifier are from Jeffrey S. Simonoff: *Analyzing Categorical Data*.

- `485_analcatdata_vehicle`: "Margolis et al. (2000) analyzed data for 1991-1996 from the Fatality Analysis Reporting System, a nationwide registry of motor vehicle deaths in the United States. They reported deaths of children (younger than 16) cross-classified by type (passenger or pedestrian/bicyclist), gender, age, and whether the accident was alcohol related. The data are given in the file vehicle."
- `523_analcatdata_neavote`: "Many political, civic, and business organizations lobby state and federal legislators to try to influence their votes on key legislative matters. These organizations often track the voting patterns of legislators in the form of "report cards," which summarize the votes of legislators on key pieces of legislation. The file neavote summarizes the 1999 U.S. Senate legislative report card produced by the National Education Association (NEA, the nation's oldest and largest organization committed to advancing the cause of public education), based on 10 Senate votes during the year, and was obtained from the NEA's web site (www.nea.org)."
- `557_analcatdata_apnea1`: "An automated algorithm was also used to classify time periods into the same five categories using nasal pressure and mechanical respiratory input impedance." "The file apnea1 contains the results of comparisons of second-by-second evaluations by the two scorers for each of the 19 subjects." A more elaborate description is given in the book.
- `556_analcatdata_apnea2`: "The files apnea2 and apnea3 give results comparing the automated algorithm to scorer 1 and scorer 2, respectively. (The scorer values are not directly comparable to those in apnea1, because they were generated based on cross-validated test sets.)"

These files (and many others in PMLB) were most likely taken verbatim from StatLib (http://lib.stat.cmu.edu/datasets/), which references the sources.
Also relevant is the effort by @alexzwanenburg (https://github.com/EpistasisLab/pmlb/pull/180) who invested a lot of time to identify duplicates and clean up some of the datasets in PMLB.
Thanks @gkronber, this was going to be my next step (searching for duplicates), so this PR will make my life much easier :-)
~there is a PR on the PMLB repo that fixes many of these issues. might be a good place to start~ oh, I see @gkronber mentioned this
In a recent study (https://dl.acm.org/doi/abs/10.1145/3597312) I noticed that the differences between the top-N (N = 15 or more) algorithms are insignificant on most datasets. They only differ on a small selection of the Friedman datasets. Maybe it is a good idea to separate the comparison of the algorithms into different groups:
Given this, my other proposal is to add the benchmarks from those two competitions, plus the one proposed by @MilesCranmer, into the benchmark. For the 2023 competition I can also generate datasets with different levels of noise and other nasty features! Also, we can grab other benchmark functions from multimodal optimization to create more of those.