folivetti opened 1 year ago
Like we discussed in our meeting, I'll make a first pass on the current results and identify potential datasets to remove from the benchmark and some possibilities to select and categorize them.
@lacava @foolnotion @MilesCranmer Some initial stats about PMLB! I've checked each dataset to see whether the target column was discrete or continuous and whether it had only positive values:
Count by domain:

```
domain
R         74
R+        15
Z          1
Z+        30
{0, 1}     2
Name: dataset, dtype: int64
```

Count by type:

```
type
continuous    89
discrete      33
Name: dataset, dtype: int64
```
`titanic` and `banana` are actually classification problems. All datasets with a domain other than R may come from other distributions (Poisson, gamma, etc.).
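For reference, a minimal sketch of how the target-column check could look. This is my reconstruction, not the original script; in particular, I treat non-negative values as belonging to the "+" domains:

```python
import numpy as np

def target_domain(y):
    """Categorize a target column as R, R+, Z, Z+, or {0, 1}.

    Assumption: non-negative values count as the "+" domains.
    """
    y = np.asarray(y, dtype=float)
    is_integer = np.all(np.mod(y, 1) == 0)   # every value is a whole number
    nonneg = np.all(y >= 0)
    if is_integer and set(np.unique(y)) <= {0.0, 1.0}:
        return "{0, 1}"                      # binary target -> classification
    if is_integer:
        return "Z+" if nonneg else "Z"
    return "R+" if nonneg else "R"
```

Applying this to each PMLB target column and counting per category would reproduce a table like the one above.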
I've also checked the uniqueness of the Z+ datasets:
Uniqueness ratios below 100% for the Z+ data:

```
     dataset                  uniqueness_ratio  unique_values
0    1027_ESL                         0.018443              9
1    1028_SWD                         0.004000              4
2    1029_LEV                         0.005000              5
3    1030_ERA                         0.009000              9
4    1089_USCrime                     0.893617             42
12   1595_poker                       0.000010             10
14   195_auto_price                   0.911950            145
15   197_cpu_act                      0.006836             56
16   201_pol                          0.000733             11
17   207_autoPrice                    0.911950            145
20   218_house_8L                     0.089756           2045
22   227_cpu_small                    0.006836             56
25   230_machine_cpu                  0.555024            116
26   294_satellite_image              0.000932              6
29   485_analcatdata_vehicle          0.979167             47
32   519_vinnie                       0.042105             16
34   523_analcatdata_neavote          0.080000              8
37   537_houses                       0.186143           3842
40   556_analcatdata_apnea2           0.374737            178
41   557_analcatdata_apnea1           0.345263            164
43   561_cpu                          0.497608            104
44   562_cpu_small                    0.006836             56
46   573_cpu_act                      0.006836             56
47   574_house_16H                    0.089756           2045
112  665_sleuth_case2002              0.129252             19
115  687_sleuth_ex1605                0.661290             41
116  690_visualizing_galaxy           0.643963            208
119  712_chscase_geyser1              0.225225             50
```
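The uniqueness ratio above is just the number of distinct target values divided by the number of samples. A sketch of how such a table could be built (the column names mirror the table; the function name and input format are assumptions):

```python
import numpy as np
import pandas as pd

def zplus_uniqueness(targets):
    """targets: mapping from dataset name to a 1-D array of target values.

    Returns the datasets whose targets repeat (uniqueness ratio < 100%).
    """
    rows = []
    for name, y in targets.items():
        n_unique = len(np.unique(np.asarray(y)))
        rows.append({"dataset": name,
                     "uniqueness_ratio": n_unique / len(y),
                     "unique_values": n_unique})
    df = pd.DataFrame(rows)
    return df[df["uniqueness_ratio"] < 1.0].reset_index(drop=True)
```

A low ratio with few unique values (e.g. `1028_SWD` with 4) is what suggests an ordinal/multiclass target rather than a truly continuous one.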
Some of them look like multiclass problems. Plotting the median of medians of R^2 with error bars, we get:
Indeed, most algorithms perform quite well for the R datasets and not so well for the other problems, possibly because they only fit least squares, disregarding other possibilities. My suggestion is to remove those from SRBench 3.0 and reintroduce them in SRBench 4.0 under a different track (non-Gaussian distributions).
In another analysis, I picked the top-10 algorithms w.r.t. r2_test and removed from the list those datasets that:
With this procedure we end up with 85 datasets. If we only keep those with domain R, we end up with 58 datasets (all of them Friedman :-) ).
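The removal criteria themselves are not listed above, so only the first step (picking the top-10 algorithms by r2_test) is sketched here; the long-form schema of the results table is an assumption:

```python
import pandas as pd

def top_algorithms(results, n=10):
    """results: long-form DataFrame with (assumed) columns
    ['algorithm', 'dataset', 'r2_test'].

    Rank algorithms by median test R^2 across datasets and keep the top n.
    """
    medians = results.groupby("algorithm")["r2_test"].median()
    return medians.sort_values(ascending=False).head(n).index.tolist()
```

The dataset-filtering step would then look only at the rows of `results` belonging to these algorithms.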
Some remarks on the Z+ uniqueness table above:
Datasets with `_visualizing_` in the identifier are from William S. Cleveland: *Visualizing Data*.

- `690_visualizing_galaxy`: target is "velocity relative to the earth".

Datasets with `_analcatdata_` in the identifier are from Jeffrey S. Simonoff: *Analyzing Categorical Data*.

- `485_analcatdata_vehicle`: "Margolis et al. (2000) analyzed data for 1991-1996 from the Fatality Analysis Reporting System, a nationwide registry of motor vehicle deaths in the United States. They reported deaths of children (younger than 16) cross-classified by type (passenger or pedestrian/bicyclist), gender, age, and whether the accident was alcohol related. The data are given in the file vehicle."
- `523_analcatdata_neavote`: "Many political, civic, and business organizations lobby state and federal legislators to try to influence their votes on key legislative matters. These organizations often track the voting patterns of legislators in the form of "report cards," which summarize the votes of legislators on key pieces of legislation. The file neavote summarizes the 1999 U.S. Senate legislative report card produced by the National Education Association (NEA, the nation's oldest and largest organization committed to advancing the cause of public education), based on 10 Senate votes during the year, and was obtained from the NEA's web site (www.nea.org)."
- `557_analcatdata_apnea1`: "An automated algorithm was also used to classify time periods into the same five categories using nasal pressure and mechanical respiratory input impedance." "The file apnea1 contains the results of comparisons of second-by-second evaluations by the two scorers for each of the 19 subjects." A more elaborate description is given in the book.
- `556_analcatdata_apnea2`: "The files apnea2 and apnea3 give results comparing the automated algorithm to scorer 1 and scorer 2, respectively. (The scorer values are not directly comparable to those in apnea1, because they were generated based on cross-validated test sets.)"

These files (and many others in PMLB) were most likely taken verbatim from StatLib (http://lib.stat.cmu.edu/datasets/), which references the sources.
Also relevant is the effort by @alexzwanenburg (https://github.com/EpistasisLab/pmlb/pull/180) who invested a lot of time to identify duplicates and clean up some of the datasets in PMLB.
Thanks @gkronber, this was going to be my next step (searching for duplicates), so this PR will make my life much easier :-)
~there is a PR on the PMLB repo that fixes many of these issues. might be a good place to start~ oh, I see @gkronber mentioned this
In a recent study (https://dl.acm.org/doi/abs/10.1145/3597312) I noticed that the differences between the top-N (N = 15 or more) algorithms are insignificant on most datasets. They only differ on a small selection of the Friedman datasets. Maybe it is a good idea to separate the comparison of the algorithms into different groups:
Given this, my other proposal is to add the benchmarks from those two competitions, plus the one proposed by @MilesCranmer, into the benchmark. For the 2023 competition I can also generate datasets with different levels of noise and other nasty features! Also, we can grab other benchmark functions from multimodal optimization to create more of those.