cavalab / srbench

A living benchmark framework for symbolic regression
https://cavalab.org/srbench/
GNU General Public License v3.0

New datasets + reorganization of current benchmarks #153

Open folivetti opened 1 year ago

folivetti commented 1 year ago

In a recent study (https://dl.acm.org/doi/abs/10.1145/3597312) I noticed that the differences between the top-N (N = 15 or more) algorithms are insignificant on most datasets; they only differ on a small selection of the Friedman datasets. Maybe it is a good idea to separate the comparison of the algorithms into different groups:

Given this, my other proposal is to add the benchmarks from those two competitions, plus the one proposed by @MilesCranmer, to the suite. For the 2023 competition I can also generate datasets with different levels of noise and other nasty features! We could also grab benchmark functions from multimodal optimization to create more of those.
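
For context, a hedged sketch of how such a significance check could be run (not necessarily the procedure from the paper), using scipy's Friedman test over per-dataset scores:

```python
# A minimal sketch, assuming a pandas DataFrame `scores` with one row per
# dataset and one column per algorithm holding r2_test values; the names
# `scores` and `top_n_indistinguishable` are illustrative, not from the paper.
import pandas as pd
from scipy.stats import friedmanchisquare

def top_n_indistinguishable(scores: pd.DataFrame, n: int = 15, alpha: float = 0.05) -> bool:
    """Return True if the Friedman test cannot distinguish the top-n algorithms."""
    top = scores.median().nlargest(n).index          # top-n algorithms by median r2_test
    stat, p = friedmanchisquare(*(scores[a] for a in top))
    return p >= alpha                                # True: no significant difference
```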

folivetti commented 7 months ago

As we discussed in our meeting, I'll make a first pass over the current results, identify potential datasets to remove from the benchmark, and suggest some possibilities for selecting and categorizing them.

folivetti commented 7 months ago

@lacava @foolnotion @MilesCranmer Some initial stats about PMLB! I've checked each dataset to see whether the target column was discrete or continuous and whether it had only positive values:

Count by domain: 
domain
R         74
R+        15
Z          1
Z+        30
{0, 1}     2
Name: dataset, dtype: int64

Count by type: 
type
continuous    89
discrete      33
Name: dataset, dtype: int64
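
For reference, a rough sketch of how these counts could be reproduced from PMLB; it assumes pmlb's fetch_data and regression_dataset_names (both real), while classify_domain is a hypothetical helper, not the exact code used for the table above:

```python
import numpy as np
import pandas as pd
from pmlb import fetch_data, regression_dataset_names

def classify_domain(y: np.ndarray) -> str:
    """Label the target's domain: R, R+, Z, Z+, or {0, 1}. Hypothetical helper."""
    vals = np.unique(y)
    is_int = np.allclose(vals, np.round(vals))
    if is_int and set(vals) <= {0.0, 1.0}:
        return "{0, 1}"
    if is_int:
        return "Z+" if vals.min() >= 0 else "Z"
    return "R+" if vals.min() >= 0 else "R"

rows = []
for name in regression_dataset_names:
    y = fetch_data(name)["target"].to_numpy(dtype=float)
    rows.append({"dataset": name,
                 "domain": classify_domain(y),
                 "type": "discrete" if np.allclose(y, np.round(y)) else "continuous"})

stats = pd.DataFrame(rows)
print("Count by domain:", stats.groupby("domain")["dataset"].count(), sep="\n")
print("Count by type:", stats.groupby("type")["dataset"].count(), sep="\n")
```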

titanic and banana are actually classification problems. All the others whose domain is not R may come from other distributions (Poisson, Gamma, etc.).

I've also checked the uniqueness of the Z+ datasets:

Uniqueness ratios below 100% of Z+ data: 
                     dataset  uniqueness_ratio  unique_values
0                   1027_ESL          0.018443              9
1                   1028_SWD          0.004000              4
2                   1029_LEV          0.005000              5
3                   1030_ERA          0.009000              9
4               1089_USCrime          0.893617             42
12                1595_poker          0.000010             10
14            195_auto_price          0.911950            145
15               197_cpu_act          0.006836             56
16                   201_pol          0.000733             11
17             207_autoPrice          0.911950            145
20              218_house_8L          0.089756           2045
22             227_cpu_small          0.006836             56
25           230_machine_cpu          0.555024            116
26       294_satellite_image          0.000932              6
29   485_analcatdata_vehicle          0.979167             47
32                519_vinnie          0.042105             16
34   523_analcatdata_neavote          0.080000              8
37                537_houses          0.186143           3842
40    556_analcatdata_apnea2          0.374737            178
41    557_analcatdata_apnea1          0.345263            164
43                   561_cpu          0.497608            104
44             562_cpu_small          0.006836             56
46               573_cpu_act          0.006836             56
47             574_house_16H          0.089756           2045
112      665_sleuth_case2002          0.129252             19
115        687_sleuth_ex1605          0.661290             41
116   690_visualizing_galaxy          0.643963            208
119      712_chscase_geyser1          0.225225             50
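
The uniqueness ratio here appears to be the number of distinct target values divided by the number of samples. A minimal sketch of the check, reusing the hypothetical `stats` frame and pmlb's fetch_data from the sketch above:

```python
# Report Z+ datasets whose targets contain repeated values (ratio < 1).
report = []
for name in stats.loc[stats["domain"] == "Z+", "dataset"]:
    y = fetch_data(name)["target"].to_numpy()
    n_unique = len(np.unique(y))
    ratio = n_unique / len(y)
    if ratio < 1.0:                       # only datasets with repeats
        report.append({"dataset": name,
                       "uniqueness_ratio": ratio,
                       "unique_values": n_unique})
print(pd.DataFrame(report))
```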

Some of them look like multiclass problems. Plotting an error bar of the median of medians of R² per domain, we get:

[figure: plot_domain — error-bar plot of the median of medians of R² per domain]

Indeed, most algorithms perform quite well on R and not so well on the other domains, possibly because they only fit least squares and disregard other likelihoods.

My suggestion is that we remove those from SRBench 3.0 and reinsert them in SRBench 4.0 under a different track (non-Gaussian distributions).
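
As one illustrative way to make that concrete (my sketch, not something from the thread): discrete, non-negative targets could be scored with a distribution-aware metric such as sklearn's mean_poisson_deviance alongside R²:

```python
# Illustrative only: score a count-valued target with both R^2 and Poisson
# deviance; both metrics exist in sklearn.metrics. Poisson deviance needs
# strictly positive predictions, hence the clip.
import numpy as np
from sklearn.metrics import r2_score, mean_poisson_deviance

def dual_score(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    y_pred = np.clip(y_pred, 1e-9, None)
    return {"r2": r2_score(y_true, y_pred),
            "poisson_deviance": mean_poisson_deviance(y_true, y_pred)}
```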

In another analysis, I picked the top-10 algorithms with respect to r2_test and removed from the list of datasets those that:

With this procedure we end up with 85 datasets. If we only keep those with domain R, we end up with 58 datasets (all of them Friedman :-) ).

gkronber commented 7 months ago

Some remarks on the Z+ uniqueness table above:

These files (and many others in PMLB) were most likely taken verbatim from StatLib (http://lib.stat.cmu.edu/datasets/) which references the sources.

Also relevant is the effort by @alexzwanenburg (https://github.com/EpistasisLab/pmlb/pull/180), who invested a lot of time identifying duplicates and cleaning up some of the datasets in PMLB.

folivetti commented 7 months ago

Thanks @gkronber, this was going to be my next step (searching for duplicates), so this PR will make my life much easier :-)

lacava commented 7 months ago

~there is a PR on the PMLB repo that fixes many of these issues. might be a good place to start~ oh, i see @gkronber mentioned this