cavalab / srbench

A living benchmark framework for symbolic regression
https://cavalab.org/srbench/
GNU General Public License v3.0

Ground-truth datasets are broken? #54

Closed yoshitomo-matsubara closed 2 years ago

yoshitomo-matsubara commented 3 years ago

Hi!

Thank you for your great work and framework! I wanted to try the benchmarked methods for the ground-truth datasets (i.e., Feynman and Strogatz datasets) and followed the instructions in README.

Is each of the datasets not in gzip format?

However, the datasets fetched from the pmlb repository look broken. Here is one of the errors I got for the Strogatz datasets when running
python analyze.py -results ../results_sym_data -target_noise 0.0 "/data/pmlb/datasets/strogatz*" -sym_data -n_trials 10 -time_limit 9:00 -tuned --local
(The same errors occurred for the Feynman datasets with "/data/pmlb/datasets/feynman_*".)

========================================
Evaluating tuned.FEATRegressor on
/data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
========================================
compression: gzip
filename: /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
Traceback (most recent call last):
  File "evaluate_model.py", line 291, in <module>
    **eval_kwargs)
  File "evaluate_model.py", line 39, in evaluate_model
    features, labels, feature_names = read_file(dataset)
  File "/opt/app/srbench/experiment/read_file.py", line 19, in read_file
    engine='python')
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/python_parser.py", line 100, in __init__
    self._make_reader(self.handles.handle)
  File "/opt/conda/envs/srbench/lib/python3.7/site-packages/pandas/io/parsers/python_parser.py", line 203, in _make_reader
    line = f.readline()
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/opt/conda/envs/srbench/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 474, in read
    if not self._read_gzip_header():
  File "/opt/conda/envs/srbench/lib/python3.7/gzip.py", line 422, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b've')

I also tried to gunzip the file manually, but gzip also reports that it is not in gzip format:

$ gunzip /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz
gzip: /data/pmlb/datasets/strogatz_bacres1/strogatz_bacres1.tsv.gz: not in gzip format

Could you please resolve this issue for both Feynman and Strogatz datasets? Thank you!

lacava commented 3 years ago

can you confirm you ran 'git lfs fetch' in the pmlb repo? looks like they may be git lfs references still. i need to update the instructions as well since feynman and strogatz datasets are now in master in pmlb
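
One way to check this without re-running the whole pipeline is to look at the first bytes of a dataset file. A real gzip file starts with the magic bytes 0x1f 0x8b, while a Git LFS pointer is a small text file whose first line starts with "version https://git-lfs", which would be consistent with the b've' in the OSError above. The sketch below is only an illustration: the helper name classify_file is hypothetical and is not part of srbench or PMLB.

```python
# Hypothetical helper (not part of srbench or PMLB): tell a real gzip file
# apart from a Git LFS pointer file by inspecting its first bytes.

GZIP_MAGIC = b"\x1f\x8b"                      # first two bytes of any gzip stream
LFS_POINTER_PREFIX = b"version https://git-lfs"  # start of a Git LFS pointer file


def classify_file(path):
    """Return 'gzip', 'lfs-pointer', or 'unknown' for the file at `path`."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LFS_POINTER_PREFIX):
        return "lfs-pointer"
    return "unknown"
```

If classify_file reports 'lfs-pointer' for a .tsv.gz file, the working tree still holds the LFS reference rather than the dataset itself.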

yoshitomo-matsubara commented 3 years ago

Hi @lacava Thank you for the response.

Yes, I did run git lfs fetch for the feynman branch. A few minutes ago, I also fetched the master branch, but the downloaded tsv.gz files still look the same and are not in gzip format (they produce the same error as shown above).

yoshitomo-matsubara commented 3 years ago

I think we need git lfs pull instead of git lfs fetch. analyze.py now seems to work with the downloaded datasets.

lacava commented 3 years ago

glad you found a solution. i believe git lfs fetch only downloads the LFS objects, while git lfs pull additionally checks them out into the working tree in place of the pointer files. I'll update the instructions for the main PMLB branch asap.

yoshitomo-matsubara commented 3 years ago

Thank you for updating the repo! I'll close this issue

yoshitomo-matsubara commented 3 years ago

@lacava It looks like the feynman datasets in PMLB are still incomplete.

The metadata.yaml files in the strogatz datasets look complete, and analyze.py works with those datasets. However, the metadata.yaml files in the feynman datasets are incomplete (description = 'None yet. See our contributing guide to help us add one.'), so get_sym_model cannot extract model_str (the equation?) and analyze.py fails as follows:

========================================
Evaluating tuned.FEATRegressor on
/opt/pmlb/datasets/feynman_III_10_19/feynman_III_10_19.tsv.gz
========================================
compression: gzip
filename: /opt/pmlb/datasets/feynman_III_10_19/feynman_III_10_19.tsv.gz
Traceback (most recent call last):
  File "evaluate_model.py", line 291, in <module>
    **eval_kwargs)
  File "evaluate_model.py", line 41, in evaluate_model
    true_model = get_sym_model(dataset)
  File "/opt/app/srbench/experiment/symbolic_utils.py", line 239, in get_sym_model
    model_str = [ms for ms in description if '=' in ms][0].split('=')[-1]                                     
IndexError: list index out of range

lacava commented 3 years ago

thanks for checking. hm, some of the changes didn't make it into master... i'll look into it.
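
For context, the IndexError in the traceback above comes from taking element [0] of an empty list: no line of the placeholder description contains '='. A minimal sketch of a more defensive version of that one line is shown here. It assumes the description is available as a list of lines, as the list comprehension in the traceback suggests; the function name extract_model_str and the error message are hypothetical and are not the actual srbench code.

```python
# Hypothetical, defensive variant of the line that fails in get_sym_model.
# Assumes `description_lines` is a list of strings from metadata.yaml.

def extract_model_str(description_lines):
    """Return the right-hand side of the first line that contains '='."""
    candidates = [line for line in description_lines if "=" in line]
    if not candidates:
        # Fail with an explicit message instead of a bare IndexError.
        raise ValueError(
            "no equation found in the dataset description; "
            "metadata.yaml may be incomplete"
        )
    return candidates[0].split("=")[-1].strip()
```

This does not fix incomplete metadata; it only makes the failure easier to diagnose.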

lacava commented 3 years ago

issued a PR on PMLB to resolve: https://github.com/EpistasisLab/pmlb/pull/158. will check back once it is merged into master.

yoshitomo-matsubara commented 3 years ago

@lacava Thank you for the update! Let me know here once it's merged into master

lacava commented 3 years ago

merged, please update PMLB

marcovirgolin commented 2 years ago

Hi, I am trying this out myself now, and this time I get an error with all Strogatz problems (the Feynman ones run fine). Namely, when running python analyze.py -script assess_symbolic_model as indicated in the README, I get errors like the one shown below:

========================================
Assessing tuned.GPGOMEARegressor model for 
../../pmlb/datasets/strogatz_predprey2/strogatz_predprey2.tsv.gz
========================================
looking for: ../results_sym_data/strogatz_predprey2//strogatz_predprey2_tuned.GPGOMEARegressor_860.json
['This is one state of a 2-state dynamic model for predator-prey populations. ', '', '$\\dot{x} = x  \\cdot \\left( 4 - x - \\frac{y}{1+x} \\right)$', '$\\dot{y} = y \\cdot \\left( \\frac{x}{1+x} - 0.075 \\cdot y \\right)$', '', 'It is adapted from Steven Strogatz\'s book "Chaos and Nonlinear Dynamics".  ', 'Each strogatz ODE system can exhibit chaotic and/or nonlinear behavior. ', 'For the purposes of modeling, these systems are simulated using initial conditions within stable basins of attraction. ', 'The systems are simulated using simulink and matlab. ', '']
ValueError: Error from parse_expr with transformed code: "x   \\Symbol ('cdot' ) \\Function ('left' )(Integer (4 )-x - \\frac {y }{Integer (1 )+x } \\Symbol ('right' ))$"

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "assess_symbolic_model.py", line 158, in <module>
    feature_noise=args.X_NOISE)
  File "assess_symbolic_model.py", line 111, in assess_symbolic_model
    assess_symbolic_model_from_file(save_file+'.json', dataset)
  File "assess_symbolic_model.py", line 42, in assess_symbolic_model_from_file
    true_model = get_sym_model(dataset, return_str=False)
  File "/export/scratch1/home/virgolin/srbench/experiment/symbolic_utils.py", line 246, in get_sym_model
    local_dict = {k:Symbol(k) for k in features})
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 1026, in parse_expr
    raise e from ValueError(f"Error from parse_expr with transformed code: {code!r}")
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 1017, in parse_expr
    rv = eval_expr(code, local_dict, global_dict)
  File "/export/scratch1/home/virgolin/anaconda3/envs/srbench/lib/python3.7/site-packages/sympy/parsing/sympy_parser.py", line 912, in eval_expr
    code, global_dict, local_dict)  # take local objects in preference
  File "<string>", line 1
    x   \Symbol ('cdot' ) \Function ('left' )(Integer (4 )-x - \frac {y }{Integer (1 )+x } \Symbol ('right' ))$
                                                                                                              ^
SyntaxError: unexpected character after line continuation character
 python analyze.py \
            -script assess_symbolic_model \{'INPUT_FILE': '../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz', 'ALG': 'tuned.GPGOMEARegressor', 'RDIR': '../results_sym_data/strogatz_shearflow2/', 'RANDOM_STATE': 860, 'TEST': False, 'Y_NOISE': 0.0, 'X_NOISE': 0.0, 'SYM_DATA': True, 'JSON_FILE': ''}

I do see that the "true_model" field in the .json results for Strogatz has a trailing $.

Perhaps it suffices to add a

model_str = model_str.replace("$","")

in symbolic_utils.get_sym_model?

I'd do a PR but I am not sure whether this is (somehow) a problem only I got, since I see nobody else raising it.

EDIT: removing the $ is not enough
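
A rough illustration of why stripping $ alone falls short: the description line is LaTeX, so \cdot, \left, \right and \frac{..}{..} would also need translating before parse_expr could read it. The sketch below handles only the non-nested constructs visible in the log above. The helper name latex_rhs_to_plain is hypothetical and is not what srbench does.

```python
import re

# Hypothetical helper: convert the small LaTeX subset seen in the Strogatz
# descriptions into a plain arithmetic string. Not a general LaTeX parser.

_FRAC = re.compile(r"\\frac\s*\{([^{}]*)\}\s*\{([^{}]*)\}")


def latex_rhs_to_plain(latex_line):
    """Return the right-hand side of a '$lhs = rhs$' line as plain text."""
    text = latex_line.strip().strip("$")
    rhs = text.split("=")[-1]
    # Drop sizing commands and turn \cdot into multiplication.
    rhs = rhs.replace("\\left", "").replace("\\right", "")
    rhs = rhs.replace("\\cdot", "*")
    # Rewrite non-nested \frac{a}{b} as ((a)/(b)) until none remain.
    while _FRAC.search(rhs):
        rhs = _FRAC.sub(r"((\1)/(\2))", rhs)
    # Collapse runs of whitespace.
    return " ".join(rhs.split())


example = r"$\dot{x} = x  \cdot \left( 4 - x - \frac{y}{1+x} \right)$"
print(latex_rhs_to_plain(example))
```

Storing the equation in a parser-friendly form in the metadata avoids this kind of conversion altogether.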

lacava commented 2 years ago

hi @marcovirgolin

you caught a set of changes I hadn't pushed into PMLB.

once the checks complete on https://github.com/EpistasisLab/pmlb/pull/160, you can update from the pmlb master branch. for now you can check out the strogatz_metadata branch. it seems to work for me on your example:

srbench/experiment$ python assess_symbolic_model.py ../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz -ml tuned.GPGOMEARegressor -results ../../analysis/results_sym_data_new/strogatz_shearflow2/ -seed 860
{'INPUT_FILE': '../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz', 'ALG': 'tuned.GPGOMEARegressor', 'RDIR': '../../analysis/results_sym_data_new/strogatz_shearflow2/', 'RANDOM_STATE': 860, 'TEST': False, 'Y_NOISE': 0.0, 'X_NOISE': 0.0, 'SYM_DATA': False, 'JSON_FILE': ''}
========================================
Assessing tuned.GPGOMEARegressor model for
../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz
========================================
looking for: ../../analysis/results_sym_data_new/strogatz_shearflow2//strogatz_shearflow2_tuned.GPGOMEARegressor_860.json
> /mnt/d/projects/symbolic-regression/srbench/experiment/symbolic_utils.py(244)get_sym_model()
-> model_sym = parse_expr(model_str,
(Pdb) c
compression: gzip
filename: ../../../pmlb/datasets/strogatz_shearflow2/strogatz_shearflow2.tsv.gz
replacing feature 0 with x
replacing feature 1 with y
parsing 0.000170+2.307729*(((((cos(sin(y))*PLOG(PLOG(14.465000)))*cos((cos(y)/(-11.097000--13.964000))))+cos(-20.929000))*sin(x)))
{'x': x, 'y': y, 'add': <class 'sympy.core.add.Add'>, 'mul': <class 'sympy.core.mul.Mul'>, 'max': Max, 'min': Min, 'sub': <function sub at 0x7f4e8ed32790>, 'div': <function div at 0x7f4e8d7e2040>, 'square': <function square at 0x7f4e8d7e20d0>, 'cube': <function cube at 0x7f4e8d7e2160>, 'quart': <function quart at 0x7f4e8d7e21f0>, 'PLOG': <function PLOG at 0x7f4e8d7e2280>, 'PLOG10': <function PLOG at 0x7f4e8d7e2280>, 'PSQRT': <function PSQRT at 0x7f4e8d7e23a0>}
round_floats
rounded: 2.31*(0.983*cos(sin(y))*cos(0.349*cos(y)) - 0.487)*sin(x)
simplify...
simplified: (2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x)
saving...
sym_diff: -(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x) + (0.1*sin(y)**2 + cos(y)**2)*sin(x)
sym_frac: (2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)/(0.1*sin(y)**2 + cos(y)**2)
simplified sym_diff: (-0.9*sin(y)**2 - 2.27*cos(sin(y))*cos(0.349*cos(y)) + 2.12)*sin(x)
{
    "dataset": "strogatz_shearflow2",
    "algorithm": "tuned.GPGOMEARegressor",
    "params": {
        "caching": false,
        "classweights": false,
        "elitism": 1,
        "erc": true,
        "evaluations": 1000000,
        "functions": "+_-_*_p/_plog_sqrt_sin_cos",
        "generations": -1,
        "gomea": true,
        "gomfos": "LT",
        "ims": false,
        "initmaxtreeheight": 6,
        "linearscaling": true,
        "maxsize": 1000,
        "maxtreeheight": 17,
        "parallel": false,
        "popsize": 1000,
        "prob": "symbreg",
        "reproduction": 0.0,
        "sbagx": 0.0,
        "sblibtype": false,
        "sbrdo": 0.0,
        "seed": -1,
        "silent": true,
        "subcross": 0.5,
        "submut": 0.5,
        "syntuniqinit": 1000,
        "time": 28800,
        "tournament": 4,
        "unifdepthvar": true
    },
    "random_state": 860,
    "process_time": 133.882689869,
    "time_time": 133.97960495948792,
    "target_noise": 0.0,
    "feature_noise": 0.0,
    "true_model": "(0.1*sin(y)**2 + cos(y)**2)*sin(x)",
    "model_size": 21,
    "symbolic_model": "0.000170+2.307729*(((((cos(sin(x1))*plog(plog(14.465000)))*cos((cos(x1)p/(-11.097000--13.964000))))+cos(-20.929000))*sin(x0)))",
    "mse_train": 1.2889293272279269e-06,
    "mae_train": 0.0008988973869460174,
    "r2_train": 0.9999751463811574,
    "mse_test": 1.3140537910085879e-06,
    "mae_test": 0.0009068998433162603,
    "r2_test": 0.9999816769614173,
    "simplified_symbolic_model": "(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)*sin(x)",
    "simplified_complexity": 15,
    "symbolic_error": "(-0.9*sin(y)**2 - 2.27*cos(sin(y))*cos(0.349*cos(y)) + 2.12)*sin(x)",
    "symbolic_fraction": "(2.27*cos(sin(y))*cos(0.349*cos(y)) - 1.12)/(0.1*sin(y)**2 + cos(y)**2)",
    "symbolic_error_is_zero": false,
    "symbolic_error_is_constant": false,
    "symbolic_fraction_is_constant": false
}
saving...
done.

lacava commented 2 years ago

https://github.com/EpistasisLab/pmlb/pull/160 was merged. update PMLB from git and you should be good to go.