MilesCranmer / PySR

High-Performance Symbolic Regression in Python and Julia
https://ai.damtp.cam.ac.uk/pysr
Apache License 2.0

PySRSequenceRegressor #677

Open wenbang24 opened 4 months ago

wenbang24 commented 4 months ago

Resolve #94

Adds the ability to use symbolic regression on 1D recurrent arrays. Adds a new keyword argument to PySRRegressor called recursive_history_length, which must be at least 1, or a ValueError is raised.

Here is an example of how it works: Suppose you have the array [1,2,3,4,5,6,7,8,9] and recursive_history_length = 3. Then the feature array will be

[[1 2 3]
 [2 3 4]
 [3 4 5]
 [4 5 6]
 [5 6 7]
 [6 7 8]]

And the target array will be:

[4, 5, 6, 7, 8, 9]

And from there, fitting proceeds as usual.
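
For illustration, here is a minimal sketch of that windowing using numpy's sliding_window_view (not necessarily how it is implemented in this PR):

import numpy as np

def make_lagged_features(seq, history_length):
    # Each row of X is a window of past values; y is the value right after that window.
    seq = np.asarray(seq)
    X = np.lib.stride_tricks.sliding_window_view(seq, history_length)[:-1]
    y = seq[history_length:]
    return X, y

X, y = make_lagged_features([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
# X[0] == [1, 2, 3] and y[0] == 4; X[-1] == [6, 7, 8] and y[-1] == 9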

MilesCranmer commented 4 months ago

Thanks! Regarding the design, I wonder if this shouldn’t go into PySRRegressor itself, but rather be a separate input transform that you would compose with scikit-learn’s Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline

This would be nice because it would automatically update .predict as well without extra internal code.
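
Something along these lines (a rough sketch with a hypothetical LaggedFeatures transformer, not code from this PR; note the target still has to be shifted by hand, since Pipeline only transforms X):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from pysr import PySRRegressor

class LaggedFeatures(BaseEstimator, TransformerMixin):
    # Hypothetical transformer: turn a 1D sequence into sliding windows of past values.
    def __init__(self, history_length=3):
        self.history_length = history_length

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        seq = np.asarray(X).ravel()
        return np.lib.stride_tricks.sliding_window_view(seq, self.history_length)[:-1]

seq = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
pipe = Pipeline([("lags", LaggedFeatures(history_length=3)), ("sr", PySRRegressor())])
pipe.fit(seq, seq[3:])     # one target per window, aligned manually
preds = pipe.predict(seq)  # .predict applies the same windowing automatically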

Basically I think PySRRegressor is already too large and I’d like to break it into smaller, composable pieces, so it’s easier for users to implement custom behavior.

Alternatively maybe this could be a separate class PySRSequenceRegressor that inherits from PySRRegressor but wraps the methods with transforms.

Also please add some tests, as well as tests for how this interacts with X_units, variable_names, weights, complexity_of_variables, etc.

wenbang24 commented 4 months ago

test_sequence_weighted_bumper test still isn't passing; how do the weights work?

wenbang24 commented 4 months ago

One test (test_sequence_noisy_builtin_variable_names) isn't passing because set_params doesn't work when PySRSequenceRegressor calls super().__init__()

MilesCranmer commented 4 months ago

I think for weights, the time series approach should simply weight by time. So the weights should be shape (n_times,). It should be one weight for every predicted value.
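
(With the example above, a length-9 sequence and recursive_history_length = 3, that means 6 predicted values, so the weights would have shape (6,).)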

wenbang24 commented 4 months ago

> I think for weights, the time series approach should simply weight by time. So the weights should be shape (n_times,). It should be one weight for every predicted value.

yes I think that is how it currently works; I'll try and make the docstring a bit clearer

wenbang24 commented 4 months ago

> General comment: should PySRSequenceRegressor allow for multiple features? Right now it is just a scalar feature over time. But what if you want to have two features over time, like X_t and Z_t, and get recursive information for both of them?
>
> (Just wondering why/why not)

Yes, it does support multidimensional data, but since everything is flattened, the output looks unintuitive.

MilesCranmer commented 4 months ago

> Yes, it does support multidimensional data, but since everything is flattened, the output looks unintuitive.

I think the loop that constructs the recurrence history goes up to len(y) but it should just be y.shape[0], right?

although if rewritten with np.roll, this should be automatic

coveralls commented 4 months ago

Pull Request Test Coverage Report for Build 11267863294

Changes missing coverage (covered lines / changed or added lines / %):
  pysr/regressor_sequence.py: 92 / 100 / 92.0%
  Total: 108 / 116 / 93.1%

Files with coverage reduction (new missed lines / %):
  pysr/export_sympy.py: 3 / 77.27%

Totals coverage status:
  Change from base Build 11225120237: -0.02%
  Covered Lines: 1245
  Relevant Lines: 1325

💛 - Coveralls

wenbang24 commented 3 months ago

oh is that bad?

wenbang24 commented 3 months ago

I don't think the type checking is ever going to work :( because PySRSequenceRegressor doesn't take y or y_units, but PySRRegressor does, so the signatures will always be different. Also, the return type is different?

wenbang24 commented 3 months ago

nvm type checking works 🎉

MilesCranmer commented 3 months ago

Cool! Yeah you can also generally use cast to tell the type checker what the type is supposed to be – it's good for getting around type issues that aren't actually real problems.
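
For instance (illustrative only; model here stands for an already-fitted PySRRegressor):

from typing import cast
import pandas as pd

# Narrow the type for the checker only; cast() has no runtime effect.
equations = cast(pd.DataFrame, model.equations_)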

MilesCranmer commented 3 months ago

Btw you can do pre-commit install to have the precommit checks run locally rather than as additional git commits.

MilesCranmer commented 3 months ago

For the predict step I am also wondering if it should recursively roll out the prediction up to some user-specified time in the future? Basically, if the user provides the sequence from 1 to n, for a history length h, and asks for an extra m steps, the result would be times h through n+m? Right now we only predict up to n which doesn't really make sense because the user has already provided that time point.
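
For illustration, the rollout might look something like this (a sketch, not this PR's implementation; predict_next stands in for however a single-step prediction is obtained, e.g. a fitted regressor's predict):

import numpy as np

def rollout(seq, history_length, num_steps, predict_next):
    # Extend `seq` by `num_steps` values, feeding each prediction back in as history.
    values = list(np.asarray(seq, dtype=float).ravel())
    for _ in range(num_steps):
        window = np.asarray(values[-history_length:]).reshape(1, -1)
        values.append(np.ravel(predict_next(window))[0])
    return np.asarray(values)

# e.g. rollout([1, 1, 2, 3, 5, 8], history_length=2, num_steps=3, predict_next=model.predict)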

wenbang24 commented 3 months ago

oh my bad I was not supposed to overwrite example.py

wenbang24 commented 3 months ago

> For the predict step I am also wondering if it should recursively roll out the prediction up to some user-specified time in the future? Basically, if the user provides the sequence from 1 to n, for a history length h, and asks for an extra m steps, the result would be times h through n+m? Right now we only predict up to n which doesn't really make sense because the user has already provided that time point.

maybe instead of doing the padding with NaNs thing, we can do this? also we should probably let user pick how many times they want to extrapolate

MilesCranmer commented 3 months ago

Yep sounds good

wenbang24 commented 3 months ago

bad news: since PySRRegressor is now an attribute of PySRSequenceRegressor, all of the attributes like model.equations_ and methods like model.get_best() don't work anymore :( do we revert back to inheriting from PySRRegressor, or is there some workaround?

wenbang24 commented 3 months ago

woah, delegation is pretty cool, but I keep getting: sklearn.exceptions.NotFittedError: This PySRRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

wenbang24 commented 3 months ago

for some reason:

self._regressor.__repr__ = <bound method PySRRegressor.__repr__ of PySRRegressor.equations_ = [
[
           pick  score       equation       loss  complexity
        0          0.0          x1t_3  72.666664           1
        1  >>>>    inf  x0t_2 + x0t_1   0.000000           3
], [
           pick  score       equation       loss  complexity
        0          0.0          x1t_2  130.66667           1
        1  >>>>    inf  x1t_1 + x1t_2    0.00000           3
]]>

wenbang24 commented 3 months ago

btw this is the code

def __getattr__(self, name):
    # Delegate attribute lookups to the wrapped PySRRegressor.
    try:
        return getattr(self._regressor, name)
    except AttributeError:
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'"
        )

def __delattr__(self, name):
    # Never delete the wrapped regressor itself.
    if name == "_regressor":
        print("no delete regressor")
    else:
        delattr(self._regressor, name)

def __setattr__(self, name, value) -> None:
    # `_regressor` lives on the wrapper itself; everything else is forwarded.
    if name == "_regressor":
        object.__setattr__(self, name, value)
    else:
        setattr(self._regressor, name, value)
        print(self._regressor.__dict__[name])  # debug output

wenbang24 commented 3 months ago

all tests passing except for test_sequence_variable_names: it says the truth value of an array with more than one element is ambiguous for n_features == 1, which is weird because it should be an integer given the type hint: def _construct_variable_names(self, n_features: int, variable_names=None)

wenbang24 commented 3 months ago

are there any other changes that need to be made?

MilesCranmer commented 3 months ago

I think as a general guiding principle for this PR, since it implements a lot of new methods, we should try to get the testing to 100% code coverage on the git diff. If we are at less than 100% diff coverage, then it points to things that we can take out (we should prefer less code rather than more... it makes my job easier). For example, most of the @property methods are not needed; with code coverage we would be able to see that automatically (and without needing my review so often).

wenbang24 commented 3 months ago

so I rewrote (copied) from_file(), but it doesn't work: ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements. I'll try to resolve this in the afternoon

wenbang24 commented 3 months ago

I think PySRSequenceRegressor might have to pickle the recursive history length to resolve that ValueError: pandas thinks there should only be 1 input feature, but because the recursive history length is 2, there are 2x1 = 2 input features, which messes stuff up. Alternatively, we could change n_features_in and nout manually?

wenbang24 commented 3 months ago

ok, I don't think it's really worth pickling the recurrence history length

wenbang24 commented 3 months ago

I don't think my coverage thing is working; how do I check coverage with coveralls?

MilesCranmer commented 3 months ago

For coverage I usually use:

coverage run -m pysr test main,cli,startup

You need to install the coverage python package

If using VSCode I like to use Coverage Gutters: https://marketplace.visualstudio.com/items?itemName=ryanluker.vscode-coverage-gutters

MilesCranmer commented 3 months ago

> I think PySRSequenceRegressor might have to pickle recursive history length

This should be automatic though, right? __getstate__ normally just saves all parameters from init, recursively calling __getstate__ on all linked objects. No need for a custom method.
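
For example, something like this round trip should preserve the parameter without any custom pickling code (a sketch; the exact constructor arguments may differ from the final API):

import pickle

from pysr import PySRSequenceRegressor  # assuming this is where the class is exported

model = PySRSequenceRegressor(recursive_history_length=3)
restored = pickle.loads(pickle.dumps(model))
assert restored.recursive_history_length == 3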

MilesCranmer commented 3 months ago

Also I just noticed – environment.yml should have the numpy version updated to 1.20 minimum

wenbang24 commented 3 months ago

@MilesCranmer I think things are good to go; can you review please?

wenbang24 commented 3 months ago

*docstring

wenbang24 commented 3 months ago

so latex table outputs x_{t-0} now. wdyt?

MilesCranmer commented 2 months ago

Sorry for so many requests, but do you think you could add an example of using it to the documentation? Currently it is undocumented.

MilesCranmer commented 2 months ago

> so latex table outputs x_{t-0} now. wdyt?

Why only t-0 instead of (t-1, t-2, t-3, ...)?

MilesCranmer commented 2 months ago

I pushed a few changes to make the printing look nicer:

[Screenshot 2024-08-26 at 14:35:03]

Let me know what you think.

MilesCranmer commented 2 months ago

Something is a bit weird about multiple features. I guess one issue with this is that right now multiple outputs are fit completely separately, rather than together as a single equation. I'm not sure how to work with this.... What do you think?

wenbang24 commented 2 months ago

I think that although one equation would simplify things, there are so many recurrence relations with multiple equations that wouldn't work anymore, so I think we should keep multioutput

wenbang24 commented 2 months ago

also the pretty printing is indeed pretty pretty; thanks

wenbang24 commented 2 months ago

I think stuff breaks without np.array(X)

MilesCranmer commented 2 months ago

> I think stuff breaks without np.array(X)

Weird, normally scikit-learn's checking function should automatically convert to a numpy array. With the normal PySRRegressor, you can pass lists or pandas DataFrames and they will get converted by scikit-learn to a numpy array. Is there a reason scikit-learn's conversion doesn't work here?

wenbang24 commented 2 months ago

yeah it works, my bad. But the examples look a bit weird with X.append(X[-1][0] + X[-2][0]); should I do that, or X = np.append(X, X[-1] + X[-2])?

wenbang24 commented 2 months ago

also there's a test failing in TestDimensionalConstraints and idk what's going on with it, but I think it's something to do with a display_variable_names parameter messing up positional arguments

wenbang24 commented 2 months ago

ok, I stopped the error; is that the right way to fix it?

wenbang24 commented 2 months ago

how do we fix the typing issue?? I'm stumped

wenbang24 commented 2 months ago

yo all checks passed 🎉

wenbang24 commented 2 months ago

@MilesCranmer anything else I need to do before merging?

MilesCranmer commented 2 months ago

Sorry for not reviewing the latest yet, got back from holiday recently and still catching up.

In the meantime you might consider some “dogfooding” (https://en.m.wikipedia.org/wiki/Eating_your_own_dog_food) of PySRSequenceRegressor on some real problems that you have in mind? Just by using it on some real problems you will be able to find ways of improving the API. Maybe look up some math sequences or simple physics problems and try it out, and see what breaks and what works well, and what parts of the API are easy and what parts are confusing. (This is how I develop PySR in general - I’m both a user and a dev!)
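
For example, something as simple as a Fibonacci-style sequence could be a first dogfooding test (a rough sketch; it assumes the current fit(X) interface, that keyword arguments are forwarded to PySRRegressor, and the shape handling may need adjusting):

import numpy as np
from pysr import PySRSequenceRegressor  # assuming this is the exported name

# Fibonacci-like sequence; ideally the search recovers something like x_{t-1} + x_{t-2}.
seq = [1.0, 1.0]
for _ in range(30):
    seq.append(seq[-1] + seq[-2])

model = PySRSequenceRegressor(
    recursive_history_length=2,
    binary_operators=["+", "*"],
    niterations=20,
)
model.fit(np.asarray(seq))  # may need .reshape(-1, 1) depending on the final API
print(model)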

Anyways I hope to get to reviewing soon.