Open wenbang24 opened 4 months ago
Thanks! Regarding the design, I wonder if this shouldn't go into `PySRRegressor` itself, but rather be a separate input transform that you would compose with scikit-learn's Pipeline: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
This would be nice because it would automatically update `.predict` as well, without extra internal code.
Basically I think `PySRRegressor` is already too large and I'd like to break it into smaller, composable pieces, so it's easier for users to implement custom behavior.
Alternatively, maybe this could be a separate class `PySRSequenceRegressor` that inherits from `PySRRegressor` but wraps the methods with transforms.
Also please add some tests, as well as tests for how this interacts with `X_units`, `variable_names`, `weights`, `complexity_of_variables`, etc.
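The Pipeline composition idea above could be sketched roughly like this. `LagFeatures` is a hypothetical transform (not part of PySR or scikit-learn), and `LinearRegression` stands in for where `PySRRegressor` would go:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class LagFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transform: expand a 1D sequence into lag windows."""

    def __init__(self, history_length=3):
        self.history_length = history_length

    def fit(self, X, y=None):
        return self  # stateless transform

    def transform(self, X):
        # Each row holds the previous `history_length` values.
        x = np.asarray(X).ravel()
        h = self.history_length
        return np.array([x[i : i + h] for i in range(len(x) - h)])

windows = LagFeatures(3).fit_transform(np.arange(1, 10))  # shape (6, 3)
pipe = Pipeline([("lags", LagFeatures(3)), ("regressor", LinearRegression())])
```

One caveat with this sketch: `Pipeline` only transforms `X`, not `y`, so the targets would still need to be aligned with the lagged windows outside the pipeline.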
The `test_sequence_weighted_bumper` test still isn't passing; how do the weights work?
One test (`test_sequence_noisy_builtin_variable_names`) isn't passing because `set_params` doesn't work when `PySRSequenceRegressor` calls `super().__init__()`
I think for `weights`, the time series approach should simply weight by time. So the weights should be shape `(n_times,)`. It should be one weight for every predicted value.
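A minimal sketch of that shape convention (variable names here are illustrative): with history length `h`, a length-`n` sequence yields `n - h` predicted values, so the per-time weights carry one entry per target.

```python
import numpy as np

h = 3                             # history length
y = np.arange(10.0)               # n = 10 time points
targets = y[h:]                   # the n - h predicted values
weights = np.ones_like(targets)   # one weight for every predicted value
```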
> I think for `weights`, the time series approach should simply weight by time. So the weights should be shape `(n_times,)`. It should be one weight for every predicted value.
yes I think that is how it currently works; I'll try and make the docstring a bit clearer
General comment: should
PySRSequenceRegressor
allow for multiple features? Right now it is just a scalar feature over time. But what if you want to have two features over time, likeX_t
andZ_t
, and get recursive information for both of them?(Just wondering why/why not)
Yes, it does support multidimensional data, but since everything is flattened, the output looks unintuitive.
> Yes, it does support multidimensional data, but since everything is flattened, the output looks unintuitive.
I think the loop that constructs the recurrence history goes up to `len(y)` but it should just be `y.shape[0]`, right? Although if rewritten with `np.roll`, this should be automatic.
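One loop-free way to build the recurrence history, sketched with `sliding_window_view` as an alternative to `np.roll` (names are illustrative, not the PR's actual code; `sliding_window_view` needs numpy ≥ 1.20):

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
h = 3  # history length

# All length-h windows, dropping the last one (which has no target).
X = np.lib.stride_tricks.sliding_window_view(y, h)[:-1]  # shape (len(y) - h, h)
targets = y[h:]                                          # one target per window
```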
| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pysr/regressor_sequence.py | 92 | 100 | 92.0% |
| **Total:** | 108 | 116 | 93.1% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pysr/export_sympy.py | 3 | 77.27% |
| **Total:** | 3 | |

| Totals | |
|---|---|
| Change from base Build 11225120237: | -0.02% |
| Covered Lines: | 1245 |
| Relevant Lines: | 1325 |
oh is that bad?
I don't think the type checking is ever going to work :( because `PySRSequenceRegressor` doesn't take `y` or `y_units`, but `PySRRegressor` does, so the signatures will always be different. Also, the return type is different?
nvm type checking works 🎉
Cool! Yeah, you can also generally use `cast` to tell the type checker what the type is supposed to be – it's good for getting around type issues that aren't actually real problems.
Btw, you can do `pre-commit install` to have the pre-commit checks run locally rather than as additional git commits.
For the `predict` step I am also wondering if it should recursively roll out the prediction up to some user-specified time in the future? Basically, if the user provides the sequence from `1` to `n`, for a history length `h`, and asks for an extra `m` steps, the result would be times `h` through `n+m`? Right now we only predict up to `n`, which doesn't really make sense because the user has already provided that time point.
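The rollout being proposed could look something like this hypothetical sketch: keep feeding the last `h` values back into a one-step model for `m` extra steps. Here `step` stands in for whatever one-step prediction the fitted model provides:

```python
import numpy as np

def rollout(y, h, m, step):
    """Append m recursively predicted values to the sequence y."""
    out = list(y)
    for _ in range(m):
        out.append(step(out[-h:]))  # predict one step from the last h values
    return np.asarray(out)

# e.g. with the recurrence y_t = y_{t-1} + y_{t-2}:
seq = rollout([1.0, 1.0], h=2, m=3, step=lambda w: w[-1] + w[-2])
```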
oh my bad, I was not supposed to overwrite `example.py`
> For the `predict` step I am also wondering if it should recursively roll out the prediction up to some user-specified time in the future? Basically, if the user provides the sequence from `1` to `n`, for a history length `h`, and asks for an extra `m` steps, the result would be times `h` through `n+m`? Right now we only predict up to `n`, which doesn't really make sense because the user has already provided that time point.
maybe instead of doing the padding-with-NaNs thing, we can do this? Also, we should probably let the user pick how many times they want to extrapolate.
Yep sounds good
bad news
since `PySRRegressor` is now an attribute of `PySRSequenceRegressor`, all of the attributes like `model.equations_` and functions like `model.get_best()` don't work anymore :(
do we revert back to inheriting from `PySRRegressor`, or is there some workaround?
woah delegation is pretty cool
but I keep getting `sklearn.exceptions.NotFittedError: This PySRRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.` for some reason
```
self._regressor.__repr__ = <bound method PySRRegressor.__repr__ of PySRRegressor.equations_ = [
[
   pick  score       equation       loss  complexity
0         0.0          x1t_3  72.666664           1
1  >>>>   inf  x0t_2 + x0t_1   0.000000           3
], [
   pick  score       equation       loss  complexity
0         0.0          x1t_2  130.66667           1
1  >>>>   inf  x1t_1 + x1t_2    0.00000           3
]]>
```
btw this is the code
```python
def __getattr__(self, name):
    # Only called when normal lookup fails; forward to the wrapped
    # regressor so attributes like `equations_` still resolve.
    try:
        return getattr(self._regressor, name)
    except AttributeError:
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'"
        )

def __delattr__(self, name):
    if name == "_regressor":
        print("no delete regressor")
    else:
        delattr(self._regressor, name)

def __setattr__(self, name, value) -> None:
    if name == "_regressor":
        object.__setattr__(self, name, value)
    else:
        setattr(self._regressor, name, value)
        print(self._regressor.__dict__[name])
```
all tests passing except for `test_sequence_variable_names` — it says `truth value of an array with more than one element is ambiguous` for `n_features == 1`, which is weird because it should be an integer because of the type hinting: `def _construct_variable_names(self, n_features: int, variable_names=None)`
are there any other changes that need to be made?
I think as a general guiding principle for this PR, since it implements a lot of new methods, we should try to have the testing get to 100% code coverage on the git diff. If we are at less than 100% diff code coverage, then it points to things that we can take out (we should prefer less code rather than more... makes my job easier). For example, most of the `@property` methods are not needed – with code coverage we would be able to see that automatically (which can be done without needing my review so often).
so I rewrote (copied) `from_file()`, but it doesn't work: `ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements`
I'll try and resolve this in the afternoon
I think `PySRSequenceRegressor` might have to pickle the recursive history length to resolve that value error - pandas thinks there should only be 1 input feature, but because the recursive history length is 2, there are 2×1 = 2 input features, which messes stuff up. Alternatively, we could change `n_features_in` and nout manually?
ok, I don't think it's really worth pickling the recurrence history length
I don't think my coverage thing is working; how do I check coverage with coveralls?
For coverage I usually use:

```
coverage run -m pysr test main,cli,startup
```

You need to install the `coverage` Python package.

If using VSCode I like to use Coverage Gutters: https://marketplace.visualstudio.com/items?itemName=ryanluker.vscode-coverage-gutters
> I think `PySRSequenceRegressor` might have to pickle recursive history length

This should be automatic though, right? `__getstate__` normally just saves all parameters from `init`, recursively calling `__getstate__` on all linked objects. No need for a custom method.
Also I just noticed – `environment.yml` should have the numpy version updated to 1.20 minimum
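The "automatic" pickling point above can be checked with a minimal sketch: the default pickle machinery round-trips ordinary instance attributes with no custom `__getstate__`. `Wrapper` here is a hypothetical stand-in, not PySR code:

```python
import pickle

class Wrapper:
    """Hypothetical stand-in storing the parameter as a plain attribute."""

    def __init__(self, recursive_history_length):
        self.recursive_history_length = recursive_history_length

# The attribute survives a pickle round-trip without any custom methods.
restored = pickle.loads(pickle.dumps(Wrapper(3)))
```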
@MilesCranmer I think things are good to go; can you review please?
*docstring
so the latex table outputs `x_{t-0}` now. wdyt?
Sorry for so many requests, but do you think you could add an example of using it to the documentation? Currently it is undocumented.
> so the latex table outputs `x_{t-0}` now. wdyt?
Why only `t-0` instead of (`t-1`, `t-2`, `t-3`, ...)?
I pushed a few changes to make the printing look nicer:
Let me know what you think.
Something is a bit weird about multiple features. I guess one issue with this is that right now multiple outputs are fit completely separately, rather than together as a single equation. I'm not sure how to work with this.... What do you think?
I think that although one equation would simplify things, there are so many recurrence relations with multiple equations that wouldn't work anymore, so I think we should keep multi-output
also the pretty printing is indeed pretty pretty; thanks
I think stuff breaks without `np.array(X)`
> I think stuff breaks without `np.array(X)`
Weird, normally scikit-learn's checking function should automatically convert to a numpy array. With the normal `PySRRegressor` you can pass lists or pandas DataFrames and they will get converted by scikit-learn to a numpy array. Is there a reason scikit-learn's conversion doesn't work here?
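The conversion described above can be seen directly with scikit-learn's `check_array` validation helper, which accepts a plain list and returns a numpy array:

```python
import numpy as np
from sklearn.utils import check_array

# A plain nested list comes back as a validated numpy ndarray.
X = check_array([[1.0], [2.0], [3.0]])
```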
yeah it works, my bad
but the examples look a bit weird with `X.append(X[-1][0] + X[-2][0])` - should I do that or `X = np.append(X, X[-1] + X[-2])`?
also there's a test failing in `TestDimensionalConstraints` and idk what's going on with it, but I think it's something to do with a `display_variable_names` parameter messing up positional arguments
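The two styles being asked about above can be sketched side by side on a Fibonacci-like sequence; both produce the same data, but the list version copies only once at the end, while `np.append` reallocates the array on every call:

```python
import numpy as np

# Style 1: build up a plain list, convert to an array once at the end.
X_list = [1.0, 1.0]
for _ in range(8):
    X_list.append(X_list[-1] + X_list[-2])
X_a = np.asarray(X_list)

# Style 2: np.append, which copies the whole array each iteration.
X_b = np.array([1.0, 1.0])
for _ in range(8):
    X_b = np.append(X_b, X_b[-1] + X_b[-2])
```

For a short documentation example either reads fine, but the list-append style is the more idiomatic numpy pattern when growing data in a loop.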
ok i stopped the error; is that the right way to fix it?
how do we fix the typing issue?? I'm stumped
yo all checks passed 🎉
@MilesCranmer anything else I need to do before merging?
Sorry for not reviewing the latest yet, got back from holiday recently and still catching up.
In the meantime you might consider some “dogfooding” (https://en.m.wikipedia.org/wiki/Eating_your_own_dog_food) of PySRSequenceRegressor on some real problems that you have in mind? Just by using it on some real problems you will be able to find ways of improving the API. Maybe look up some math sequences or simple physics problems and try it out, and see what breaks and what works well, and what parts of the API are easy and what parts are confusing. (This is how I develop PySR in general - I’m both a user and a dev!)
Anyways I hope to get to reviewing soon.
Resolve #94

Adds the ability to use symbolic regression on a 1D recurrent array. Added a new keyword argument to `PySRRegressor` called `recursive_history_length`. Must be at least 1, or it will throw a `ValueError`.

Here's an example of how it works: suppose you have the array `[1,2,3,4,5,6,7,8,9]` and `recursive_history_length = 3`. Then the feature array will be [image] and the target array will be [image], and it's the same from there.
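The windowing described in the example above can be sketched as follows, for `recursive_history_length = 3` on `[1, ..., 9]` (variable names are illustrative):

```python
import numpy as np

y = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
h = 3  # recursive_history_length

# Each feature row is a window of the previous h values...
features = np.array([y[i : i + h] for i in range(len(y) - h)])
# ...and each target is the value that follows its window.
targets = y[h:]
```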