Closed: MilesCranmer closed this issue 6 months ago
It pretty much depends on how often users use `hall_of_fame` separately:

```julia
hall_of_fame, state = equation_search(dataset; options, return_state=true)
dominating = calculate_pareto_curve(hall_of_fame, dataset, options)
```
According to you:

> So it would be easier for the user to query. More importantly, I would only return the dominating Pareto curve, rather than the entire hall of fame (I doubt anybody wants the entire curve anyways).
so it seems we should just have:

```julia
dataframe, state = equation_search(dataset; options, return_state=true)
```
Side note: I don't like `return_state=true`; it feels like a Python thing where the number of returned objects depends on a run-time value. Given it's an end-user function, it doesn't have much performance implication, so it's just a matter of taste.
I strongly recommend NOT depending on DataFrames.jl; it's unnecessarily heavy. Returning a `Dict`, a `NamedTuple`, or something like Tables.jl / StructArrays.jl would be fine, and it's trivial to pack into a `DataFrame` later.
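For illustration, a `Vector{NamedTuple}` with uniform keys already satisfies the Tables.jl row-table interface, so converting to a `DataFrame` later is a one-liner. A quick sketch (the column names follow the `[:equation, :complexity, :loss, :score]` idea from the issue; everything else is hypothetical):

```julia
# Hypothetical search output as a plain vector of named tuples.
# A Vector of NamedTuples with uniform keys is already a valid
# Tables.jl "row table" -- no DataFrames.jl dependency needed.
results = [
    (equation = "x1",           complexity = 1, loss = 0.5, score = 0.1),
    (equation = "cos(x1) * x2", complexity = 4, loss = 0.1, score = 0.9),
]

# Queries work without any heavy dependency:
best = argmax(r -> r.score, results)

# Users who *do* want DataFrames can convert trivially:
#   using DataFrames
#   df = DataFrame(results)
```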
Both are good ideas. For MLJ you just want to make an interface package and register it with the MLJ ecosystem; I don't know about the SciML convention.
Good tips, thanks! Yes perhaps that is best.
e.g., one could return a single object `result::ResultType` that includes everything: `.equations` would be a Tables.jl object of the Pareto frontier, `.state` would be the search state, `.options` would be a copy of the search options, and `.best` would be the best expression found (using a similar default as in PySR, combining accuracy and complexity). Perhaps you could call `result(X, 5)` to compute the predictions of the 5th expression on the dataset, and `plot(result, X, y)` to generate some nice default plots of the Pareto frontier.

More importantly, printing `result` would indicate these different fields in a nicely formatted output, so the user doesn't need to read the API page.
Then, one could pass either `result` or `result.state` back to `equation_search` to continue where it left off. (And perhaps it could just read the `options` from there, or accept new `options` if the hyperparameters are compatible.)
Then, there could be new lightweight frontends for MLJ and SciML.
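A minimal sketch of what such a result type could look like (all names here are hypothetical, not the actual implementation; each frontier entry is assumed to carry a callable `f`):

```julia
# Hypothetical container bundling everything from a search.
struct SRResult{E,S,O,B}
    equations::E  # Tables.jl-compatible Pareto frontier
    state::S      # search state, for warm restarts
    options::O    # copy of the options used
    best::B       # best expression by a combined accuracy/complexity score
end

# Make the result callable: evaluate the i-th frontier expression on X.
(r::SRResult)(X, i::Int) = r.equations[i].f(X)

# Pretty-print the available fields so users needn't read the API docs.
function Base.show(io::IO, r::SRResult)
    println(io, "SRResult with fields:")
    println(io, "  .equations  ($(length(r.equations)) expressions)")
    println(io, "  .state")
    println(io, "  .options")
    print(io,   "  .best       ($(r.best))")
end
```

`equation_search` could then accept either the whole `SRResult` or just `result.state` to continue a search.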
I am leaning towards an MLJ-style interface. I think the statefulness of the Regressor objects is nice for warm starts, and would be nice for plotting diagnostic info.
This might take the form of some sort of extension package that would load if users also import MLJ.jl.
I wonder if it should come with both a SciML interface (via ModelingToolkit.jl?), and an MLJ one. And the base interface defines internal types for an MLJ-style model setup.
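To illustrate why the statefulness is nice for warm starts, here is a toy mock of an MLJ-style stateful regressor (purely illustrative; not the real interface, and the names are made up):

```julia
# Toy mock: a regressor that stores its search state internally,
# so calling fit! again continues rather than restarts.
mutable struct MockSRRegressor
    niterations::Int
    state::Union{Nothing,Int}  # stands in for the real search state
end
MockSRRegressor(; niterations = 10) = MockSRRegressor(niterations, nothing)

function fit!(m::MockSRRegressor)
    # Resume from the stored state if present (warm start), else start fresh.
    start = m.state === nothing ? 0 : m.state
    m.state = start + m.niterations
    return m
end

m = MockSRRegressor(niterations = 10)
fit!(m)
fit!(m)  # continues from the stored state: 20 iterations total
```

The user never has to carry a separate `state` object around; the regressor carries it.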
Drafted the following `Base.show` method for `Options`. I think it looks much better:
```
Options:
├── Search Space:
│   ├── Unary operators: [cos, sin]                                          # unary_operators
│   ├── Binary operators: [+, *, /, -]                                       # binary_operators
│   ├── Max size of equations: 20                                            # maxsize
│   └── Max depth of equations: 20                                           # maxdepth
├── Search Size:
│   ├── Cycles per iteration: 550                                            # ncycles_per_iteration
│   ├── Number of populations: 15                                            # npopulations
│   └── Size of each population: 33                                          # npop
├── The Objective:
│   ├── Elementwise loss function: L2DistLoss                                # elementwise_loss
│   └── Full loss function (if any): nothing                                 # loss_function
├── Selection:
│   ├── Expressions per tournament: 12                                       # tournament_selection_n
│   └── p(tournament winner=best expression): 0.86                           # tournament_selection_p
├── Migration:
│   ├── Migrate equations: true                                              # migration
│   ├── Migrate hall of fame equations: true                                 # hof_migration
│   ├── p(replaced) during migration: 0.00036                                # fraction_replaced
│   ├── p(replaced) during hof migration: 0.035                              # fraction_replaced_hof
│   └── Migration candidates per population: 12                              # topn
├── Complexities:
│   ├── Parsimony factor: 0.0032                                             # parsimony
│   ├── Complexity of each operator: [+=>1, *=>1, /=>1, -=>1, cos=>1, sin=>5] # complexity_of_operators
│   ├── Complexity of constants: [1]                                         # complexity_of_constants
│   ├── Complexity of variables: [1]                                         # complexity_of_variables
│   ├── Slowly increase max size: 0.0                                        # warmup_maxsize_by
│   ├── Use adaptive parsimony: true                                         # use_frequency
│   ├── Use adaptive parsimony in tournament: true                           # use_frequency_in_tournament
│   ├── Adaptive parsimony scaling factor: 20.0                              # adaptive_parsimony_scaling
│   └── Simplify equations: true                                             # should_simplify
```
When you see this in a REPL, the comments are printed in a light grey color.
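The grey comments can be produced with `printstyled`, which falls back to plain text when the IO doesn't support color. A minimal sketch of printing one such tree line (function name is made up):

```julia
# Print one line of the Options tree, with the field-name comment
# rendered in light grey on color-capable terminals.
function print_option_line(io::IO, prefix, label, value, fieldname)
    print(io, prefix, label, ": ", value, " ")
    printstyled(io, "# ", fieldname; color = :light_black)
    println(io)
end

print_option_line(stdout, "│   ├── ", "Max size of equations", 20, "maxsize")
```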
My 2c: an option-2-style interface using a Tables.jl-compatible form (e.g. `Vector{NamedTuple}`) would be my preference.
Regarding returning the Pareto curve or the entire HoF, how about having something like this:

```julia
hall_of_fame, state = equation_search_full(dataset, options)
dominating, state = equation_search(hall_of_fame, dataset, options)
dominating, state = equation_search(dataset, options)
```

with two `equation_search` methods, one of which is simply

```julia
equation_search(dataset, options) =
    equation_search(equation_search_full(dataset, options), dataset, options)
```
Alternatively, as has already come up, `x, state` could be replaced with some sort of `Result` structure. Then one could have the very simple:

```julia
full_result = equation_search_full(dataset, options)
dominating_result = equation_search(full_result)
dominating_result = equation_search(dataset, options)
```

Actually, if you made the `Result` structure iterable, you could support both of these usage patterns simultaneously.
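Supporting tuple-style destructuring only takes a couple of `Base.iterate` methods; a sketch with hypothetical names:

```julia
# Hypothetical result type that supports both styles:
#   res = equation_search(...)                 # access fields directly
#   dominating, state = equation_search(...)   # destructure like a 2-tuple
struct SearchResult{D,S}
    dominating::D
    state::S
end

# Iteration protocol: yield the fields in order, so `a, b = res` works.
Base.iterate(r::SearchResult) = (r.dominating, Val(:state))
Base.iterate(r::SearchResult, ::Val{:state}) = (r.state, Val(:done))
Base.iterate(::SearchResult, ::Val{:done}) = nothing

res = SearchResult([:eq1, :eq2], "search state")
dominating, state = res  # destructuring works via iterate
```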
This sounds like maybe it could be good as a package extension to DataDrivenDiffEq.
This sounds like it could be a good package extension to have here.
I don't think MLJ interface packages make as much sense now that we have package extensions.
Speaking of 4., I have an attempt here: #226. Indeed I think it makes the most sense to put it in an extension.
I like your ideas for 1-2. I’ll think more about this.
Moving to mid-importance now that the MLJ interface has matured. Remaining API changes would be to improve the low-level interface.
(Finished a while ago)
Nice!
In either version 0.16 or 0.17 of SymbolicRegression.jl, I would like to do a big API overhaul, to make the package easier to use on the Julia side. I think the current API has stuck around too long and needs to be cleaned up a bit. The PySR frontend is a bit more developed and users seem to find it easy to use, so I'd like to do the same thing here.

I have a few different ideas, but I'm interested in hearing opinions from all users with interests in this package, as I'm not sure of the best way forward. The API should make SymbolicRegression.jl: (1) easier to use, and (2) easier to interface with other tools.
1. Maintain API, with a few tweaks
This is basically just renaming the current API to PascalCase for types, snake_case for functions/parameters. This would return a `HallOfFame` object, and a `state` that the user can pass back to `equation_search` to continue the search. However, I'm not a big fan of this because it requires the user to go figure out the different types and make a few different calls which they would need to find from the docs.

2. Return a DataFrames.DataFrame object

This is similar to 1, but the returned object would be a `DataFrame` object from the DataFrames.jl package, with columns `[:equation, :complexity, :loss, :score]`. So it would be easier for the user to query. More importantly, I would only return the dominating Pareto curve, rather than the entire hall of fame (I doubt anybody wants the entire curve anyways). The user could sort and query this object as they please. It's not too much of an API change and could make it easier for users to use.

3. Tighten interface with DataDrivenDiffEq.jl

It might be nice to have a tighter interface with DataDrivenDiffEq.jl, which has its own frontend for SymbolicRegression.jl (https://docs.sciml.ai/DataDrivenDiffEq/stable/libs/datadrivensr/examples/example_01/), as well as some other algorithms. e.g., this could look like:

I'm not sure whether it makes sense to integrate with the API on the side of SymbolicRegression.jl though; maybe it's simpler to just have a simple and fully-general core API that others like DataDrivenDiffEq.jl can use in their unified APIs.
4. Integrate with MLJ.jl
It has been nice to integrate PySR with scikit-learn, as it lets users stick it into existing sklearn tuning pipelines. Maybe MLJ.jl is the Julia version of that? e.g., something like

This might be nice to take advantage of the API developed in PySR, where, e.g., `m(X, 2)` would get predictions from the 2nd equation. It might also make it easier for users to restart fits, as they wouldn't need to move around a separate `state` object. I guess due to the similarities with scikit-learn, it might feel more automatic to users as well?

In general I think it's preferable to make it easy for users to look at the output equations and plot them (this is the major difference between symbolic regression and typical ML algorithms). Maybe some kind of `plot(m::FittedSRRegressor, X, y)` would be nice for plotting the different equations.

Please let me know what you think and any suggestions.
cc'ing anyone who might be interested. I am eager to hear your ideas! @AlCap23 @ChrisRackauckas @kazewong @johanbluecreek @CharFox1 @Jgmedina95 @Patrick-Kidger @Moelf @qwertyjl @Remotion @anicusan