Closed: MilesCranmer closed this issue 6 months ago
It pretty much depends on how often users use `hall_of_fame` separately:

```julia
hall_of_fame, state = equation_search(dataset; options, return_state=true)
dominating = calculate_pareto_curve(hall_of_fame, dataset, options)
```
According to you:

> So it would be easier for the user to query. More importantly, I would only return the dominating Pareto curve, rather than the entire hall of fame (I doubt anybody wants the entire curve anyways).
so it seems we should just have:

```julia
dataframe, state = equation_search(dataset; options, return_state=true)
```
Side note: I don't like `return_state=true`; it feels like a Python thing where the number of returned objects depends on a run-time value. Given it's an end-user function, it doesn't have much performance implication, so it's just a matter of taste.
I strongly recommend NOT depending on DataFrames.jl; it's unnecessarily heavy. Returning a `Dict`, a `NamedTuple`, or something like Tables.jl / StructArrays.jl would be fine, and it's trivial to pack into a `DataFrame` later.
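For illustration, a `Vector{NamedTuple}` with uniform keys already satisfies the Tables.jl row-table interface, so converting to a `DataFrame` later is a one-liner. A quick sketch (the column names follow the `[:equation, :complexity, :loss, :score]` idea from the issue; everything else is hypothetical):

```julia
# Hypothetical search output as a plain vector of named tuples.
# A Vector of NamedTuples with uniform keys is already a valid
# Tables.jl "row table" -- no DataFrames.jl dependency needed.
results = [
    (equation = "x1",           complexity = 1, loss = 0.5, score = 0.1),
    (equation = "cos(x1) * x2", complexity = 4, loss = 0.1, score = 0.9),
]

# Queries work without any heavy dependency:
best = argmax(r -> r.score, results)

# Users who *do* want DataFrames can convert trivially:
#   using DataFrames
#   df = DataFrame(results)
```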
Both are good ideas. For MLJ you just want to make an interface package and register it with the MLJ ecosystem; I don't know about the SciML convention.
Good tips, thanks! Yes perhaps that is best.
e.g., one could return a single object `result::ResultType` that includes everything: `.equations` would be a Tables.jl object of the Pareto frontier, `.state` would be the search state, `.options` would be a copy of the search options, and `.best` would be the best expression found (using a similar default as in PySR, combining accuracy and complexity). Perhaps you could call `result(X, 5)` to compute the predictions of the 5th expression on the dataset, and `plot(result, X, y)` to generate some nice default plots of the Pareto frontier.

More importantly, printing `result` would indicate these different fields in a nicely formatted output, so the user doesn't need to read the API page.
Then, one could pass either `result` or `result.state` back to `equation_search` to continue where it left off. (And perhaps it could just read the `options` from there, or accept new `options` if the hyperparameters are compatible.)
Then, there could be new lightweight frontends for MLJ and SciML.
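A minimal sketch of what such a result type could look like (all names here are hypothetical, not the actual implementation; each frontier entry is assumed to carry a callable `f`):

```julia
# Hypothetical container bundling everything from a search.
struct SRResult{E,S,O,B}
    equations::E  # Tables.jl-compatible Pareto frontier
    state::S      # search state, for warm restarts
    options::O    # copy of the options used
    best::B       # best expression by a combined accuracy/complexity score
end

# Make the result callable: evaluate the i-th frontier expression on X.
(r::SRResult)(X, i::Int) = r.equations[i].f(X)

# Pretty-print the available fields so users needn't read the API docs.
function Base.show(io::IO, r::SRResult)
    println(io, "SRResult with fields:")
    println(io, "  .equations  ($(length(r.equations)) expressions)")
    println(io, "  .state")
    println(io, "  .options")
    print(io,   "  .best       ($(r.best))")
end
```

`equation_search` could then accept either the whole `SRResult` or just `result.state` to continue a search.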
I am leaning towards an MLJ-style interface. I think the statefulness of the Regressor objects is nice for warm starts, and would be nice for plotting diagnostic info.
This might take the form of some sort of extension package that would load if users also import MLJ.jl.
I wonder if it should come with both a SciML interface (via ModelingToolkit.jl?), and an MLJ one. And the base interface defines internal types for an MLJ-style model setup.
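To illustrate why the statefulness is nice for warm starts, here is a toy mock of an MLJ-style stateful regressor (purely illustrative; not the real interface, and the names are made up):

```julia
# Toy mock: a regressor that stores its search state internally,
# so calling fit! again continues rather than restarts.
mutable struct MockSRRegressor
    niterations::Int
    state::Union{Nothing,Int}  # stands in for the real search state
end
MockSRRegressor(; niterations = 10) = MockSRRegressor(niterations, nothing)

function fit!(m::MockSRRegressor)
    # Resume from the stored state if present (warm start), else start fresh.
    start = m.state === nothing ? 0 : m.state
    m.state = start + m.niterations
    return m
end

m = MockSRRegressor(niterations = 10)
fit!(m)
fit!(m)  # continues from the stored state: 20 iterations total
```

The user never has to carry a separate `state` object around; the regressor carries it.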
Drafted the following `Base.show` method for `Options`. I think it looks much better:
```
Options:
├── Search Space:
│   ├── Unary operators: [cos, sin]                                          # unary_operators
│   ├── Binary operators: [+, *, /, -]                                       # binary_operators
│   ├── Max size of equations: 20                                            # maxsize
│   └── Max depth of equations: 20                                           # maxdepth
├── Search Size:
│   ├── Cycles per iteration: 550                                            # ncycles_per_iteration
│   ├── Number of populations: 15                                            # npopulations
│   └── Size of each population: 33                                          # npop
├── The Objective:
│   ├── Elementwise loss function: L2DistLoss                                # elementwise_loss
│   └── Full loss function (if any): nothing                                 # loss_function
├── Selection:
│   ├── Expressions per tournament: 12                                       # tournament_selection_n
│   └── p(tournament winner=best expression): 0.86                           # tournament_selection_p
├── Migration:
│   ├── Migrate equations: true                                              # migration
│   ├── Migrate hall of fame equations: true                                 # hof_migration
│   ├── p(replaced) during migration: 0.00036                                # fraction_replaced
│   ├── p(replaced) during hof migration: 0.035                              # fraction_replaced_hof
│   └── Migration candidates per population: 12                              # topn
├── Complexities:
│   ├── Parsimony factor: 0.0032                                             # parsimony
│   ├── Complexity of each operator: [+=>1, *=>1, /=>1, -=>1, cos=>1, sin=>5] # complexity_of_operators
│   ├── Complexity of constants: [1]                                         # complexity_of_constants
│   ├── Complexity of variables: [1]                                         # complexity_of_variables
│   ├── Slowly increase max size: 0.0                                        # warmup_maxsize_by
│   ├── Use adaptive parsimony: true                                         # use_frequency
│   ├── Use adaptive parsimony in tournament: true                           # use_frequency_in_tournament
│   ├── Adaptive parsimony scaling factor: 20.0                              # adaptive_parsimony_scaling
│   └── Simplify equations: true                                             # should_simplify
```
When you see this in a REPL, the comments are printed in a light grey color.
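The grey comments can be produced with `printstyled`, which falls back to plain text when the IO doesn't support color. A minimal sketch of printing one such tree line (function name is made up):

```julia
# Print one line of the Options tree, with the field-name comment
# rendered in light grey on color-capable terminals.
function print_option_line(io::IO, prefix, label, value, fieldname)
    print(io, prefix, label, ": ", value, " ")
    printstyled(io, "# ", fieldname; color = :light_black)
    println(io)
end

print_option_line(stdout, "│   ├── ", "Max size of equations", 20, "maxsize")
```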
My 2c: an option-2-style interface using a Tables.jl-compatible form (e.g. `Vector{NamedTuple}`) would be my preference.
Regarding returning the Pareto curve or the entire HoF, how about having something like this:

```julia
hall_of_fame, state = equation_search_full(dataset, options)
dominating, state = equation_search(hall_of_fame, dataset, options)
dominating, state = equation_search(dataset, options)
```

with two `equation_search` methods, one of which is simply

```julia
equation_search(dataset, options) =
    equation_search(equation_search_full(dataset, options), dataset, options)
```
Alternatively, as has already come up, `x, state` could be replaced with some sort of `Result` structure. Then one could have the very simple:

```julia
full_result = equation_search_full(dataset, options)
dominating_result = equation_search(full_result)
dominating_result = equation_search(dataset, options)
```

Actually, if you made the `Result` structure iterable, you could support both of these usage patterns simultaneously.
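Supporting tuple-style destructuring only takes a couple of `Base.iterate` methods; a sketch with hypothetical names:

```julia
# Hypothetical result type that supports both styles:
#   res = equation_search(...)                 # access fields directly
#   dominating, state = equation_search(...)   # destructure like a 2-tuple
struct SearchResult{D,S}
    dominating::D
    state::S
end

# Iteration protocol: yield the fields in order, so `a, b = res` works.
Base.iterate(r::SearchResult) = (r.dominating, Val(:state))
Base.iterate(r::SearchResult, ::Val{:state}) = (r.state, Val(:done))
Base.iterate(::SearchResult, ::Val{:done}) = nothing

res = SearchResult([:eq1, :eq2], "search state")
dominating, state = res  # destructuring works via iterate
```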
This sounds like maybe it could be good as a package extension to DataDrivenDiffEq.
This sounds like it could be a good package extension to have here.
I don't think MLJ interface packages make as much sense now that we have package extensions.
Speaking of 4., I have an attempt here: #226. Indeed I think it makes the most sense to put it in an extension.
I like your ideas for 1-2. I’ll think more about this.
Moving to mid-importance now that the MLJ interface has matured. Remaining API changes would be to improve the low-level interface.
(Finished a while ago)
Nice!
In either version 0.16 or 0.17 of SymbolicRegression.jl, I would like to do a big API overhaul, to make the package easier to use on the Julia side. I think the current API has stuck around too long and needs to be cleaned up a bit. The PySR frontend is a bit more developed and users seem to find it easy to use, so I'd like to do the same thing here.

I have a few different ideas, but I'm interested in hearing opinions from all users with interests in this package, as I'm not sure of the best way forward. The API should make SymbolicRegression.jl: (1) easier to use, and (2) easier to interface with other tools.
1. Maintain API, with a few tweaks
This is basically just renaming the current API to PascalCase for types, snake_case for functions/parameters. This would return a `HallOfFame` object, and a `state` that the user can pass back to `equation_search` to continue the search. However, I'm not a big fan of this because it requires the user to go figure out the different types and make a few different calls which they would need to find from the docs.

2. Return a DataFrames.DataFrame object

This is similar to 1, but the returned object would be a `DataFrame` object from the DataFrames.jl package, with columns `[:equation, :complexity, :loss, :score]`. So it would be easier for the user to query. More importantly, I would only return the dominating Pareto curve, rather than the entire hall of fame (I doubt anybody wants the entire curve anyways). The user could sort and query this object as they please. It's not too much of an API change and could make it easier for users to use.

3. Tighten interface with DataDrivenDiffEq.jl

It might be nice to have a tighter interface with DataDrivenDiffEq.jl, which has its own frontend for SymbolicRegression.jl (https://docs.sciml.ai/DataDrivenDiffEq/stable/libs/datadrivensr/examples/example_01/), as well as some other algorithms. e.g., this could look like:

I'm not sure whether it makes sense to integrate with the API on the side of SymbolicRegression.jl though; maybe it's simpler to just have a simple and fully-general core API that others like DataDrivenDiffEq.jl can use in their unified APIs.
4. Integrate with MLJ.jl
It has been nice to integrate PySR with scikit-learn, as it lets users stick it into existing sklearn tuning pipelines. Maybe MLJ.jl is the Julia version of that? e.g., something like

This might be nice to take advantage of the API developed in PySR, where, e.g., `m(X, 2)` would get predictions from the 2nd equation. It might also make it easier for users to restart fits, as they wouldn't need to move around a separate `state` object. I guess due to the similarities with scikit-learn, it might feel more automatic to users as well?

In general I think it's preferable to make it easy for users to look at the output equations and plot them (this is the major difference between symbolic regression and typical ML algorithms). Maybe some kind of `plot(m::FittedSRRegressor, X, y)` would be nice for plotting the different equations.

Please let me know what you think and any suggestions.
cc'ing anyone who might be interested. I am eager to hear your ideas! @AlCap23 @ChrisRackauckas @kazewong @johanbluecreek @CharFox1 @Jgmedina95 @Patrick-Kidger @Moelf @qwertyjl @Remotion @anicusan