MilesCranmer / SymbolicRegression.jl

Distributed High-Performance Symbolic Regression in Julia
https://ai.damtp.cam.ac.uk/symbolicregression/
Apache License 2.0
644 stars 86 forks source link

Evaluating trees with features not in dataset #133

Open Jgmedina95 opened 2 years ago

Jgmedina95 commented 2 years ago

Hi, im working on a process that uses the equation returned in the Pareto Frontier. I was playing around with it and found the following: Setting a tree like: image and evaluating with a dataset with 6 features, gives no problem, as expected. image

but if i create a tree that uses more features than in the original dataset like: image

I expected an error, but the function eval_tree_array gave an output and completion ==true. image I've realized this will not impact what I'm doing as trees made by the program will never (right?) have more features than the dataset, but I suppose its interesting enough of a bug to share.

MilesCranmer commented 2 years ago

Hi @Jgmedina95,

Thanks for raising this! So this should be because @inbounds is used throughout the evaluation to have faster compute. I make an in-bounds assumption since trees made during the search should never have different features than available in the dataset. However, as you raise, I don't think it hurts to check before running the evaluation, so maybe we could do that. (it could be that the user could save their search state, then re-run with a different dataset, and be surprised when trees access undefined memory?)

I just looked at the code, and it seems like sometimes @inbounds is actually not used, such as: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/6075f13c3e8fbb8b16686a6f7c1157f0235174ee/src/EvaluateEquation.jl#L163

Whereas, @inbounds is used in the functions which fuse operators together, like here: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/6075f13c3e8fbb8b16686a6f7c1157f0235174ee/src/EvaluateEquation.jl#L189

So perhaps for some trees, this would raise an error, and sometimes it would not. Will think more whether we should add a bounds check.

Best, Miles