Hi @ShuhuaGao,
Great question! This is actually (and perhaps surprisingly) an essential part of the quick evaluation speed of SymbolicRegression.jl: detecting these lets the search quickly throw out equations that blow up, which shrinks the search space. Infs and NaNs occur quite a bit when exploring functions like exp and pow with arbitrary arguments. But rather than being a problem, they are actually beneficial to the search if you treat them correctly.
My finding is that it's slower to write protected functions; it's better to detect blow-ups in the evaluation code, and throw out the equation that produced them. This seems to keep the population leaner.
Equations are evaluated in this package by traversing the trees using this function: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/1cd40f91cf0aae0dffd32faf12980d863a042a7a/src/EvaluateEquation.jl#L14-L34 (operators get fused to reduce memory allocation, hence all the different options, but there are really just three possibilities for degree-0, 1, 2 operators).
Each of these different function evaluators returns the type Tuple{AbstractVector{T}, Bool}. The Bool part is a flag that means a NaN or Inf was detected during computation. Once detected, the evaluator immediately breaks out of the evaluation loop and returns (array, false). This flag is returned to the evaluation of the parent node, which then also returns (array, false) before doing any computation, and so on.
The function calling evalTreeArray checks whether the evaluation completed, and if it hasn't, scores the equation with a very large loss, which essentially means it is thrown out of the next generation.
So basically this means much less computation, and I throw out unstable equations before they produce children.
Since I break immediately on detection, I never have the case where I am passing Inf or NaN to an operator - I always break before this happens.
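For illustration, here is a minimal sketch of that early-exit pattern (not the package's exact code; eval_unary and its signature are just assumptions):

# Minimal sketch: each evaluator returns (result, complete::Bool) and bails out
# as soon as a non-finite value appears; callers propagate the flag upward.
function eval_unary(op::F, x::AbstractVector{T})::Tuple{AbstractVector{T},Bool} where {F,T}
    out = similar(x)
    @inbounds for i in eachindex(x)
        out[i] = op(x[i])
        if !isfinite(out[i])       # catches both NaN and Inf
            return out, false      # stop immediately; the partial result is discarded
        end
    end
    return out, true
end

# A parent node then checks the flag before doing any of its own work:
#     (child, complete) = eval_unary(op, ...)
#     !complete && return (child, false)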
Does this help?
Btw, want to collaborate on whatever you are working on? Shoot me an email if you want to chat: miles.cranmer@gmail.com. The core algorithm of this package is in src/Mutate.jl. It takes in an equation and returns the updated/mutated one. Could the algorithm you have in mind be used in place of this code, or does it require completely different data structures?
Cheers, Miles
Thanks for the quick response. I am trying to implement "Cartesian genetic programming (CGP)", a variant of the tree-based GP. Symbolic regression is the most widely used application, and I am using it to test my implementation, though my final goal is not symbolic regression from X to Y but to learn a formula, e.g., to play a game. You may check a naive implementation in Python, which acts as a game AI for flappy bird, here.
Your SymbolicRegression.jl is excellent in terms of speed. Currently, I am still debugging and profiling my CGP implementation (not open-sourced yet). As for the evaluation, I took a different approach. I first translate the tree into a Julia function in its string form. For example, given a tree with root "+" and two leaves "x" and "y", I first produce the following function (as a string)
function f(x, y)
    return x + y
end
In general, each node corresponds to one line of code. Next, the function string is compiled into an executable function with Meta.parse and eval. I don't know whether you tried this method, but I guess it may be faster than traversing the tree each time the function is called, especially in a more general setting where you cannot simply vectorize - for example, when the function is called many times during a game until the game ends. Of course, I believe your approach is faster for the specific symbolic regression problem X --> Y thanks to ready vectorization. In my current workflow, however, each sample of X is processed separately. But it is possible to accept vectors as well, just like yours, by changing x+y above to x .+ y.
Let's come back to your reply above. Could you explain a few points further?
Once detected, the evaluator immediately breaks out of the evaluation loop, and returns (array, false).
Do you mean that the output of each node is checked and the evaluation terminates once a NaN or Inf appears?
Tuple{AbstractVector{T}, Bool}
I do not see where you check the Bool flag in the code you posted above.
My finding is that it's slower to write protected functions; it's better to detect blow-ups in the evaluation code, and throw out the equation that produced them
What exactly do you mean by "throw out"? Do you mean assigning the tree (i.e., individual) a very bad fitness so that it is unlikely to be selected to produce offspring? Or do you mean eliminating the individual from the population completely? My feeling is that, even if a tree generates NaN or Inf, it may still be useful after mutation or crossover. I assume that you are using Koza's classic GP.
My current implementation of CGP and symbolic regression using CGP is very slow (just finished two hours ago 😜). I will debug and profile the code in the next few days. I am not sure whether the slowness is caused by numerical issues like NaN (no special handling for now), by the Meta.parse & eval workflow, or by something else. Nonetheless, I agree that your handling of numerical issues is better than protected operations.
Nice, CGP is pretty interesting!! It uses gradients, right?
That project sounds really exciting! I'm interested to see how it goes. Btw you should check out this paper: https://astroautomata.com/paper/symbolic-neural-nets/. You basically learn a neural network on your problem first, then extract symbolic equations from the net second. I think this is an easy way to learn equations for general problems using a simple symbolic regressor, rather than needing to write more complex regressors.
Next, the function string is compiled into an executable function with Meta.parse and eval.
I actually tried this at one point. It was extremely slow: quite literally like 1,000,000x slower than traversing the tree using pre-compiled recursive functions. I think even pure Python & numpy will be faster than parsing expression trees like this, since Julia's compilation takes some time. Basically: Julia = slow compile + fast evaluation. Python = zero compile + medium evaluation. But if you need to re-compile things in Julia (= eval), you lose the benefits.
I definitely recommend taking a similar approach to this package! It is not too bad to write a simple version; here's a simplified version of how I evaluate:
function evalTree(X, tree)
    if tree.degree == 0
        # leaf node: either a constant or a feature column
        if tree.constant
            return tree.value
        end
        return X[tree.feature]
    elseif tree.degree == 1
        # unary node: evaluate the child, then apply the operator
        x = evalTree(X, tree.left)
        return unary_operators[tree.op](x)
    else
        # binary node: evaluate both children, then apply the operator
        x = evalTree(X, tree.left)
        y = evalTree(X, tree.right)
        return binary_operators[tree.op](x, y)
    end
end
Do you mean that the output of each node is checked and the evaluation terminates once a NaN or Inf appears?
Exactly!
Or do you mean to eliminate the individual from the population completely?
I am using regularized evolution with my own spin on it - I do simulated annealing as well. (I still need to write a paper or technical doc on how this all works... date TBD)
Basically it amounts to this:
So basically if there is a NaN or Inf, I just reject the new equation during step 4. It's the same as throwing it out completely. I'm not sure if this prevents the search from finding certain equations, but it seems to really improve the speed.
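As a rough illustration only (mutate_tree, score_loss, and accept_or_reject are hypothetical helpers, not the package's actual API), the rejection might look like this inside a single mutation step:

# Illustrative sketch, not the package's code.
function try_mutation(member, X, y, options)
    new_tree = mutate_tree(member.tree, options)               # propose a mutated equation
    prediction, completed = evalTreeArray(new_tree, X, options)
    if !completed                                               # a NaN/Inf was hit mid-evaluation
        return member                                           # reject outright: the old member survives
    end
    loss = score_loss(prediction, y, options)
    return accept_or_reject(member, new_tree, loss, options)    # annealing-style acceptance
end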
In the code in EvaluateEquation.jl, I check in the auxiliary functions rather than in the main if statement, e.g., here: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/50c2bd3cd325924cf748263271147b8ad10054f2/src/EvaluateEquation.jl#L55
Another option is to basically fork this package, change the data type from AbstractVector{T} to whatever datatype you need in your application, and change the operators to work on your specific data structure. Then you get to keep the same mutation functions, evolution, and distributed search code. The nice thing about Julia is that types are really easy to manipulate, so you can make the code very generic across different applications.
Nice, CGP is pretty interesting!! It uses gradients, right?
(Sorry, I was thinking about differentiable CGP, which is a different symbolic regression software.)
But in general I am also interested in adapting SymbolicRegression.jl to general programs rather than just expressions... Let me know if you want to collaborate on this. SymbolicRegression.jl doesn't use gradients anywhere, so in principle this should already be possible!
Thank you very much for the detailed explanation. It really helps. I guess the slow speed is due to eval, but I need profiling to confirm it. I have played with some Python packages before in which eval is used, but as you said: "Basically: Julia = slow compile + fast evaluation. Python = zero compile + medium evaluation."
You are right that the tree structure in your package can in principle be extended beyond symbolic regression. I will focus on CGP for now, because CGP seems to be simpler yet more flexible than the classic tree GP. As you mentioned already, it can be made differentiable. It is always a pain to learn numeric coefficients in general genetic programming. Of course, I believe it is also possible to make SymbolicRegression.jl differentiable w.r.t some numeric weights thanks to the powerful automatic differentiation infrastructure in Julia.
Another reason that I adopt the eval method is that I want to leverage the GPU with CUDA.jl; otherwise the tree itself has to be copied to the GPU for execution as well. A GPU is a massively parallel device and seems ideal for evolutionary computation even if we do not have access to a CPU cluster. You may also consider this direction in future development. If the interface AbstractVector{T} is used exclusively, then SymbolicRegression.jl should already support CuArray from CUDA.jl, which can accelerate large datasets.
I will play with CGP first for now. If the performance is satisfactory, I can provide it as another backend of your pysr. If not, it is a good idea to fork your repository for further development. (The goal is not only to solve a particular problem but also to practice my Julia skills; and that's why I want to write it from scratch 😃).
I never tried vector inputs/outputs in GP trees before. May I ask you some questions?
Does your code produce an intermediate variable (also a vector) for every node? We know that Julia can fuse "dot" operators like d = a .+ b .+ c. In a naive tree traversal, we may first compute temp = a .+ b and then d = temp .+ c. However, the allocation of a temporary vector temp may be expensive. How do you leverage this fusion capability, if at all, in the evaluation of a tree?
The operator at each node is vectorized by "dot", right? In that case, if m input variables in X have been picked by a tree, then we select the m-variable subset of X to feed into the tree, and the final output of the tree should also be m-dimensional, right? In this case, do you perform some extra reduction operation at the end to get a single-dimensional output? For example, something like scaled symbolic regression, which also helps to identify more accurate numeric coefficients. In plain words, GP does the feature engineering here, and we perform a linear regression on the transformed features to approximate the target variables Y.
Sounds good, look forward to seeing it.
Yes, CuArray <: AbstractArray (i.e., CuArray is a subtype of AbstractArray), so in principle a lot of the equation evaluation is already ready for GPUs. But I probably need to use dot notation instead of my manual loops so the CUDA kernels get generated properly.
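For example (assumed usage, not code from the package), the same dotted code path runs on the GPU because broadcasting over a CuArray generates the kernel automatically:

using CUDA

a = CUDA.rand(Float32, 10_000)
b = CUDA.rand(Float32, 10_000)
a .= max.(a, b)   # one fused GPU kernel; no hand-written kernel or manual loop needed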
For these other questions you should post on the Julia discourse since the questions are general to Julia and not specific to my package; you’ll probably get more helpful answers from people who know more than me!
Re: memory allocation, this is basically done via a .= op.(a, b) at each node. So you aren't creating new arrays, since you re-use a via the .= operator.
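A tiny standalone illustration of that pattern (not the package's exact code): even with the operator picked at run time, the fused in-place broadcast writes into the existing buffer instead of allocating a temporary.

binary_operators = (+, *, -, /)

a = rand(Float32, 1000)
b = rand(Float32, 1000)
op = binary_operators[2]   # operator chosen at run time (here, *)
a .= op.(a, b)             # fused broadcast: the result overwrites a, no temporary vector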
Btw; I got this package working on CUDA. Julia really makes this so easy, it's awesome. It took me like 10 minutes to get it working!
It's on the cuda branch if you want to check it out.
@MilesCranmer That's great. There are very few open-source packages that can work on CUDA. Yours may be the first one! It would be better to show performance benchmarking results in your documentation.
Let me know if you need access to the GPU CI servers.
Thanks @ChrisRackauckas; will do!
cc @ShuhuaGao Regarding the CUDA stuff, I think the best option is to batch over all trees at once in a population (e.g., for 1000 individuals, one could shuffle them, pick the best individual out of each group of 10, mutate those 100 trees, then update their scores all at once rather than consecutively). But this will be more work: to do this efficiently, I basically need to put every single tree into a binary heap format (i.e., assume every equation is a full binary tree up to the maximum depth, with separate heap arrays indexing operators, features, etc.), stack those into a matrix, and evaluate them all at once in a single kernel (using the operator array to pick which operator to use inside each CUDA thread, etc.). I think this is one way to get the full speed out of a GPU. But it will obviously be tricky to set up.
@MilesCranmer Thanks a lot. But note that
@ShuhuaGao - thanks; really useful info!
1. It's been a few years since I've done serious C++ CUDA programming, so correct me if I'm wrong. But note that since I know all the operators ahead of compilation time, I can compile all the operators into a single kernel with a small switch statement. IIRC the GPU is usually smart enough to deal with non-diverging branches like this (e.g., i == 1 ? cos(x) : sin(x)) in an efficient way, and it won't even be noticeable. It's only diverging branches, where you have very different behaviour in each branch, where performance gets hit and you need to start worrying about warps.

But if that doesn't work, I guess I can just run every single operator on the data (there are only about 10 operators at most, and a tiny amount of data, so this would be negligible on a modern GPU), mask out the operators not used, then sum.
2. You have the operators fixed, right? You could just compile a kernel when the user defines the operators, like:

function f(data, operator_index)
    i = threadIdx ...
    return operator_index[i] == 1 ? sin(data[i]) : cos(data[i])
end

right? Or just multiply: sin(data[i]) * (operator_index[i] == 1) + cos(data[i]) * (operator_index[i] == 2). This lets the GPU do its matrix magic, so I think this would be pretty fast. Thoughts on that?
Also check out this for another option: https://discourse.julialang.org/t/optimizing-cuda-jl-performance-for-small-array-operations/54808/4

3. The idea I mentioned above combines both data-wise and equation-wise parallelization. I can evaluate something like 100 x cores trees at once, so this is a good opportunity for more parallelism. The binary heap idea I described above should allow you to work on many trees at once with a single kernel.
Basically: let i index the tree, j index the node (in binary heap format), k index the data feature, and l index the data row. In this structure, j=1 is the root node, j=2 and j=3 are its children, j=4 and j=5 are the children of node 2, and so on.
Then, define OP[i, j] as the operator index at every node in every tree (and 0 where no operator exists). Define DEG[i, j] as the degree of every node in every tree, CONST[i, j] as the constant value at each node (if such a value exists, otherwise just 0), and FEAT[i, j] as the feature index to use. Let X[k, l] be your data array.
Then, you can basically do this all in a single kernel. Let each thread use OP and DEG to select the operator in the kernel (again, this is a tiny switch statement), then use the pre-compiled instruction for computation, with X[FEAT[i, j], :] and CONST[i, j] as the data.
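An illustrative sketch of that indexing (apply_op, the OUT scratch array, and the thread layout are assumptions, not existing code). In this sketch each thread sidesteps the depth-synchronization issue by evaluating one whole tree for one data row, walking the heap bottom-up:

# Illustrative only. OP, DEG, CONST, FEAT are the stacked heap arrays defined above;
# FEAT[i, j] == 0 is taken to mean "use the constant", and apply_op is a hypothetical
# switch over the pre-compiled operators.
function eval_heaps_kernel!(OUT, OP, DEG, CONST, FEAT, X, max_depth)
    i = blockIdx().x                        # which tree
    l = threadIdx().x                       # which data row
    for j in reverse(1:(2^max_depth - 1))   # children (2j, 2j+1) are filled before parent j
        if DEG[i, j] == 0
            OUT[i, j, l] = FEAT[i, j] == 0 ? CONST[i, j] : X[FEAT[i, j], l]
        elseif DEG[i, j] == 1
            OUT[i, j, l] = apply_op(OP[i, j], OUT[i, 2j, l])
        else
            OUT[i, j, l] = apply_op(OP[i, j], OUT[i, 2j, l], OUT[i, 2j + 1, l])
        end
    end
    return nothing
end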
The only tricky bit is getting the threads at each depth level to wait until their children have completed.
Thoughts on this?
See this answer too: https://stackoverflow.com/q/41009824/2689923.
Anyways, this kind of stuff will really have no effect on such a small dataset x number of trees. On any modern NVIDIA GPU, even running an additional 99 threads for every 1 operation here will not put it close to full utilization. Really just want to minimize the number of kernel launches and data transfers
Hi, @MilesCranmer, many thanks for the useful info. But I still do not understand how you can execute multiple trees simultaneously even with a fixed set of operations. In your example above
function f(data, operator_index)
i = threadIdx ...
return operator_index[i] == 1 ? sin(data[i]) : cos(data[i])
end
The above code will lead to branch divergence inside a warp. Say, for threads 1 to 32 in a warp, the instructions must operate in lock-step: while one thread executes sin, another thread cannot execute cos. Thus I think it essentially runs the 32 trees sequentially rather than in parallel. It is essentially equivalent to running 32 different functions, one per thread, in a single warp.
As for sin(data[i]) * (operator_index[i] == 1) + cos(data[i]) * (operator_index[i] == 2), there seems to be no divergence, but the cost is that each operator has to be computed even if it is not used, right? By contrast, if the compiler is smart enough to eliminate the multiplication by false, then we come back to the above case again.
The problem I am currently investigating looks like a reinforcement learning one: given an initial state s0, I try to evolve a decision-maker (i.e., a policy) pi that maps a state to an action in sequential decision making to get the final reward. That is, there is only one data sample, i.e., s0, but I need to parallelize the execution of multiple trees. It is easy to do this kind of parallelism on a CPU, but a normal CPU has only about 8 cores.
Your suggestion of a fixed set of operators and a switch-case seems to be a workaround for my above question on the Julia Discourse. My understanding is that we can execute different trees concurrently by assigning each tree to a warp (which occupies 32 threads), since different warps are not lock-stepped. Inside each warp, to make full use of the 32 threads, we may consider 32 initial states for my example above. That is, you have a tree (a policy), and this single tree plays 32 games in parallel inside a single warp. Different trees then run in different warps, in parallel as well.
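A rough sketch of that layout (run_policy! and the reward/state layout are assumptions, purely illustrative):

# One warp per tree/policy, one game (initial state) per lane within the warp.
function warp_per_tree_kernel!(rewards, trees, states)
    t    = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    warp = (t - 1) ÷ 32 + 1                  # which tree this warp evaluates
    lane = (t - 1) % 32 + 1                  # which of the 32 games this thread plays
    rewards[warp, lane] = run_policy!(trees[warp], states[lane])   # hypothetical rollout
    return nothing
end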
I need to emphasize this point again: I don't think any of this matters. Modern GPUs have such a ridiculous number of cores and are so highly optimized, that when writing kernels, you can usually completely ignore huge chunks of threads and just have the GPU discard them. For SymbolicRegression.jl, I don't expect one to ever fully utilize say a v100 GPU, even if you are doing things in a very redundant way; computing every single operator and then masking. But if you want to check, you could compare the performance of a kernel with this branch and without... I don't think it will be much different.
But back to this question about operators. Again I don't think it matters, but I am just curious about what you mean.
I guess what I am trying to understand is... why do you consider (o[i] == 1) ? sin(x[i]) : cos(x[i]) a performance-hurting branch divergence? This is such a simple conditional... Even the sin operation itself contains lookup tables for different domains of input... so why don't you consider computing sin a branch divergence too?
For example, this is an implementation of sin in C (http://www.netlib.org/fdlibm/k_sin.c):
#ifdef __STDC__
double __kernel_sin(double x, double y, int iy)
#else
double __kernel_sin(x, y, iy)
double x,y; int iy; /* iy=0 if y is zero */
#endif
{
double z,r,v;
int ix;
ix = __HI(x)&0x7fffffff; /* high word of x */
if(ix<0x3e400000) /* |x| < 2**-27 */
{if((int)x==0) return x;} /* generate inexact */
z = x*x;
v = z*x;
r = S2+z*(S3+z*(S4+z*(S5+z*S6)));
if(iy==0) return x+v*(S1+z*r);
else return x-((z*(half*y-v*r)-y)-v*S1);
}
but according to how you define a performance-hurting branching statement in your comment above, one can't even compute sin in the same warp, since it contains an if statement... So how can one argue that it is impossible to execute (o[i] == 1) ? sin(x[i]) : cos(x[i]) in a warp, if simple stdlib operators themselves also contain such simple branching tables?
Cheers, Miles
@MilesCranmer You are right that we generally cannot make full use of the modern GPU's computation power. I understand that sometimes branching is not avoidable and we can of course involve conditional statements in the kernel. My point is that I want to evolve one generation as quickly as possible rather than care about the utilization rate of the GPU.
Let's say you have 32 trees. Scheme A is to launch a kernel with 32 threads and evaluate each tree per thread. Scheme B is to launch a kernel with 32*32 threads and evaluate the trees in thread 1, 33, 65... etc.
function schemeB(trees)
    if threadIdx().x % 32 == 1
        # evaluate(...) stands in for the actual tree evaluation
        evaluate(trees[threadIdx().x ÷ 32 + 1])
    end
    return nothing
end
Though neither scheme makes full use of the GPU, I believe Scheme B is faster, because the trees are executed in parallel, each in a different warp (far less divergence, even if you consider internal details like sin), while Scheme A is slow because there is obviously a lot of divergence caused by the different instructions in the trees themselves. The bad part of Scheme B is that the remaining 31 threads in each warp actually do nothing; but as you emphasized, there are quite a lot of cores in a GPU, and it may still be enough as long as the number of individuals in GP is not too large.
Ah, I see what you mean. That is an interesting idea. I'm also not sure which scheme would be faster. But it would be easy to test this out.
But I think if you actually need to worry about warps and kernel tuning, that means you already have really, really good GPU code and are at high utilization. The things that will bottleneck you much, much earlier are: (1) lack of parallelism, and (2) data transfer. I think once those two are solved, then you can start worrying about warps... e.g., I think if I solve (1) with this binary heap idea, I'll get maybe a 1,000x speed improvement over the current method, which launches a kernel for each node of each tree. By that point I'll probably be CPU-bound again, and won't have to worry about warps anyway :P
@MilesCranmer We are on the same page now.
launching a kernel for each node of each tree
That is obviously inefficient if the dataset is small. Perhaps a custom kernel may help. Besides, how about generating a single expression for each tree, which I mentioned days ago? For example, first compiling the tree into f(x) = sin.(cos.(x) .+ x .- x.*x) can leverage Julia's fusion capability, and I guess only a single kernel is launched. Though eval is typically slow, it may still win due to expression fusion.
Yeah, custom kernel is probably the way to go. And a binary heap too; then I can load an array of trees as a 2D array of integers basically. Then vectorization is super easy, because I assume every tree has the same topology, and simply mask out parts of the tree which are undefined.
Generating an expression for each tree will be really really slow, unfortunately. I mentioned this earlier. That forces Julia to re-compile the kernel every single launch!! Basically: don't work with expressions. Work with a custom tree data type containing lightweight information that tells pre-compiled recursive functions what to do.
What I do is compile an evaluator function for every operator, indexed with function evaluator(..., ::Val{i}) where i. The ::Val{i} trick forces a specialized compilation for each operator, so things are fixed. Then I recursively call the evaluator function while traversing the tree. So there is no further compilation after the first few evaluations, and things are fast.
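A minimal sketch of that Val-indexed idea (illustrative names, not the package's exact code): when i is a compile-time constant, the operator lookup is resolved statically.

unaops = (sin, cos, exp)   # assumed operator tuple

function evaluator(x::AbstractVector, ::Val{i}, unaops) where {i}
    return unaops[i].(x)   # i is a type parameter, so unaops[i] is resolved at compile time
end

evaluator([0.0, 1.0], Val(2), unaops)   # applies cos elementwise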
Btw @ShuhuaGao - not sure if you meant this; did you mean using eval and Meta.parse once, or for every tree? Using them for every tree is really slow. But I think using them once at the beginning of a search (i.e., once the user declares the options) is actually a good idea - it's basically like writing code at runtime! So you could basically take the user-defined operators, generate a kernel for them, and compile it.
Example of my binary heap idea below. Interested to hear your thoughts.
Here's a tree:
julia> printTree(tree, options)
(cos((x1 * 3.0) - x3) + 2.0)
This tree is equal to this in array format:
EquationHeap([1, 1, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0], Float32[0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0], [0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0], [2, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0])
assuming the following struct:
mutable struct EquationHeap
    operators::Array{Int, 1}
    constants::Array{CONST_TYPE, 1}
    features::Array{Int, 1}   # 0 => use constant
    degree::Array{Int, 1}     # 0 for degree => stop the tree!
    EquationHeap(n) = new(zeros(Int, n), zeros(CONST_TYPE, n), ones(Int, n), zeros(Int, n))
end
and this choice of options:
julia> options = SymbolicRegression.Options(binary_operators=(+, *, -, /),
                                            unary_operators=(cos, exp),
                                            npopulations=30);
(which determines what index corresponds to what operator)
Here's a function to turn this tree into an EquationHeap object, which is just arrays:
function populate_heap!(heap::EquationHeap, i::Int, tree::Node)::Nothing
    if tree.degree == 0
        heap.degree[i] = 0
        if tree.constant
            heap.constants[i] = tree.val
            return
        else
            heap.features[i] = tree.feature
            return
        end
    elseif tree.degree == 1
        heap.degree[i] = 1
        heap.operators[i] = tree.op
        left = 2 * i
        populate_heap!(heap, left, tree.l)
        return
    else
        heap.degree[i] = 2
        heap.operators[i] = tree.op
        left = 2 * i
        right = 2 * i + 1
        populate_heap!(heap, left, tree.l)
        populate_heap!(heap, right, tree.r)
        return
    end
end

function heapify_tree(tree::Node, max_depth::Int)
    max_nodes = 2^(max_depth - 1) + 1
    heap = EquationHeap(max_nodes)
    populate_heap!(heap, 1, tree)
    return heap
end
Thus, we get an array form for an arbitrary equation. So we can just run through ~1000 equations in a single kernel call.
Tbh; would probably be even faster if we work with sparse arrays too.
@MilesCranmer Thanks for the sharing and sorry for my late reply (it is Chinese New Year holiday now).
As you know, the most time-consuming part of evolutionary computation is the evaluation. So my main concern is how to conduct evaluation on a GPU. In your example above, each tree is essentially encoded in an array (i.e., binary heap). These trees can be easily copied to the GPU for evaluation.
During the evaluation, how would you actually execute an operator? For example, given an op encountered, how do you get the underlying function, since we cannot use syntax like operators[op] (where operators is a list of functions) in a kernel? Should it be hardcoded in the kernel as follows?
if op == 3
    operator = sin
elseif op == 4
    operator = cos
...
If so, the set of available operators must be fixed, right? It seems that your python wrapper allows custom operators.
Also, in the most common case of symbolic regression, with an array of inputs X containing many samples, how do you avoid producing temporary arrays at each node during evaluation? Possibly by reusing preallocated arrays? I have no clear idea regarding these issues for now.
Overall, the idea of encoding GP trees into plain arrays is great for moving them between CPU and GPU. My previous naive implementation defines a struct Node that has a Function as one of its fields directly, but unfortunately such a Node cannot be sent to the GPU since it is not a bits type.
No worries; happy Chinese New Year :)
Should it be hardcoded in the kernel as follows?
Exactly like this. This is how SymbolicRegression.jl already works.
If so, the set of available operators must be fixed, right?
Yes. This is done already.
It seems that your python wrapper allows custom operators.
Yes. As does the Julia backend. There is no inconsistency between this statement and the one above. The Options struct is immutable, and changes type every time you change the functions: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/77cb208345f171d0db4ef3aacbbff5b3b554cfb1/src/Options.jl#L97-L100. This causes the evaluation code to specialize to a particular choice of operators (imagine the Julia compiler seeing the list of Options and then writing out that if statement manually), and to re-compile if you change the Options on your second run.
Possibly by reusing preallocated arrays?
Yeah, this is what SymbolicRegression.jl already does. This is really important for the fast evaluation speed.
My previous naive implementation defines a struct Node that has a Function as one of its fields directly, but unfortunately, such Node cannot be sent to GPU since it is not bits type.
Putting a Function into the Node makes the code a lot slower. I learned this from an earlier implementation.
I would just fork SymbolicRegression.jl and use the internal types! It's already got so many of these hard-fought optimization tricks that I found via experimentation, and once you have something working, it can be cross-compatible, so less rewritten code. I want to use general program synthesis too :)
Should it be hardcoded in the kernel as follows? Exactly like this. This is how SymbolicRegression.jl already works.
I do not find the hardcoded routine in your code. Possibly we have different meanings: I mean a really hardcoded if-else. For example, see the following function:
https://github.com/MilesCranmer/SymbolicRegression.jl/blob/77cb208345f171d0db4ef3aacbbff5b3b554cfb1/src/EvaluateEquation.jl#L101
there are lines like

op = options.unaops[op_idx]
...
x = op(cumulator[j])::T

which can be simplified to options.unaops[op_idx](arg). However, if I am not wrong, such syntax is not allowed in a CUDA kernel (see the discourse thread I posted several days ago).
Now let's try the Val trick.
using CUDA

function get_operator(::Val{1})
    return sin
end
function get_operator(::Val{2})
    return cos
end

function process(x::Float32)
    for i in (1, 2)
        # Val(i) with a non-constant i needs the Julia runtime, which GPU code cannot call
        op = get_operator(Val(i))
        x = op(x)
        @cuprintln("x = $x")
    end
    nothing
end
and launch the kernel
@cuda process(2.3f0)
Unfortunately, the above code snippet does not work on the GPU.
LoadError: InvalidIRError: compiling kernel process(Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_apply_type)
For now, I guess only the hardcoded kernel below can work.
function process_hard(x::Float32, i::Int)
    if i == 1
        op = CUDA.sin
    elseif i == 2
        op = CUDA.cos
    else
        op = identity
    end
    x = op(x)
    @cuprintln("x = $x")
    nothing
end
@cuda process_hard(2.3f0, 33);
There is a long if-else branch for a given operator set.
I do not find the hardcoded routine in your code. Possibly we have different meanings: I mean really hardcoded if-else.
This is hard-coded :) See this thread where I learned this technique: https://discourse.julialang.org/t/meta-programming-an-if-else-statement-of-user-defined-length/53525
Also, see this: https://discourse.julialang.org/t/optimizing-cuda-jl-performance-for-small-array-operations/54808/4 for selection functions. I think something about your example code is broken. You should be able to do this though.
But if you don't want to do those tricks, I think you can always generate a string that defines your function and pass it to eval(Meta.parse(...)). You obviously don't want to do that very often, but if you use it to set up your kernel (i.e., once per runtime), my guess is that it's okay.
See, here is a function I generate and compile:
(again; if you just do this once, it's fine! just not once for every tree)
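For illustration (this is just a sketch of the idea, not the actual generated function): since the operator list is known once the user sets the options, the if-else body can be generated and compiled a single time at runtime.

# Illustrative sketch: build the operator switch once from a user-chosen tuple, then compile it.
unaops = (sin, cos, exp)   # assumed user-chosen operators

branches = [:(op_idx == $i && return $f(x)) for (i, f) in enumerate(unaops)]
@eval function apply_unaop(op_idx::Int, x)
    $(branches...)
    return x               # fallback: identity
end

apply_unaop(2, 0.0f0)      # == cos(0.0f0) == 1.0f0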
Did you mean dispatching on the integer value with ::Val{i}? It is similar to the hardcoded version, but it does not work with CUDA unless i is a literal constant.
The second reference you listed is not quite relevant. There seem to be many differences between CPU and GPU. The core problem I don't understand is: given an integer i (e.g., one from the binary heap), how do we choose the proper operator in a CUDA kernel? A naive fs[i] is not allowed. The ::Val{i} trick does not work, either. The only clumsy workaround I know for now is to hardcode an if-else sequence.
For instance, supposing there are three operators sin, cos, identity, how would you launch a kernel like the following to apply the i-th operator?
function process(x::Float32, i::Int)
    op = ???
    x = op(x)
    @cuprintln("x = $x")
    nothing
end
@cuda process(3.2f0, 1)
Of course, generating the if-else kernel with eval is an option that increases flexibility and only needs to be done once before evolution. This is also exactly what I am considering.
Hm, yeah, I guess just going with the eval trick is the best bet then! To be honest, I think it's even easier to interpret than using Val{i}, and also more extensible.
If you get a kernel working for this using heaps, want to submit it to the cuda branch? :)
Hi, @MilesCranmer , sure, will let you know. We have a master student working on this now.
Hi, @MilesCranmer. This is not an issue but I am writing to ask for your help on some design considerations.
In symbolic regression code, we usually use protected versions of primitive functions to avoid numerical errors like division by zero. Of course, division by zero is allowed in Julia and the NaN value can be propagated. I looked into your Operators.jl and found that most functions are not protected. I am not sure whether you handle potential errors somewhere else. Consider the following case using functions in Operators.jl, which should throw an error as follows.
Also, the exp function may produce huge values or even Inf if the input is a large number, which can affect numerical stability. Currently, I am also working on symbolic regression, but using a different genetic programming approach. It would be helpful if you could discuss how you handle such errors.