Hi @ShuhuaGao,
Great question! This is actually (and perhaps surprisingly) an essential part of the quick evaluation speed of SymbolicRegression.jl: detecting these lets the search quickly throw out equations that blow up, which shrinks the search space. Infs and NaNs occur quite a bit when exploring functions like exp and pow with arbitrary arguments. But rather than being a problem, they are actually beneficial to the search if you treat them correctly.
My finding is that it's slower to write protected functions; it's better to detect blow-ups in the evaluation code, and throw out the equation that produced them. This seems to keep the population leaner.
Equations are evaluated in this package by traversing the trees using this function: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/1cd40f91cf0aae0dffd32faf12980d863a042a7a/src/EvaluateEquation.jl#L14-L34 (operators get fused to reduce memory allocation, hence all the different options, but there are really just three possibilities for degree-0, 1, 2 operators).
Each of these different function evaluators returns the type Tuple{AbstractVector{T}, Bool}. The Bool part is a flag that means a NaN or Inf was detected during computation. Once detected, the evaluator immediately breaks out of the evaluation loop and returns (array, false). This flag is returned to the evaluation of the parent node, which then also returns (array, false) before doing any computation, and so on.
The function calling evalTreeArray checks whether the evaluation completed, and if it hasn't, scores the equation with a very large loss, which essentially means it is thrown out of the next generation.
So basically this means much less computation, and I throw out unstable equations before they produce children.
Since I break immediately on detection, I never have the case where I am passing Inf or NaN to an operator - I always break before this happens.
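For illustration, here is a minimal sketch of that early-exit pattern (not the package's exact code; eval_unary and its signature are just assumptions):

# Minimal sketch: each evaluator returns (result, complete::Bool) and bails out
# as soon as a non-finite value appears; callers propagate the flag upward.
function eval_unary(op::F, x::AbstractVector{T})::Tuple{AbstractVector{T},Bool} where {F,T}
    out = similar(x)
    @inbounds for i in eachindex(x)
        out[i] = op(x[i])
        if !isfinite(out[i])       # catches both NaN and Inf
            return out, false      # stop immediately; the partial result is discarded
        end
    end
    return out, true
end

# A parent node then checks the flag before doing any of its own work:
#     (child, complete) = eval_unary(op, ...)
#     !complete && return (child, false)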
Does this help?
Btw, want to collaborate on whatever you are working on? Shoot me an email if you want to chat: miles.cranmer@gmail.com. The core algorithm of this package is in src/Mutate.jl. It takes in an equation and returns the updated/mutated one. Could the algorithm you have in mind be used in place of this code, or does it require completely different data structures?
Cheers, Miles
Thanks for the quick response. I am trying to implement "Cartesian genetic programming (CGP)", a variant of the tree-based GP. Symbolic regression is the most widely used application, and I am using it to test my implementation, though my final goal is not symbolic regression from X to Y but to learn a formula, e.g., to play a game. You may check a naive implementation in Python, which acts as a game AI for flappy bird, here.
Your SymbolicRegression.jl is excellent in terms of speed. Currently, I am still debugging and profiling my CGP implementation (not open-sourced yet). As for the evaluation, I took a different approach. I first translate the tree into a Julia function in its string form. For example, given a tree with root "+" and two leaves "x" and "y", I first produce the following function (as a string)
function f(x, y)
    return x + y
end
In general, each node corresponds to one line of code. Next, the function string is compiled into an executable function with Meta.parse and eval. I don't know whether you tried this method, but I guess it may be faster than traversing the tree each time the function is called, especially in a more general setting where you cannot simply vectorize - for example, when the function is called many times during a game until the game ends. Of course, I believe your approach is faster for the specific symbolic regression problem X --> Y thanks to ready vectorization. In my current workflow, however, each sample of X is processed separately. But it is possible to accept vectors as well, just like yours, by changing x+y above to x .+ y.
Let's come back to your reply above. Could you explain a few points further?
Once detected, the evaluator immediately breaks out of the evaluation loop, and returns (array, false).
Do you mean that the output of each node is checked and the evaluation terminates once a NaN or Inf appears?
Tuple{AbstractVector{T}, Bool}
I do not see where you check the Bool flag in the code you posted above.
My finding is that it's slower to write protected functions; it's better to detect blow-ups in the evaluation code, and throw out the equation that produced them
What exactly do you mean by "throw out"? Do you mean assigning the tree (i.e., individual) a very bad fitness so that it is unlikely to be selected to produce offspring? Or do you mean eliminating the individual from the population completely? My feeling is that, even if a tree generates NaN or Inf, it may still be useful after mutation or crossover. I assume that you are using Koza's classic GP.
My current implementation of CGP and symbolic regression using CGP is very slow (just finished two hours ago 😜). I will debug and profile the code in the next few days. I am not sure whether the slowness is caused by numerical issues like NaN (no special handling for now), by the Meta.parse & eval workflow, or by something else. Nonetheless, I agree that your handling of numerical issues is better than protected operations.
Nice, CGP is pretty interesting!! It uses gradients, right?
That project sounds really exciting! I'm interested to see how it goes. Btw you should check out this paper: https://astroautomata.com/paper/symbolic-neural-nets/. You basically learn a neural network on your problem first, then extract symbolic equations from the net second. I think this is an easy way to learn equations for general problems using a simple symbolic regressor, rather than needing to write more complex regressors.
Next, the function string is compiled into an executable function with Meta.parse and eval.
I actually tried this at one point. It was extremely slow: quite literally like 1,000,000x slower than traversing the tree using pre-compiled recursive functions. I think even pure Python & numpy will be faster than parsing expression trees like this, since Julia's compilation takes some time. Basically: Julia = slow compile + fast evaluation. Python = zero compile + medium evaluation. But if you need to re-compile things in Julia (= eval), you lose the benefits.
I definitely recommend taking a similar approach to this package! It is not too bad to write a simple version; here's a simplified version of how I evaluate:
function evalTree(X, tree)
    if tree.degree == 0
        # leaf node: either a constant or a feature column
        if tree.constant
            return tree.value
        end
        return X[tree.feature]
    elseif tree.degree == 1
        # unary node: evaluate the child, then apply the operator
        x = evalTree(X, tree.left)
        return unary_operators[tree.op](x)
    else
        # binary node: evaluate both children, then apply the operator
        x = evalTree(X, tree.left)
        y = evalTree(X, tree.right)
        return binary_operators[tree.op](x, y)
    end
end
Do you mean that the output of each node is checked and the evaluation terminates once a NaN or Inf appears?
Exactly!
Or do you mean to eliminate the individual from the population completely?
I am using regularized evolution with my own spin on it - I do simulated annealing as well. (I still need to write a paper or technical doc on how this all works... date TBD)
Basically it amounts to this:
So basically if there is a NaN or Inf, I just reject the new equation during step 4. It's the same as throwing it out completely. I'm not sure if this prevents the search from finding certain equations, but it seems to really improve the speed.
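As a rough illustration only (mutate_tree, score_loss, and accept_or_reject are hypothetical helpers, not the package's actual API), the rejection might look like this inside a single mutation step:

# Illustrative sketch, not the package's code.
function try_mutation(member, X, y, options)
    new_tree = mutate_tree(member.tree, options)               # propose a mutated equation
    prediction, completed = evalTreeArray(new_tree, X, options)
    if !completed                                               # a NaN/Inf was hit mid-evaluation
        return member                                           # reject outright: the old member survives
    end
    loss = score_loss(prediction, y, options)
    return accept_or_reject(member, new_tree, loss, options)    # annealing-style acceptance
end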
In the code in EvaluateEquation.jl, I check in the auxiliary functions rather than in the main if statement, e.g., here: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/50c2bd3cd325924cf748263271147b8ad10054f2/src/EvaluateEquation.jl#L55
Another option is to basically fork this package, change the data type from AbstractVector{T} to whatever datatype you need in your application, and change the operators to work on your specific data structure. Then you get to keep the same mutation functions, evolution, and distributed search code. The nice thing about Julia is that types are really easy to manipulate, so you can make the code very generic across different applications.
Nice, CGP is pretty interesting!! It uses gradients, right?
(Sorry, I was thinking about differentiable CGP, which is a different symbolic regression software.)
But in general I am also interested in adapting SymbolicRegression.jl to general programs rather than just expressions... Let me know if you want to collaborate on this. SymbolicRegression.jl doesn't use gradients anywhere, so in principle this should already be possible!
Thank you very much for the detailed explanation. It really helps. I guess the slow speed is due to eval, but I need profiling to confirm it. I have played with some Python packages before in which eval is used, but as you said: "Basically: Julia = slow compile + fast evaluation. Python = zero compile + medium evaluation."
You are right that the tree structure in your package can in principle be extended beyond symbolic regression. I will focus on CGP for now, because CGP seems to be simpler yet more flexible than the classic tree GP. As you mentioned already, it can be made differentiable. It is always a pain to learn numeric coefficients in general genetic programming. Of course, I believe it is also possible to make SymbolicRegression.jl differentiable w.r.t some numeric weights thanks to the powerful automatic differentiation infrastructure in Julia.
Another reason that I adopt the eval method is that I want to leverage the GPU with CUDA.jl; otherwise the tree itself has to be copied to the GPU for execution as well. A GPU is a massively parallel device and seems ideal for evolutionary computation even if we do not have access to a CPU cluster. You may also consider this direction in future development. If the interface AbstractVector{T} is used exclusively, then SymbolicRegression.jl should already support CuArray from CUDA.jl, which can accelerate large datasets.
I will play with CGP first for now. If the performance is satisfactory, I can provide it as another backend of your pysr. If not, it is a good idea to fork your repository for further development. (The goal is not only to solve a particular problem but also to practice my Julia skills; and that's why I want to write it from scratch 😃).
I never tried vector inputs/outputs in GP trees before. May I ask you some questions?
Does your code produce an intermediate variable (also a vector) for every node? We know that Julia can fuse "dot" operators like d = a .+ b .+ c. In a naive tree traversal, we may first compute temp = a .+ b and then d = temp .+ c. However, the allocation of a temporary vector temp may be expensive. How do you leverage this fusion capability, if at all, in the evaluation of a tree?
The operator at each node is vectorized by "dot", right? In that case, if m input variables in X have been picked by a tree, then we select the m-variable subset of X to feed into the tree, and the final output of the tree should also be m-dimensional, right? In this case, do you perform some extra reduction operation at the end to get a single-dimensional output? For example, something like scaled symbolic regression, which also helps to identify more accurate numeric coefficients. In plain words, GP does the feature engineering here, and we perform a linear regression on the transformed features to approximate the target variables Y.
Sounds good, look forward to seeing it.
Yes, CuArray <: AbstractArray (i.e., CuArray is a subtype of AbstractArray), so in principle a lot of the equation evaluation is already ready for GPUs. But I probably need to use dot notation instead of my manual loops so the CUDA kernels get generated properly.
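For example (assumed usage, not code from the package), the same dotted code path runs on the GPU because broadcasting over a CuArray generates the kernel automatically:

using CUDA

a = CUDA.rand(Float32, 10_000)
b = CUDA.rand(Float32, 10_000)
a .= max.(a, b)   # one fused GPU kernel; no hand-written kernel or manual loop needed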
For these other questions you should post on the Julia discourse since the questions are general to Julia and not specific to my package; you’ll probably get more helpful answers from people who know more than me!
Re: memory allocation, this is basically done via a .= op.(a, b) at each node. So you aren't creating new arrays, since you re-use a via the .= operator.
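A tiny standalone illustration of that pattern (not the package's exact code): even with the operator picked at run time, the fused in-place broadcast writes into the existing buffer instead of allocating a temporary.

binary_operators = (+, *, -, /)

a = rand(Float32, 1000)
b = rand(Float32, 1000)
op = binary_operators[2]   # operator chosen at run time (here, *)
a .= op.(a, b)             # fused broadcast: the result overwrites a, no temporary vector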
Btw; I got this package working on CUDA. Julia really makes this so easy, it's awesome. It took me like 10 minutes to get it working!
It's on the cuda branch if you want to check it out.
@MilesCranmer That's great. There are very few open-source packages that can work on CUDA. Yours may be the first one! It would be better to show performance benchmarking results in your documentation.
Let me know if you need access to the GPU CI servers.
Thanks @ChrisRackauckas; will do!
cc @ShuhuaGao Regarding the CUDA stuff, I think the best option is to batch over all trees at once in a population (e.g., for 1000 individuals, one could shuffle them, pick the best individual out of each group of 10, mutate those 100 trees, then update their scores all at once rather than consecutively). But this will be more work: to do this efficiently, I basically need to put every single tree into a binary heap format (i.e., assume every equation is a full binary tree up to the maximum depth, with separate heap arrays indexing operators, features, etc.), stack those into a matrix, and evaluate them all at once in a single kernel (using the operator array to pick which operator to use inside each CUDA thread, etc.). I think this is one way to get the full speed out of a GPU. But it will obviously be tricky to set up.
@MilesCranmer Thanks a lot. But note that
@ShuhuaGao - thanks; really useful info!
1. It's been a few years since I've done serious C++ CUDA programming, so correct me if I'm wrong. But note that since I know all the operators ahead of compilation time, I can compile all the operators into a single kernel with a small switch statement. IIRC the GPU is usually smart enough to deal with non-diverging branches like this (e.g., i == 1 ? cos(x) : sin(x)) in an efficient way, and it won't even be noticeable. It's only diverging branches, where you have very different behaviour in each branch, where performance gets hit and you need to start worrying about warps.

But if that doesn't work, I guess I can just run every single operator on the data (there are only about 10 operators at most, and a tiny amount of data, so this would be negligible on a modern GPU), mask out the operators not used, then sum.
2. You have the operators fixed, right? You could just compile a kernel when the user defines the operators, like:

function f(data, operator_index)
    i = threadIdx ...
    return operator_index[i] == 1 ? sin(data[i]) : cos(data[i])
end

right? Or just multiply: sin(data[i]) * (operator_index[i] == 1) + cos(data[i]) * (operator_index[i] == 2). This lets the GPU do its matrix magic, so I think this would be pretty fast. Thoughts on that?
Also check out this for another option: https://discourse.julialang.org/t/optimizing-cuda-jl-performance-for-small-array-operations/54808/4

3. The idea I mentioned above combines both data-wise and equation-wise parallelization. I can evaluate something like 100 x cores trees at once, so this is a good opportunity for more parallelism. The binary heap idea I described above should allow you to work on many trees at once with a single kernel.
Basically: let i index the tree, j index the node (in binary heap format), k index the data feature, and l index the data row. In this structure, j=1 is the root node, j=2 and j=3 are its children, j=4 and j=5 are the children of node 2, and so on.
Then, define OP[i, j] as the operator index at every node in every tree (and 0 where no operator exists). Define DEG[i, j] as the degree of every node in every tree, CONST[i, j] as the constant value at each node (if such a value exists, otherwise just 0), and FEAT[i, j] as the feature index to use. Let X[k, l] be your data array.
Then, you can basically do this all in a single kernel. Let each thread use OP and DEG to select the operator in the kernel (again, this is a tiny switch statement), then use the pre-compiled instruction for computation, with X[FEAT[i, j], :] and CONST[i, j] as the data.
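An illustrative sketch of that indexing (apply_op, the OUT scratch array, and the thread layout are assumptions, not existing code). In this sketch each thread sidesteps the depth-synchronization issue by evaluating one whole tree for one data row, walking the heap bottom-up:

# Illustrative only. OP, DEG, CONST, FEAT are the stacked heap arrays defined above;
# FEAT[i, j] == 0 is taken to mean "use the constant", and apply_op is a hypothetical
# switch over the pre-compiled operators.
function eval_heaps_kernel!(OUT, OP, DEG, CONST, FEAT, X, max_depth)
    i = blockIdx().x                        # which tree
    l = threadIdx().x                       # which data row
    for j in reverse(1:(2^max_depth - 1))   # children (2j, 2j+1) are filled before parent j
        if DEG[i, j] == 0
            OUT[i, j, l] = FEAT[i, j] == 0 ? CONST[i, j] : X[FEAT[i, j], l]
        elseif DEG[i, j] == 1
            OUT[i, j, l] = apply_op(OP[i, j], OUT[i, 2j, l])
        else
            OUT[i, j, l] = apply_op(OP[i, j], OUT[i, 2j, l], OUT[i, 2j + 1, l])
        end
    end
    return nothing
end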
The only tricky bit is getting the threads at each depth level to wait until their children have completed.
Thoughts on this?
See this answer too: https://stackoverflow.com/q/41009824/2689923.
Anyways, this kind of stuff will really have no effect on such a small dataset x number of trees. On any modern NVIDIA GPU, even running an additional 99 threads for every 1 operation here will not put it close to full utilization. Really just want to minimize the number of kernel launches and data transfers
Hi, @MilesCranmer, many thanks for the useful info. But I still do not understand how you can execute multiple trees simultaneously even with a fixed set of operations. In your example above
function f(data, operator_index)
i = threadIdx ...
return operator_index[i] == 1 ? sin(data[i]) : cos(data[i])
end
The above code will lead to branch divergence inside a warp. Say, for threads 1 to 32 in a warp, the instructions must operate in lock-step: while one thread executes sin, another thread cannot execute cos. Thus I think it essentially runs the 32 trees sequentially rather than in parallel. It is essentially equivalent to running 32 different functions, one per thread, in a single warp.
As for sin(data[i]) * (operator_index[i] == 1) + cos(data[i]) * (operator_index[i] == 2), there seems to be no divergence, but the cost is that each operator has to be computed even if it is not used, right? By contrast, if the compiler is smart enough to eliminate the multiplication by false, then we come back to the above case again.
The problem I am currently investigating looks like a reinforcement learning one: given an initial state s0, I try to evolve a decision-maker (i.e., a policy) pi that maps a state to an action in sequential decision making to get the final reward. That is, there is only one data sample, i.e., s0, but I need to parallelize the execution of multiple trees. It is easy to do this kind of parallelism on a CPU, but a normal CPU has only about 8 cores.
Your suggestion of a fixed set of operators and a switch-case seems to be a workaround for my above question on the Julia Discourse. My understanding is that we can execute different trees concurrently by assigning each tree to a warp (which occupies 32 threads), since different warps are not lock-stepped. Inside each warp, to make full use of the 32 threads, we may consider 32 initial states for my example above. That is, you have a tree (a policy), and this single tree plays 32 games in parallel inside a single warp. Different trees then run in different warps, in parallel as well.
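A rough sketch of that layout (run_policy! and the reward/state layout are assumptions, purely illustrative):

# One warp per tree/policy, one game (initial state) per lane within the warp.
function warp_per_tree_kernel!(rewards, trees, states)
    t    = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    warp = (t - 1) ÷ 32 + 1                  # which tree this warp evaluates
    lane = (t - 1) % 32 + 1                  # which of the 32 games this thread plays
    rewards[warp, lane] = run_policy!(trees[warp], states[lane])   # hypothetical rollout
    return nothing
end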
I need to emphasize this point again: I don't think any of this matters. Modern GPUs have such a ridiculous number of cores and are so highly optimized, that when writing kernels, you can usually completely ignore huge chunks of threads and just have the GPU discard them. For SymbolicRegression.jl, I don't expect one to ever fully utilize say a v100 GPU, even if you are doing things in a very redundant way; computing every single operator and then masking. But if you want to check, you could compare the performance of a kernel with this branch and without... I don't think it will be much different.
But back to this question about operators. Again I don't think it matters, but I am just curious about what you mean.
I guess what I am trying to understand is... why do you consider (o[i] == 1) ? sin(x[i]) : cos(x[i]) a performance-hurting branch divergence? This is such a simple conditional... Even the sin operation itself contains lookup tables for different domains of input... so why don't you consider computing sin a branch divergence too?
For example, this is an implementation of sin in C (http://www.netlib.org/fdlibm/k_sin.c):
#ifdef __STDC__
double __kernel_sin(double x, double y, int iy)
#else
double __kernel_sin(x, y, iy)
double x,y; int iy; /* iy=0 if y is zero */
#endif
{
double z,r,v;
int ix;
ix = __HI(x)&0x7fffffff; /* high word of x */
if(ix<0x3e400000) /* |x| < 2**-27 */
{if((int)x==0) return x;} /* generate inexact */
z = x*x;
v = z*x;
r = S2+z*(S3+z*(S4+z*(S5+z*S6)));
if(iy==0) return x+v*(S1+z*r);
else return x-((z*(half*y-v*r)-y)-v*S1);
}
but according to how you define a performance-hurting branching statement in your comment above, one can't even compute sin in the same warp, since it contains an if statement... So how can one argue that it is impossible to execute (o[i] == 1) ? sin(x[i]) : cos(x[i]) in a warp, if simple stdlib operators themselves also contain such simple branching tables?
Cheers, Miles
@MilesCranmer You are right that we generally cannot make full use of the modern GPU's computation power. I understand that sometimes branching is not avoidable and we can of course involve conditional statements in the kernel. My point is that I want to evolve one generation as quickly as possible rather than care about the utilization rate of the GPU.
Let's say you have 32 trees. Scheme A is to launch a kernel with 32 threads and evaluate each tree per thread. Scheme B is to launch a kernel with 32*32 threads and evaluate the trees in thread 1, 33, 65... etc.
function schemeB(trees)
    if threadIdx().x % 32 == 1
        # evaluate(...) stands in for the actual tree evaluation
        evaluate(trees[threadIdx().x ÷ 32 + 1])
    end
    return nothing
end
Though neither scheme makes full use of the GPU, I believe Scheme B is faster, because the trees are executed in parallel, each in a different warp (far less divergence, even if you consider internal details like sin), while Scheme A is slow because there is obviously a lot of divergence caused by the different instructions in the trees themselves. The bad part of Scheme B is that the remaining 31 threads in each warp actually do nothing; but as you emphasized, there are quite a lot of cores in a GPU, and it may still be enough as long as the number of individuals in GP is not too large.
Ah, I see what you mean. That is an interesting idea. I'm also not sure which scheme would be faster. But it would be easy to test this out.
But I think if you actually need to worry about warps and kernel tuning, that means you already have really, really good GPU code and are at high utilization. The things that will bottleneck you much, much earlier are: (1) lack of parallelism, and (2) data transfer. I think once those two are solved, then you can start worrying about warps... e.g., I think if I solve (1) with this binary heap idea, I'll get maybe a 1,000x speed improvement over the current method, which launches a kernel for each node of each tree. By that point I'll probably be CPU-bound again, and won't have to worry about warps anyway :P
@MilesCranmer We are on the same page now.
launching a kernel for each node of each tree
That is obviously inefficient if the dataset is small. Perhaps a custom kernel may help. Besides, how about generating a single expression for each tree, which I mentioned days ago? For example, first compiling the tree into f(x) = sin.(cos.(x) .+ x .- x.*x) can leverage Julia's fusion capability, and I guess only a single kernel is launched. Though eval is typically slow, it may still win due to expression fusion.
Yeah, custom kernel is probably the way to go. And a binary heap too; then I can load an array of trees as a 2D array of integers basically. Then vectorization is super easy, because I assume every tree has the same topology, and simply mask out parts of the tree which are undefined.
Generating an expression for each tree will be really really slow, unfortunately. I mentioned this earlier. That forces Julia to re-compile the kernel every single launch!! Basically: don't work with expressions. Work with a custom tree data type containing lightweight information that tells pre-compiled recursive functions what to do.
What I do is compile an evaluator function for every operator, indexed with function evaluator(..., ::Val{i}) where i. The ::Val{i} trick forces a specialized compilation for each operator, so things are fixed. Then I recursively call the evaluator function while traversing the tree. So there is no further compilation after the first few evaluations, and things are fast.
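A minimal sketch of that Val-indexed idea (illustrative names, not the package's exact code): when i is a compile-time constant, the operator lookup is resolved statically.

unaops = (sin, cos, exp)   # assumed operator tuple

function evaluator(x::AbstractVector, ::Val{i}, unaops) where {i}
    return unaops[i].(x)   # i is a type parameter, so unaops[i] is resolved at compile time
end

evaluator([0.0, 1.0], Val(2), unaops)   # applies cos elementwise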
Btw @ShuhuaGao - not sure if you meant this; did you mean using eval and Meta.parse once, or for every tree? Using them for every tree is really slow. But I think using them once at the beginning of a search (i.e., once the user declares the options) is actually a good idea - it's basically like writing code at runtime! So you could basically take the user-defined operators, generate a kernel for them, and compile it.
Example of my binary heap idea below. Interested to hear your thoughts.
Here's a tree:
julia> printTree(tree, options)
(cos((x1 * 3.0) - x3) + 2.0)
This tree is equal to this in array format:
EquationHeap([1, 1, 0, 3, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0], Float32[0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0], [0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0], [2, 1, 0, 2, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0])
assuming the following struct:
mutable struct EquationHeap
    operators::Array{Int, 1}
    constants::Array{CONST_TYPE, 1}
    features::Array{Int, 1}   # 0 => use constant
    degree::Array{Int, 1}     # 0 for degree => stop the tree!
    EquationHeap(n) = new(zeros(Int, n), zeros(CONST_TYPE, n), ones(Int, n), zeros(Int, n))
end
and this choice of options:
julia> options = SymbolicRegression.Options(binary_operators=(+, *, -, /),
                                            unary_operators=(cos, exp),
                                            npopulations=30);
(which determines what index corresponds to what operator)
Here's a function to turn this tree into an EquationHeap object, which is just arrays:
function populate_heap!(heap::EquationHeap, i::Int, tree::Node)::Nothing
    if tree.degree == 0
        heap.degree[i] = 0
        if tree.constant
            heap.constants[i] = tree.val
            return
        else
            heap.features[i] = tree.feature
            return
        end
    elseif tree.degree == 1
        heap.degree[i] = 1
        heap.operators[i] = tree.op
        left = 2 * i
        populate_heap!(heap, left, tree.l)
        return
    else
        heap.degree[i] = 2
        heap.operators[i] = tree.op
        left = 2 * i
        right = 2 * i + 1
        populate_heap!(heap, left, tree.l)
        populate_heap!(heap, right, tree.r)
        return
    end
end

function heapify_tree(tree::Node, max_depth::Int)
    max_nodes = 2^(max_depth - 1) + 1
    heap = EquationHeap(max_nodes)
    populate_heap!(heap, 1, tree)
    return heap
end
Thus, we get an array form for an arbitrary equation. So we can just run through ~1000 equations in a single kernel call.
Tbh; would probably be even faster if we work with sparse arrays too.
@MilesCranmer Thanks for the sharing and sorry for my late reply (it is Chinese New Year holiday now).
As you know, the most time-consuming part of evolutionary computation is the evaluation. So my main concern is how to conduct evaluation on a GPU. In your example above, each tree is essentially encoded in an array (i.e., binary heap). These trees can be easily copied to the GPU for evaluation.
During the evaluation, how would you actually execute an operator? For example, given an op encountered, how do you get the underlying function, since we cannot use syntax like operators[op] (where operators is a list of functions) in a kernel? Should it be hardcoded in the kernel as follows?
if op == 3
    operator = sin
elseif op == 4
    operator = cos
...
If so, the set of available operators must be fixed, right? It seems that your python wrapper allows custom operators.
Also, in the most common case of symbolic regression, with an array of inputs X containing many samples, how do you avoid producing temporary arrays at each node during evaluation? Possibly by reusing preallocated arrays? I have no clear idea regarding these issues for now.
Overall, the idea of encoding GP trees into plain arrays is great for moving them between CPU and GPU. My previous naive implementation defines a struct Node that has a Function as one of its fields directly, but unfortunately such a Node cannot be sent to the GPU since it is not a bits type.
No worries; happy Chinese New Year :)
Should it be hardcoded in the kernel as follows?
Exactly like this. This is how SymbolicRegression.jl already works.
If so, the set of available operators must be fixed, right?
Yes. This is done already.
It seems that your python wrapper allows custom operators.
Yes. As does the Julia backend. There is no inconsistency between this statement and the one above. The Options struct is immutable, and changes type every time you change the functions: https://github.com/MilesCranmer/SymbolicRegression.jl/blob/77cb208345f171d0db4ef3aacbbff5b3b554cfb1/src/Options.jl#L97-L100. This causes the evaluation code to specialize to a particular choice of operators (imagine the Julia compiler seeing the list of Options and then writing out that if statement manually), and to re-compile if you change the Options on your second run.
Possibly by reusing preallocated arrays?
Yeah, this is what SymbolicRegression.jl already does. This is really important for the fast evaluation speed.
My previous naive implementation defines a struct Node that has a Function as one of its fields directly, but unfortunately, such Node cannot be sent to GPU since it is not bits type.
Putting a Function into the Node makes the code a lot slower. I learned this from an earlier implementation.
I would just fork SymbolicRegression.jl and use the internal types! It's already got so many of these hard-fought optimization tricks that I found via experimentation, and once you have something working, it can be cross-compatible, so less rewritten code. I want to use general program synthesis too :)
Should it be hardcoded in the kernel as follows? Exactly like this. This is how SymbolicRegression.jl already works.
I do not find the hardcoded routine in your code. Possibly we have different meanings: I mean a really hardcoded if-else. For example, see the following function:
https://github.com/MilesCranmer/SymbolicRegression.jl/blob/77cb208345f171d0db4ef3aacbbff5b3b554cfb1/src/EvaluateEquation.jl#L101
there are lines like

op = options.unaops[op_idx]
...
x = op(cumulator[j])::T

which can be simplified to options.unaops[op_idx](arg). However, if I am not wrong, such syntax is not allowed in a CUDA kernel (see the discourse thread I posted several days ago).
Now let's try the Val trick.
using CUDA

function get_operator(::Val{1})
    return sin
end
function get_operator(::Val{2})
    return cos
end

function process(x::Float32)
    for i in (1, 2)
        # Val(i) with a non-constant i needs the Julia runtime, which GPU code cannot call
        op = get_operator(Val(i))
        x = op(x)
        @cuprintln("x = $x")
    end
    nothing
end
and launch the kernel
@cuda process(2.3f0)
Unfortunately, the above code snippet does not work on the GPU.
LoadError: InvalidIRError: compiling kernel process(Float32) resulted in invalid LLVM IR
Reason: unsupported call to the Julia runtime (call to jl_f_apply_type)
For now, I guess only the hardcoded kernel below can work.
function process_hard(x::Float32, i::Int)
    if i == 1
        op = CUDA.sin
    elseif i == 2
        op = CUDA.cos
    else
        op = identity
    end
    x = op(x)
    @cuprintln("x = $x")
    nothing
end
@cuda process_hard(2.3f0, 33);
There is a long if-else branch for a given operator set.
I do not find the hardcoded routine in your code. Possibly we have different meanings: I mean really hardcoded if-else.
This is hard-coded :) See this thread where I learned this technique: https://discourse.julialang.org/t/meta-programming-an-if-else-statement-of-user-defined-length/53525
Also, see this: https://discourse.julialang.org/t/optimizing-cuda-jl-performance-for-small-array-operations/54808/4 for selection functions. I think something about your example code is broken. You should be able to do this though.
But if you don't want to do those tricks, I think you can always generate a string that defines your function and pass it to eval(Meta.parse(...)). You obviously don't want to do that very often, but if you use it to set up your kernel (i.e., once per runtime), my guess is that it's okay.
See, here is a function I generate and compile:
(again; if you just do this once, it's fine! just not once for every tree)
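For illustration (this is just a sketch of the idea, not the actual generated function): since the operator list is known once the user sets the options, the if-else body can be generated and compiled a single time at runtime.

# Illustrative sketch: build the operator switch once from a user-chosen tuple, then compile it.
unaops = (sin, cos, exp)   # assumed user-chosen operators

branches = [:(op_idx == $i && return $f(x)) for (i, f) in enumerate(unaops)]
@eval function apply_unaop(op_idx::Int, x)
    $(branches...)
    return x               # fallback: identity
end

apply_unaop(2, 0.0f0)      # == cos(0.0f0) == 1.0f0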
Did you mean dispatching on the integer value with ::Val{i}? It is similar to the hardcoded version, but it does not work with CUDA unless i is a literal constant.
The second reference you listed is not quite relevant. There seem to be many differences between CPU and GPU. The core problem I don't understand is: given an integer i (e.g., one from the binary heap), how do we choose the proper operator in a CUDA kernel? A naive fs[i] is not allowed. The ::Val{i} trick does not work, either. The only clumsy workaround I know for now is to hardcode an if-else sequence.
For instance, supposing there are three operators sin, cos, identity, how would you launch a kernel like the following to apply the i-th operator?
function process(x::Float32, i::Int)
    op = ???
    x = op(x)
    @cuprintln("x = $x")
    nothing
end
@cuda process(3.2f0, 1)
Of course, generating the if-else kernel with eval is an option that increases flexibility and only needs to be done once before evolution. This is also exactly what I am considering.
Hm, yeah, I guess just going with the eval trick is the best bet then! To be honest, I think it's even easier to interpret than using Val{i}, and also more extensible.
If you get a kernel working for this using heaps, want to submit it to the cuda branch? :)
Hi, @MilesCranmer , sure, will let you know. We have a master student working on this now.
Hi, @MilesCranmer. This is not an issue but I am writing to ask for your help on some design considerations.
In symbolic regression code, we usually use protected versions of primitive functions to avoid numerical errors like division by zero. Of course, division by zero is allowed in Julia and the NaN value can be propagated. I looked into your Operators.jl and found that most functions are not protected. I am not sure whether you handle potential errors somewhere else. Consider the following case using functions in Operators.jl, which should throw an error as follows.
Also, the exp function may produce huge values or even Inf if the input is a large number, which can affect numerical stability. Currently, I am also working on symbolic regression, but using a different genetic programming approach. It would be helpful if you could discuss how you handle such errors.