brianguenter / FastDifferentiation.jl

Fast derivative evaluation
MIT License

Poor scaling for linear MPC example #26

Closed: baggepinnen closed this issue 1 year ago

baggepinnen commented 1 year ago

The following example builds a function to evaluate the dynamic constraints for a linear MPC problem. The size of the problem can be scaled up and down by modifying N, nx, nu. Already for rather small problems with roughly 6000 variables, it takes quite a long time to build the expressions and to make the function

compute cons expression: 2.007440 seconds (9.51 M allocations: 488.824 MiB, 14.34% gc time, 71.54% compilation time)
make_function: 6.885061 seconds (25.78 M allocations: 1.440 GiB, 15.88% gc time, 13.79% compilation time)

but the function takes forever (more than several minutes) to compile.

using FastDifferentiation

N = 100
nx = 40
nu = 20
ny = 20
Ts = 0.01
x = make_variables(:x, nx, N+1)
u = make_variables(:u, nu, N)
x0 = make_variables(:x0, nx)

# Simulate the dynamics f forward from x0, applying one column of u per step
function rollout(f, x0::AbstractVector, u)
    T = promote_type(eltype(x0), eltype(u))
    x = zeros(T, length(x0), size(u, 2)+1)
    x[:, 1] .= x0
    rollout!(f, x, u)
end

function rollout!(f, x, u)
    for i = 1:size(x, 2)-1
        x[:, i+1] = f(x[:, i], u[:, i])
    end
    x, u
end

const A = 0.1randn(nx,nx)
const B = randn(nx,nu)

f(x, u) = A*x + B*u

# Residual between the trajectory rolled out from the global x0 and the
# decision variables x
function dynamics_cons(x, u)
    xp, _ = rollout(f, x0, u);
    vec(xp .- x)
end

@time "compute cons expression" c = dynamics_cons(x, u);

vars = [x0; vec(x); vec(u)]

@time "make_function" fun = make_function(c, vars; in_place=false)

v = randn(length(vars))
@time fun(v);

There is obviously a lot of structure in this problem, perhaps this structure could be utilized to create more efficient functions?

brianguenter commented 1 year ago

I used the function number_of_operations (not exported from FastDifferentiation but maybe it should be) to count the number of operations in this function. I adjusted the parameters downwards from your original numbers:

# your original numbers
# N = 100
# nx = 40
# nu = 20
# ny = 20
# Ts = 0.01

# new numbers giving smaller graph size
N = 40
nx = 40
nu = 2
ny = 2
Ts = 0.01

With the smaller numbers the graph had 134,440 operations; generated code size becomes an issue around this size. With your original parameters the expression would be enormous by the standards of FD.
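For reference, the count can be reproduced with something along these lines (a sketch: number_of_operations is unexported, so it needs module qualification, and the exact call signature is assumed here):

# Sketch: count operations in the vector of constraint expressions.
# number_of_operations is not exported from FastDifferentiation.
n_ops = FastDifferentiation.number_of_operations(c)
@show n_ops  # 134_440 with the smaller parameters above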

FastDifferentiation does not currently support looping directly so instead it unrolls all loops. Somewhere in the range of 1e5 to 1e6 operations the generated functions become large enough that FD takes a while to process them.

This limitation on expression graph size is mentioned in the limitations section of the documentation but perhaps it should be emphasized. And maybe I should export number_of_operations so people can measure size accurately.

The compilation time that took several minutes is entirely LLVM. There's no FD processing involved at that stage. I can't do anything about that.

But I can do something about the size of the generated code. I have several ideas for addressing this looping problem but it will be months before you see any of them released.

brianguenter commented 1 year ago

I noticed that you aren't computing any derivatives. It might still make sense to compute derivatives of this function since it appears to be quite sparse.

I don't know which parameters you want to differentiate with respect to but I computed this derivative

 jac = sparse_jacobian(c, vec(x))

for these parameters

N = 2
nx = 40
nu = 2
ny = 2
Ts = 0.01

and got a sparsity of 0.0083. Doubling the N parameter reduces the sparsity to 0.005. At the scale of your original problem the sparsity might be a very small number, although it would probably take a long time to compute the Jacobian.
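For reference, the sparsity figure can be computed along these lines (a sketch; it assumes sparse_jacobian returns a sparse matrix that SparseArrays.nnz understands):

using SparseArrays

# Fraction of structurally nonzero entries in the Jacobian.
jac = sparse_jacobian(c, vec(x))
sparsity = nnz(jac) / length(jac)  # ≈ 0.0083 for N = 2, nx = 40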

This looks like one of those problems where the Jacobian that FD produces might take just a small constant times the number of operations of the original function.

If preprocessing time is a barrier then FD may not work for your application. However, I would not be surprised if the alternative AD algorithms are much slower at run time than the FD generated Jacobian.

brianguenter commented 1 year ago

One last comment. For reasons I don't fully understand, the FD graph processing is very slow. The code is almost completely unoptimized because I wanted to make it correct and feature complete before worrying about optimization. But over the next few months I will be looking at this. There won't be miracles, but speed might increase by 5x-10x. That won't help with the LLVM compile times, though.

brianguenter commented 1 year ago

Does this address your issue? If it does I'll close it.

baggepinnen commented 1 year ago

I slightly modified the problem structure to make it a whole lot sparser; the following takes

compute cons expression: 0.059047 seconds (135.67 k allocations: 5.062 MiB, 85.24% compilation time)
make_function: 0.032015 seconds (365.44 k allocations: 23.127 MiB, 44.90% gc time)
first function call: 40.128536 seconds (1.33 M allocations: 62.672 MiB, 0.03% gc time, 100.00% compilation time)
Benchmark:  5.016 μs (2 allocations: 55.11 KiB)
using FastDifferentiation, BenchmarkTools

N = 80
nx = 8
nu = 2
ny = 2
x = make_variables(:x, nx, N+1)
u = make_variables(:u, nu, N)
x0 = make_variables(:x0, nx)

const A = 0.1randn(nx,nx)
const B = randn(nx,nu)

f(x, u) = A*x + B*u

# Constraint residuals: initial-condition constraint plus one-step dynamics
function dynamics_cons(x0, x, u)
    c = similar(x)
    c[:,1] = x0 - x[:, 1]
    for i = 1:size(x, 2)-1
        c[:, i+1] = f(x[:, i], u[:, i]) - x[:, i+1]
    end
    vec(c)
end

@time "compute cons expression" c = dynamics_cons(x0, x, u);
vars = [vec(x); vec(u)]
jac = sparse_jacobian(c, vars)
@time "make_function" fun = make_function(jac, vars; in_place=false)
v = randn(length(vars))
@time "first function call" fun(v)
@btime $fun($v);

The first function call (LLVM) time is still very high. The resulting code benchmarks reasonably fast, but considering it's essentially just computing

    Ax = kron(speye(N + 1), -speye(nx)) + kron(spdiagm(-1 => ones(N)), Ad)
    Bu = kron([spzeros(1, N); speye(N)], Bd)
    Aeq = [Ax Bu]

it should be possible to reduce the LLVM compilation time substantially by recognizing this structure.

Here's the manually coded Jacobian for completeness

using SparseArrays, LinearAlgebra
speye(N) = spdiagm(ones(N))
function manual_jacobian(N, Ad, Bd)
    (nx, nu) = size(Bd)
    Ax = kron(speye(N + 1), -speye(nx)) + kron(spdiagm(-1 => ones(N)), Ad)
    Bu = kron([spzeros(1, N); speye(N)], Bd)
    Aeq = [Ax Bu]
end
jac_m = manual_jacobian(N, A, B)
jac_a = fun(v)
@assert norm(jac_m - jac_a) == 0 # test that they are equal

I'm exploring the use of FastDifferentiation for this since, once you add slack variables and some other niceties to this problem, manually coding the Jacobians for the QP becomes extremely tedious and error prone. To feasibly rely on FD for this, I need it to scale to much larger problem sizes than it currently does. It may very well be that FD is not the right tool for the job, and that's okay as well; I'm just exploring what's possible :)

brianguenter commented 1 year ago

I'm considering two methods for tackling the scaling problem. The first is what's called rerolling: recognizing indexing patterns in unrolled code and then automatically generating loops from that pattern. It's not necessary to completely undo the unrolling back to the original expression, just to reduce the expression size enough that LLVM doesn't choke at the compilation stage.
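As an illustration only (this is a sketch of the idea, not FD's actual code generator):

# Unrolled form, schematically one assignment per element:
#   result[1] = input[1] - input[1 + n]
#   result[2] = input[2] - input[2 + n]
#   ...
# Rerolled form: detect the affine index pattern and emit a loop whose
# code size is independent of n.
function rerolled!(result, input, n)
    for i in 1:n
        result[i] = input[i] - input[i + n]
    end
    result
end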

The second is a general framework for differentiating tensor expressions. This uses a different form of the derivative graph, which requires some wedging to fit into the FD framework. But it can automatically and efficiently handle all the common matrix operations, convolution, and so forth: anything that can be written as a tensor contraction on formulas of indices. This method has the advantage that there is no loop unrolling, so expressions remain compact.

How big are your problems, in terms of number of operations?

baggepinnen commented 1 year ago

The latest code above is very representative of linear model-predictive control (if you add a quadratic cost function). Small such problems have perhaps nx = 4, N = 10; large problems can easily have nx = 100, N = 1000, in particular if the system comes from a PDE discretization. In addition to the above, it's common to add slack variables, which can be thought of as doubling nx.

For nonlinear MPC (where f above is nonlinear in x or u), a common approach is to linearize around a trajectory, solve a linear MPC problem (a quadratic program), and then re-linearize around the new trajectory obtained from the previous optimization (this is called sequential quadratic programming). In this case, one may or may not need to compute the Hessian of the Lagrangian, where the dynamics constraint above calls the nonlinear f.

brianguenter commented 1 year ago

What LLVM compilation time would be acceptable? That will give me a target to shoot for.

Do you only have to compute the symbolic derivative once or does it change over time? If it changes over time does the structure of the function remain unchanged? This would be the case, for example, if variable coefficients change but the structure of the equation doesn't. I may be able to do something in this case that would process the constants specially so they can be updated without having to recompute the symbolic derivative and remake the function.

baggepinnen commented 1 year ago

I can already kind of do these things with Symbolics.jl, and am looking elsewhere due to its poor compile times, so FD should be substantially faster than Symbolics. In my experience, users (I'm developing software exposed to users through a GUI) really lose patience once you start exceeding 20-30 seconds. I also need to compile not only the constraint Jacobian but also the Hessian of the Lagrangian, so ideally the total time is smaller than that.

The structure never changes for these problems, but I do repeatedly compute Jacobians with different parameter values. In my example, x and u change.

brianguenter commented 1 year ago

If the times in your first message in this thread are representative of this problem class then the FD processing times are acceptable but the LLVM compile times are not.

Looks like FD will have to produce substantially smaller runtime generated functions to get those LLVM times where you want them. It will happen but as I said it will be a few months.

brianguenter commented 1 year ago

When you say x, u change, do you mean these are the variables you take derivatives with respect to? And you use different values of x0 for the same structural Jacobian?

If this is the case there might be a relatively straightforward fix. This is something I can see being generally useful. The initial compile times would still be very long but you wouldn't have to recompute the Jacobian or re-compile the function just because the values for x0 changed. Let me think about this for a few days.

baggepinnen commented 1 year ago

When you say x, u change, do you mean these are the variables you take derivatives with respect to? And you use different values of x0 for the same structural Jacobian?

Correct. In general, there may be several problem-specific parameters that can change between calls to the MPC controller. The gradient is not taken w.r.t. those parameters, but they do change between problem instances. Setting the initial condition of the MPC controller each iteration is one use case of this kind of non-differentiated parameter; https://github.com/brianguenter/FastDifferentiation.jl/issues/13#issuecomment-1590412452 lists some others.
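For the nonlinear case, where the Jacobian actually depends on such parameters, this might look roughly as follows (a sketch; it assumes make_function accepts input variables that were not differentiated with respect to):

# Differentiate only w.r.t. x and u, but keep x0 as a runtime input so
# new initial conditions don't require regenerating or recompiling.
diff_vars = [vec(x); vec(u)]
jac = sparse_jacobian(c, diff_vars)
fun = make_function(jac, [diff_vars; x0]; in_place = false)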

brianguenter commented 1 year ago

I noticed that the Jacobian of c is constant, so all the runtime generated function does is set constant values in an array.

The current FD code generator creates an assignment statement for each value. I don't know why it takes LLVM so long to compile code like this. But it wouldn't be that hard to detect long contiguous sequences of these values and write them more like this:

vals = [1.0,2.0,3.0,.....]

I did a simple test and found LLVM can process a 100,000 element array like this in 0.5 seconds on my laptop.
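A sketch of that kind of test (an assumed reconstruction, not the exact code):

# Build a function whose body is a single 100_000-element array literal
# and time its first call, which is dominated by compilation.
arr_literal = Expr(:vect, rand(100_000)...)
@eval bigconst() = $arr_literal
@time bigconst();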

Unfortunately either the Julia compiler or LLVM seems to have a length limit that prevents compiling expressions in the range of 1e6 array elements.

But this would make it possible to scale to N = 100, nx = 40 (number of non-zeros in the Jacobian = 172,040) with FD times still in the seconds and LLVM time perhaps less than 1 second.

And I might be able to break the array definition into pieces somehow so larger expressions could be handled.

If I can do this would this be useful to you? Several other people have posted problems where the Jacobian is constant and large so this might be a generally useful optimization.

This is the symbolic output for N = 2, nx = 2, nu = 2:

julia> test1()
6-element Vector{FastDifferentiation.Node}:
                                                                                                                            (x01 - x1_1)
                                                                                                                            (x02 - x2_1)
 ((((0.11526751354123002 * x1_1) + (0.11328140931571995 * x2_1)) + ((-0.8473378533858925 * u1_1) + (0.3686082615748242 * u2_1))) - x1_2)
 ((((0.018098192854194988 * x1_1) + (-0.1514665274098452 * x2_1)) + ((-0.4202610942649159 * u1_1) + (1.160902750563023 * u2_1))) - x2_2)
 ((((0.11526751354123002 * x1_2) + (0.11328140931571995 * x2_2)) + ((-0.8473378533858925 * u1_2) + (0.3686082615748242 * u2_2))) - x1_3)
 ((((0.018098192854194988 * x1_2) + (-0.1514665274098452 * x2_2)) + ((-0.4202610942649159 * u1_2) + (1.160902750563023 * u2_2))) - x2_3)

The Jacobian of this function with respect to (x, u) is constant:

6×10 Matrix{FastDifferentiation.Node}:
                  -1                    0.0                  0.0                    0.0  0.0  0.0  …                  0.0                  0.0                  0.0
                 0.0                     -1                  0.0                    0.0  0.0  0.0                     0.0                  0.0                  0.0
 -0.1484629505332367  -0.039882977729197897                   -1                    0.0  0.0  0.0       1.226063017934781                  0.0                  0.0
 0.07205232591435677     0.1767138131541932                  0.0                     -1  0.0  0.0     0.08761872050648517                  0.0                  0.0
                 0.0                    0.0  -0.1484629505332367  -0.039882977729197897   -1  0.0                     0.0   0.7512839281234882    1.226063017934781
                 0.0                    0.0  0.07205232591435677     0.1767138131541932  0.0   -1  …                  0.0  -0.3737496248908396  0.08761872050648517

I tried this for several values of N and nx, and all the Jacobians are constant, so all the runtime generated function does is fill in constant values in the Jacobian array. Is this what you intended, or is there something missing from the functions?

brianguenter commented 1 year ago

The tensor differentiation ideas I have would be much cleaner, but those are months away, whereas changing the code generation to handle constant array initialization could take a few days to a week.

baggepinnen commented 1 year ago

My example here used a linear dynamics model (linear MPC), in which case the Jacobian is indeed constant, given by this manually coded Jacobian:

function manual_jacobian(N, Ad, Bd)
    (nx, nu) = size(Bd)
    Ax = kron(speye(N + 1), -speye(nx)) + kron(spdiagm(-1 => ones(N)), Ad)
    Bu = kron([spzeros(1, N); speye(N)], Bd)
    Aeq = [Ax Bu]
end

Linear MPC is already a very useful problem to solve, but for nonlinear MPC (which is even more interesting), the Jacobian is not constant. It will, however, typically have the same sparsity structure.

If I can do this would this be useful to you?

That would certainly be very useful. Even for nonlinear MPC, where the Jacobian is non-constant, a significant part of the Jacobian is still likely to be constant, so such an optimization could help in that case as well. For example, the -1 diagonal would be constant also in the nonlinear case.

brianguenter commented 1 year ago

I'm going to convert this to a discussion item. I don't have time to address this problem right now but I also don't want it to be buried in closed issues.