Redo a pass on memory allocations

PierreMartinon commented 3 months ago

Todo: check the .mem files thingy (julia --track-allocation=user). Also recheck code_warntype for type unstabilities.

PierreMartinon commented 2 months ago

Reminder: keep track of what has been tried in the performance Discussion

jbcaillau commented 1 month ago

@PierreMartinon should more or less be subsumed by https://github.com/control-toolbox/CTDirect.jl/issues/188

PierreMartinon commented 1 month ago

Ha, you wish :D In the current state, most of the remaining allocations seem to be for the variables passed to the functions, ie t,x,u,v. These are basically the same for in_place / out_of_place, and are surprisingly difficult to get rid of. Trying to reuse pre-allocated vectors (with out_of_place or in_place getters) led to more allocations in practice, which was a bit confusing. Currently the getters return views of vectors, and the scalar conversion is done if needed when calling the OCP functions, ie

ocp.dynamics((@view f[1:dim_OCP_x]), t, _x(x), _u(u), v)

with

_x(x::AbstractVector) = (dim_OCP_x == 1) ? x[1] : x[1:dim_OCP_x]
_u(u::AbstractVector) = (dim_NLP_u == 1) ? u[1] : u

An annoying point is that, probably due to the very small size of our vectors, there is no consistent performance gain when using dot operators and/or views.

Right now both versions have similar allocations, and in_place seems a bit faster. Still working on it.

jbcaillau commented 1 month ago

@PierreMartinon When you pass sth like x[1:dim_OCP_x] you make a copy of the argument; a view must be passed to avoid this... or more simply x itself (when you pass not a slice but the full vector - no need for a view, then). Check:

julia> f!(y, x) = begin
       y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
       end
f! (generic function with 1 method)

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x))
       end
320

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x))
       end
240

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x[1:3]))
       end
400

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x[1:3]))
       end
320

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), @view(x[1:3])))
       end
240

Also note that more allocations can be saved using @views y[:] .= ... inside the in place function (see https://github.com/control-toolbox/CTBase.jl/pull/271#issuecomment-2336787273); this is what is systematically done in the code generated from the abstract form:

julia> f!(y, x) = begin
       @views y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
       end
f! (generic function with 1 method)

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x))
       end
240

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x))
       end
160

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(y[1:2], x[1:3]))
       end
320

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x[1:3]))
       end
240

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), @view(x[1:3])))
       end
160

So as rule, the good combination seems to be:

julia> f!(y, x) = begin
       @views y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
       end
f! (generic function with 1 method)

julia> let x = ones(3)
       y = zeros(4)
       println(@allocated f!(@view(y[1:2]), x))
       end
160

Moreover, you should be able to write a code that is uniform and works both for vectors and scalars.

jbcaillau commented 1 month ago

@amontoison any further comments on the post above?

jbcaillau commented 1 month ago

@PierreMartinon comments on https://github.com/control-toolbox/CTDirect.jl/issues/169#issuecomment-2350620369 above?

PierreMartinon commented 1 month ago

In the example x was already a view, and adding another view in _x gives slightly worse results.

For very small sizes, views are just not better than slices, due to the cost of creating the view (which is not just a pointer).

Regarding types, we still have some unstable ones due to the unions from CTBase (mayer for instance is now a Union{Nothing,Mayer,Mayer!}), but it's hard to tell the actual impact of these.

Same with the scalar / vector case, I ended up having similar performance using either the 'converters' (_x, _u) or getters that directly return scalar or vector values.

I'll probably merge the current inplace branch once it's cleaned up a bit. Performance is slightly better than current main.

PierreMartinon commented 1 month ago

I think we can still improve the allocations in the constraints: compared to the best_allocs_trapeze branch, introducing the midpoint method and the inplace case added some overhead (current code accepts both inplace and outplace(tm)).

We can probably use work arrays better too. Started to look at Profile.Allocs with PProf, flamegraphs are nice, however the 'source' view is not always accurate. More handy than the .mem files though.

jbcaillau commented 1 month ago

@PierreMartinon ready to merge and test inplace code? (sorry if I missed sth!)

PierreMartinon commented 1 month ago

Hi @jbcaillau ! Almost. The branch is similar in performance to main so merging is not a regression. Still a few tests to do for the dynamics part.

In CTBase could we have the option to enable/disable inplace for abstract OCPs ?

jbcaillau commented 1 month ago

Hi

Hi @jbcaillau ! Almost. The branch is similar in performance to main so merging is not a regression. Still a few tests to do for the dynamics part.

well, performance is supposed to be much better by reducing the number of allocations!

In CTBase could we have the option to enable/disable inplace for abstract OCPs ?

it is planned to keep only the memory efficient code, that is inplace alone. what would be te point to keep out of place?

PierreMartinon commented 1 month ago

Following our discussion this morning, a few details on the current version, more specifically the constraints.

function DOCP_constraints!(c, xu, docp::DOCP)

    # initialization
    if docp.has_free_t0 || docp.has_free_tf
        docp.get_time_grid!(xu)
    end
    v = docp.get_optim_variable(xu)
    work = setWorkArray(docp, xu, docp.NLP_time_grid, v)

    # main loop on time steps 
    for i = 1:docp.dim_NLP_steps+1
        setConstraintBlock!(docp, c, xu, v, docp.NLP_time_grid, i, work)
    end

    # point constraints
    setPointConstraints!(docp, c, xu, v)

    # NB. the function *needs* to return c for AD...
    return c
end

Here is the loop core function, here for midpoint since it is a bit simpler than trapeze (no save/reuse of dynamics). First we have the discretized dynamics

function setConstraintBlock!(docp::DOCP{Trapeze}, c, xu, v, time_grid, i, work)

    disc = docp.discretization

    # offset for previous steps
    offset = (i-1)*(docp.dim_NLP_x + docp.dim_path_cons)

    # 0. variables
    ti = time_grid[i]
    xi = get_state_at_time_step(xu, docp, i)
    ui = get_control_at_time_step(xu, docp, i)

    #1. state equation
    if i <= docp.dim_NLP_steps
        # more variables
        fi = copy(work) # create new copy, not just a reference
        tip1 = time_grid[i+1]
        xip1 = get_state_at_time_step(xu, docp, i+1)
        uip1 = get_control_at_time_step(xu, docp, i+1)
        docp.dynamics_ext!(work, tip1, xip1, uip1, v)

        # trapeze rule with 'smart' update for dynamics (similar with @.)
        c[offset+1:offset+docp.dim_NLP_x] = xip1 - (xi + 0.5 * (tip1 - ti) * (fi + work)) #+++ jet runtime dispatch here even with explicit index ranges
        offset += docp.dim_NLP_x
    end

followed by the path constraints (it is the same function, I split the quote for clarity).*


    # 2. path constraints
    # Notes on allocations:.= seems similar
    if docp.dim_u_cons > 0
        if docp.has_inplace
            docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v)
        else
            c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v)
        end
    end
    if docp.dim_x_cons > 0 
        if docp.has_inplace
            docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v)
        else
            c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v)
        end
    end
    if docp.dim_mixed_cons > 0 
        if docp.has_inplace
            docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v)
        else
            c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons] = docp.mixed_constraints[2](ti, docp._x(xi), docp._u(ui), v)
        end
    end

end

*Here you can see it still accepts the out of place format for comparison purposes. This is also true for the dynamics part, although the encapsulating function dynamics_ext! is always inplace.

The getter for x (u is similar)

function get_state_at_time_step(xu, docp::DOCP{Trapeze}, i)
    nx = docp.dim_NLP_x
    m = docp.dim_NLP_u
    offset = (nx + m) * (i-1)
    return @view xu[(offset + 1):(offset + nx)]
end

JET gives a number of runtime dispatch alerts, unfortunately the output is not comfortable to navigate.

PierreMartinon commented 1 month ago

The getters for x's and u's seem to allocate 80 bytes for the views (a view with constant indices is 48bytes), regardless of the size. NB. @jbcaillau the size 1 view is still a subarray and is not necessarily compatible with scalar syntax in OCP functions, did I miss something ?

Copying values for slices takes 96, 112, 112, 128, 128, etc (16 bytes alignment), so Views seem always better in terms of allocations.

For whatever reason, when testing the constraints evaluation, we see a bit less allocations for the slice version (299kB vs 308kB for views) but performance is similar. The main drawback of views is that they are allocated at each getter call, while the slice version may work inplace with a reusable work array ? I'll try the variant without getters at all first to check what could be gained.

amontoison commented 1 month ago

A view should not allocate any bytes. Something looks wrong 🤔

PierreMartinon commented 1 month ago

Hi @amontoison , Testing the getters separately with btime did show 0 allocations, I'll stop using the .mem files. The source view from the profiler is more interesting:

setConstraintBlock!(::CTDirect.DOCP{CTDirect.Trapeze}, ::Vector{Float64}, ::Vector{Float64}, ::Float64, ::Vector{Any}, ::Int64, ::Vector{Float64})

/home/pierre/CTDirect.jl/src/trapeze.jl

  Total:           0       7298 (flat, cum) 92.90%
     89            .          .               ui = get_control_at_time_step(xu, docp, i) 
     90            .          .            
     91            .          .               #1. state equation 
     92            .          .               if i <= docp.dim_NLP_steps 
     93            .          .                   # more variables 
     94            .        100                   fi = copy(work) # create new copy, not just a reference 
     95            .          .                   tip1 = time_grid[i+1] 
     96            .          .                   xip1 = get_state_at_time_step(xu, docp, i+1) 
     97            .          .                   uip1 = get_control_at_time_step(xu, docp, i+1) 
     98            .       1300                   docp.dynamics_ext!(work, tip1, xip1, uip1, v) 
     99            .          .            
    100            .          .                   # trapeze rule with 'smart' update for dynamics (similar with @.) 
    101            .        800                   c[offset+1:offset+docp.dim_NLP_x] = xip1 - (xi + 0.5 * (tip1 - ti) * (fi + work)) #+++ jet runtime dispatch here even with explicit index ranges 
    102            .          .                   offset += docp.dim_NLP_x 
    103            .          .               end 
    104            .          .            
    105            .          .               # 2. path constraints 
    106            .          .               # Notes on allocations:.= seems similar 
    107            .          .               if docp.dim_u_cons > 0 
    108            .          .                   if docp.has_inplace 
    109            .          .                       docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v) 
    110            .          .                   else 
    111            .       1632                       c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v) 
    112            .          .                   end 
    113            .          .               end 
    114            .          .               if docp.dim_x_cons > 0  
    115            .          .                   if docp.has_inplace 
    116            .          .                       docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v) 
    117            .          .                   else 
    118            .       1531                       c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v) 
    119            .          .                   end 
    120            .          .               end 
    121            .          .               if docp.dim_mixed_cons > 0  
    122            .          .                   if docp.has_inplace 
    123            .          .                       docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v) 
    124            .          .                   else 
    125            .       1935                       c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons] = docp.mixed_constraints[2](ti, docp._x(xi), docp._u(ui), v) 
    126            .          .                   end 
    127            .          .               end 
    128            .          .            
    129            .          .           end

PierreMartinon commented 1 month ago

Inplace version, the constraints still allocate apparently:

    107            .          .                   if docp.has_inplace 
    108            .       1515                       docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v) 
    109            .          .                   else 
    110            .          .                       c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v) 
    111            .          .                   end 
    112            .          .               end 
    113            .          .               if docp.dim_x_cons > 0  
    114            .          .                   if docp.has_inplace 
    115            .       1313                       docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v) 
    116            .          .                   else 
    117            .          .                       c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v) 
    118            .          .                   end 
    119            .          .               end 
    120            .          .               if docp.dim_mixed_cons > 0  
    121            .          .                   if docp.has_inplace 
    122            .       1717                       docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v) 
    123            .          .                   else

The last one goes to

    207            .        101               function ψ!(val, t, x, u, v) # nonlinear mixed constraints (in place) 
    208            .          .                   j = 1 
    209            .        101                   for i ∈ 1:length(ψf) 
    210            .          .                       li = ψs[i] 
    211            .        909                       ψf[i](@view(val[j:(j + li - 1)]), t, x, u, v) 
    212            .          .                       j = j + li 
    213            .          .                   end 
    214            .          .                   return nothing 
    215            .          .               end

I defined the constraint for the ocp as

constraint!(goddard, :mixed, f = m_fun!, lb = mf, ub = Inf)

with

    function m_fun!(c, x, u, v) 
        @views c[:] .= x[3]
        return
    end

(basic c[1] = x[3] is similar)

amontoison commented 1 month ago

Did you try @code_warntype? I suspect a type instability.

PierreMartinon commented 1 month ago

I think so too. Warntype was mostly ok on the constraints but it did not descend into the subfunctions. JET gave runtime dispatch at all OCP function calls, but I find the output harder to read. I'll try warntype on the subfunctions, thanks !

PierreMartinon commented 1 month ago

Some progress. Fixed a type instability on the times, and got rid of the scalar/vector Union for the state/control/optim variables. Reduction of allocations is not as big as hoped though, still looking into it.

Update: getters for t,x,u,v are now type-stable and show 0 allocations. For the times we split the Union and use 2 distinct variables for fixed time vs free time index. For the scalar/vector x,u,v we now have 3 parametric types in DOCP and the getters dispatch along these types. Each getter only has 2 methods (scalar / vector) for its own variable, regardless of the others.

The vast majority of the remaining allocations occurs when calling OCP functions: mayer, dynamics, lagrange, and all constraints. JET indicates runtime dispatch at all these calls, and code_warntype flags them as 'Any'. So here we are :D

I'll start with mayer.

jbcaillau commented 1 month ago

@PierreMartinon thanks for the update (and @amontoison for the feedback):

good to see using views does not allocate (!)
so did splitting unions and using dispatch suppressed type instabilities?
what about the ocp primitives you mentioned (mayer and so)?
at the moment, we have union of vectors and scalars everywhere in ocp...

PierreMartinon commented 1 month ago

@PierreMartinon thanks for the update (and @amontoison for the feedback):
* good to see using views does not allocate (!)

* so did splitting unions and using dispatch suppressed type instabilities?

Yes it certainly seems so. Older code used single methods with if statements and possibly different/ambiguous return types. Current code uses

two different variables for value/index in the case of initial.time and final.time

for the getters (x,u,v), DOCP is now parametrized by 3 indicators corresponding to scalar / vector state, control and variable. Getters have 2 methods each. There may be better ways to do this, but it seems to do the job. Fortunately there is no combinatorial aspect on the different types: each getter only cares about its own type.

abstract type ScalVect end
struct ScalVariable <: ScalVect end
struct VectVariable <: ScalVect end
...
struct DOCP{T <: Discretization, X <: ScalVect, U <: ScalVect, V <: ScalVect}
...
function get_OCP_state_at_time_step(xu, docp::DOCP{Trapeze, ScalVariable, <: ScalVect, <: ScalVect}, i)
offset = (i-1) * (docp.dim_NLP_x + docp.dim_NLP_u)
return xu[offset+1]
end
function get_OCP_state_at_time_step(xu, docp::DOCP{Trapeze, VectVariable, <: ScalVect, <: ScalVect}, i)
offset = (i-1) * (docp.dim_NLP_x + docp.dim_NLP_u)
return @view xu[(offset + 1):(offset + docp.dim_OCP_x)]
end

* what about the ocp primitives you mentioned (mayer and so)?

* at the moment, we have union of vectors and scalars everywhere in ocp...

See below. Maybe we'll add some post-processing in the OCP to define functions with more fixed types ?

PierreMartinon commented 1 month ago

Looking at the mayer cost which is one of the simpler calls, and also the main part of the objective. 'Local Mayer' is the (inplace) locally defined version

function local_mayer(obj, x0, xf, v)
    obj[1] = xf[3]
    return
end

while 'OCP Mayer' refers to docp.ocp.mayer from CTBase. The call is done as follows (with $ for btime calls)

xu = CTDirect.DOCP_initial_guess(docp)
x0 = CTDirect.get_OCP_state_at_time_step(xu, docp, 1)
xf = CTDirect.get_OCP_state_at_time_step(xu, docp, N+1)
v = CTDirect.get_OCP_variable(xu, docp)
obj = similar(xu,1)
docp.ocp.mayer(obj, x0, xf, v)

We see that the local Mayer does not allocate, either with hardcoded views / scalar arguments or the new getters (while old getters added some allocations there). However, the OCP Mayer has some allocations, which I'm trying to track.

julia> test_unit(;test_mayer=true, warntype=true, jet=true, profile=true)

Local Mayer: views for x0/xf and scalar v  4.474 ns (0 allocations: 0 bytes)
Local Mayer: param scal/vec getters  2.467 ns (0 allocations: 0 bytes)
OCP Mayer: param scal/vec getters  28.047 ns (3 allocations: 112 bytes)

Code_warntype looks clean except for the F.f! function that is tagged as 'Any'.

MethodInstance for (::Mayer!{NonFixed})(::Vector{Float64}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::Float64)
  from (F::Mayer!{NonFixed})(r::Union{Real, AbstractVector{<:Real}}, x0::Union{Real, AbstractVector{<:Real}}, xf::Union{Real, AbstractVector{<:Real}}, v::Union{Real, AbstractVector{<:Real}}) @ CTBase ~/.julia/packages/CTBase/0TgT4/src/functions.jl:179
Arguments
  F::Mayer!{NonFixed}
  r::Vector{Float64}
  x0::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
  xf::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
  v::Float64
Locals
  @_6::Any
Body::Nothing
1 ─ %1 = CTBase.Nothing::Core.Const(Nothing)
│   %2 = Base.getproperty(F, :f!)::Function
│   %3 = (%2)(r, x0, xf, v)::Any
│        (@_6 = %3)
│   %5 = (@_6 isa %1)::Bool
└──      goto #3 if not %5
2 ─      goto #4
3 ─ %8 = Base.convert(%1, @_6)::Core.Const(nothing)
└──      (@_6 = Core.typeassert(%8, %1))
4 ┄      return @_6::Nothing

JET indicates 2 runtime dispatches for the call to F.f!, the second one seems related to the return type ?

═════ 2 possible errors found ═════
┌ (::Mayer!{…})(r::Vector{…}, x0::SubArray{…}, xf::SubArray{…}, v::Float64) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
│ runtime dispatch detected: %1::Function(r::Vector{Float64}, x0::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, xf::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::Float64)::Any
└────────────────────
┌ (::Mayer!{…})(r::Vector{…}, x0::SubArray{…}, xf::SubArray{…}, v::Float64) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
│ runtime dispatch detected: convert(CTBase.Nothing, %2::Any)::Nothing
└────────────────────

Profile locates the 3 allocations at the f! call, and below that the trace looks very low level.

(::Mayer!{NonFixed})(::Vector{Float64}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::Float64)

/home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl

    179            .          .           function (F::Mayer!{NonFixed})(r::ctVector, x0::State, xf::State, v::Variable)::Nothing 
    180            .          3               return F.f!(r, x0, xf, v) 
    181            .          .           end

PierreMartinon commented 1 month ago

@jbcaillau @ocots Question: State is currently an union of scalar / vector, would it be possible, when building the ocp, to define State only after reading the state dimension, so it would be either scalar or vector, but not both ? Iirc the dimensions of the problems are supposed to be declared before the functions when using the def macro. Same for Control and Variable.

gdalle commented 3 days ago

The vast majority of the remaining allocations occurs when calling OCP functions: mayer, dynamics, lagrange, and all constraints. JET indicates runtime dispatch at all these calls, and code_warntype flags them as 'Any'. So here we are :D

This confirms what we saw yesterday with @ocots. Your main problem everywhere (here and in CTBase) is fields with abstract types.

For more performance tips, as well as a faster way of profiling (@profview and @profview_allocs), check out my Julia & Optimization Days tutorial: https://gdalle.github.io/JuliaOptimizationDays2024-FastJulia/

control-toolbox / CTDirect.jl

Redo a pass on memory allocations #169