Open PierreMartinon opened 3 months ago
Reminder: keep track of what has been tried in the performance Discussion
@PierreMartinon should more or less be subsumed by https://github.com/control-toolbox/CTDirect.jl/issues/188
Ha, you wish :D In the current state, most of the remaining allocations seem to be for the variables passed to the functions, ie t,x,u,v. These are basically the same for in_place / out_of_place, and are surprisingly difficult to get rid of. Trying to reuse pre-allocated vectors (with out_of_place or in_place getters) led to more allocations in practice, which was a bit confusing. Currently the getters return views of vectors, and the scalar conversion is done if needed when calling the OCP functions, ie
ocp.dynamics((@view f[1:dim_OCP_x]), t, _x(x), _u(u), v)
with
_x(x::AbstractVector) = (dim_OCP_x == 1) ? x[1] : x[1:dim_OCP_x]
_u(u::AbstractVector) = (dim_NLP_u == 1) ? u[1] : u
An annoying point is that, probably due to the very small size of our vectors, there is no consistent performance gain when using dot operators and/or views.
Right now both versions have similar allocations, and in_place seems a bit faster. Still working on it.
@PierreMartinon When you pass sth like x[1:dim_OCP_x]
you make a copy of the argument; a view must be passed to avoid this... or more simply x
itself (when you pass not a slice but the full vector - no need for a view, then). Check:
julia> f!(y, x) = begin
y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
end
f! (generic function with 1 method)
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(y[1:2], x))
end
320
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), x))
end
240
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(y[1:2], x[1:3]))
end
400
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), x[1:3]))
end
320
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), @view(x[1:3])))
end
240
Also note that more allocations can be saved using @views y[:] .= ...
inside the in place function (see https://github.com/control-toolbox/CTBase.jl/pull/271#issuecomment-2336787273); this is what is systematically done in the code generated from the abstract form:
julia> f!(y, x) = begin
@views y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
end
f! (generic function with 1 method)
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(y[1:2], x))
end
240
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), x))
end
160
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(y[1:2], x[1:3]))
end
320
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), x[1:3]))
end
240
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), @view(x[1:3])))
end
160
So as rule, the good combination seems to be:
julia> f!(y, x) = begin
@views y[:] .= [sum(cos.(x[1:2])), x[1] * x[3]]
end
f! (generic function with 1 method)
julia> let x = ones(3)
y = zeros(4)
println(@allocated f!(@view(y[1:2]), x))
end
160
Moreover, you should be able to write a code that is uniform and works both for vectors and scalars.
@amontoison any further comments on the post above?
@PierreMartinon comments on https://github.com/control-toolbox/CTDirect.jl/issues/169#issuecomment-2350620369 above?
In the example x was already a view, and adding another view in _x gives slightly worse results.
For very small sizes, views are just not better than slices, due to the cost of creating the view (which is not just a pointer).
Regarding types, we still have some unstable ones due to the unions from CTBase (mayer for instance is now a Union{Nothing,Mayer,Mayer!}), but it's hard to tell the actual impact of these.
Same with the scalar / vector case, I ended up having similar performance using either the 'converters' (_x, _u) or getters that directly return scalar or vector values.
I'll probably merge the current inplace branch once it's cleaned up a bit. Performance is slightly better than current main.
I think we can still improve the allocations in the constraints: compared to the best_allocs_trapeze
branch, introducing the midpoint method and the inplace case added some overhead (current code accepts both inplace and outplace(tm)).
We can probably use work arrays better too. Started to look at Profile.Allocs with PProf, flamegraphs are nice, however the 'source' view is not always accurate. More handy than the .mem files though.
@PierreMartinon ready to merge and test inplace code? (sorry if I missed sth!)
Hi @jbcaillau ! Almost. The branch is similar in performance to main so merging is not a regression. Still a few tests to do for the dynamics part.
In CTBase could we have the option to enable/disable inplace for abstract OCPs ?
Hi
Hi @jbcaillau ! Almost. The branch is similar in performance to main so merging is not a regression. Still a few tests to do for the dynamics part.
well, performance is supposed to be much better by reducing the number of allocations!
In CTBase could we have the option to enable/disable inplace for abstract OCPs ?
it is planned to keep only the memory efficient code, that is inplace alone. what would be te point to keep out of place?
Following our discussion this morning, a few details on the current version, more specifically the constraints.
function DOCP_constraints!(c, xu, docp::DOCP)
# initialization
if docp.has_free_t0 || docp.has_free_tf
docp.get_time_grid!(xu)
end
v = docp.get_optim_variable(xu)
work = setWorkArray(docp, xu, docp.NLP_time_grid, v)
# main loop on time steps
for i = 1:docp.dim_NLP_steps+1
setConstraintBlock!(docp, c, xu, v, docp.NLP_time_grid, i, work)
end
# point constraints
setPointConstraints!(docp, c, xu, v)
# NB. the function *needs* to return c for AD...
return c
end
Here is the loop core function, here for midpoint since it is a bit simpler than trapeze (no save/reuse of dynamics). First we have the discretized dynamics
function setConstraintBlock!(docp::DOCP{Trapeze}, c, xu, v, time_grid, i, work)
disc = docp.discretization
# offset for previous steps
offset = (i-1)*(docp.dim_NLP_x + docp.dim_path_cons)
# 0. variables
ti = time_grid[i]
xi = get_state_at_time_step(xu, docp, i)
ui = get_control_at_time_step(xu, docp, i)
#1. state equation
if i <= docp.dim_NLP_steps
# more variables
fi = copy(work) # create new copy, not just a reference
tip1 = time_grid[i+1]
xip1 = get_state_at_time_step(xu, docp, i+1)
uip1 = get_control_at_time_step(xu, docp, i+1)
docp.dynamics_ext!(work, tip1, xip1, uip1, v)
# trapeze rule with 'smart' update for dynamics (similar with @.)
c[offset+1:offset+docp.dim_NLP_x] = xip1 - (xi + 0.5 * (tip1 - ti) * (fi + work)) #+++ jet runtime dispatch here even with explicit index ranges
offset += docp.dim_NLP_x
end
followed by the path constraints (it is the same function, I split the quote for clarity).*
# 2. path constraints
# Notes on allocations:.= seems similar
if docp.dim_u_cons > 0
if docp.has_inplace
docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v)
else
c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v)
end
end
if docp.dim_x_cons > 0
if docp.has_inplace
docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v)
else
c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v)
end
end
if docp.dim_mixed_cons > 0
if docp.has_inplace
docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v)
else
c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons] = docp.mixed_constraints[2](ti, docp._x(xi), docp._u(ui), v)
end
end
end
*Here you can see it still accepts the out of place format for comparison purposes. This is also true for the dynamics part, although the encapsulating function dynamics_ext! is always inplace.
The getter for x (u is similar)
function get_state_at_time_step(xu, docp::DOCP{Trapeze}, i)
nx = docp.dim_NLP_x
m = docp.dim_NLP_u
offset = (nx + m) * (i-1)
return @view xu[(offset + 1):(offset + nx)]
end
JET gives a number of runtime dispatch alerts, unfortunately the output is not comfortable to navigate.
The getters for x's and u's seem to allocate 80 bytes for the views (a view with constant indices is 48bytes), regardless of the size. NB. @jbcaillau the size 1 view is still a subarray and is not necessarily compatible with scalar syntax in OCP functions, did I miss something ?
Copying values for slices takes 96, 112, 112, 128, 128, etc (16 bytes alignment), so Views seem always better in terms of allocations.
For whatever reason, when testing the constraints evaluation, we see a bit less allocations for the slice version (299kB vs 308kB for views) but performance is similar. The main drawback of views is that they are allocated at each getter call, while the slice version may work inplace with a reusable work array ? I'll try the variant without getters at all first to check what could be gained.
A view
should not allocate any bytes. Something looks wrong 🤔
Hi @amontoison , Testing the getters separately with btime did show 0 allocations, I'll stop using the .mem files. The source view from the profiler is more interesting:
setConstraintBlock!(::CTDirect.DOCP{CTDirect.Trapeze}, ::Vector{Float64}, ::Vector{Float64}, ::Float64, ::Vector{Any}, ::Int64, ::Vector{Float64})
/home/pierre/CTDirect.jl/src/trapeze.jl
Total: 0 7298 (flat, cum) 92.90%
89 . . ui = get_control_at_time_step(xu, docp, i)
90 . .
91 . . #1. state equation
92 . . if i <= docp.dim_NLP_steps
93 . . # more variables
94 . 100 fi = copy(work) # create new copy, not just a reference
95 . . tip1 = time_grid[i+1]
96 . . xip1 = get_state_at_time_step(xu, docp, i+1)
97 . . uip1 = get_control_at_time_step(xu, docp, i+1)
98 . 1300 docp.dynamics_ext!(work, tip1, xip1, uip1, v)
99 . .
100 . . # trapeze rule with 'smart' update for dynamics (similar with @.)
101 . 800 c[offset+1:offset+docp.dim_NLP_x] = xip1 - (xi + 0.5 * (tip1 - ti) * (fi + work)) #+++ jet runtime dispatch here even with explicit index ranges
102 . . offset += docp.dim_NLP_x
103 . . end
104 . .
105 . . # 2. path constraints
106 . . # Notes on allocations:.= seems similar
107 . . if docp.dim_u_cons > 0
108 . . if docp.has_inplace
109 . . docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v)
110 . . else
111 . 1632 c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v)
112 . . end
113 . . end
114 . . if docp.dim_x_cons > 0
115 . . if docp.has_inplace
116 . . docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v)
117 . . else
118 . 1531 c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v)
119 . . end
120 . . end
121 . . if docp.dim_mixed_cons > 0
122 . . if docp.has_inplace
123 . . docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v)
124 . . else
125 . 1935 c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons] = docp.mixed_constraints[2](ti, docp._x(xi), docp._u(ui), v)
126 . . end
127 . . end
128 . .
129 . . end
Inplace version, the constraints still allocate apparently:
107 . . if docp.has_inplace
108 . 1515 docp.control_constraints[2]((@view c[offset+1:offset+docp.dim_u_cons]),ti, docp._u(ui), v)
109 . . else
110 . . c[offset+1:offset+docp.dim_u_cons] = docp.control_constraints[2](ti, docp._u(ui), v)
111 . . end
112 . . end
113 . . if docp.dim_x_cons > 0
114 . . if docp.has_inplace
115 . 1313 docp.state_constraints[2]((@view c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons]),ti, docp._x(xi), v)
116 . . else
117 . . c[offset+docp.dim_u_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons] = docp.state_constraints[2](ti, docp._x(xi), v)
118 . . end
119 . . end
120 . . if docp.dim_mixed_cons > 0
121 . . if docp.has_inplace
122 . 1717 docp.mixed_constraints[2]((@view c[offset+docp.dim_u_cons+docp.dim_x_cons+1:offset+docp.dim_u_cons+docp.dim_x_cons+docp.dim_mixed_cons]), ti, docp._x(xi), docp._u(ui), v)
123 . . else
The last one goes to
207 . 101 function ψ!(val, t, x, u, v) # nonlinear mixed constraints (in place)
208 . . j = 1
209 . 101 for i ∈ 1:length(ψf)
210 . . li = ψs[i]
211 . 909 ψf[i](@view(val[j:(j + li - 1)]), t, x, u, v)
212 . . j = j + li
213 . . end
214 . . return nothing
215 . . end
I defined the constraint for the ocp as
constraint!(goddard, :mixed, f = m_fun!, lb = mf, ub = Inf)
with
function m_fun!(c, x, u, v)
@views c[:] .= x[3]
return
end
(basic c[1] = x[3] is similar)
Did you try @code_warntype
? I suspect a type instability.
I think so too. Warntype was mostly ok on the constraints but it did not descend into the subfunctions. JET gave runtime dispatch at all OCP function calls, but I find the output harder to read. I'll try warntype on the subfunctions, thanks !
Some progress. Fixed a type instability on the times, and got rid of the scalar/vector Union for the state/control/optim variables. Reduction of allocations is not as big as hoped though, still looking into it.
Update: getters for t,x,u,v are now type-stable and show 0 allocations. For the times we split the Union and use 2 distinct variables for fixed time vs free time index. For the scalar/vector x,u,v we now have 3 parametric types in DOCP and the getters dispatch along these types. Each getter only has 2 methods (scalar / vector) for its own variable, regardless of the others.
The vast majority of the remaining allocations occurs when calling OCP functions: mayer, dynamics, lagrange, and all constraints. JET indicates runtime dispatch at all these calls, and code_warntype flags them as 'Any'. So here we are :D
I'll start with mayer.
@PierreMartinon thanks for the update (and @amontoison for the feedback):
@PierreMartinon thanks for the update (and @amontoison for the feedback):
* good to see using views does not allocate (!) * so did splitting unions and using dispatch suppressed type instabilities?
Yes it certainly seems so. Older code used single methods with if
statements and possibly different/ambiguous return types. Current code uses
initial.time
and final.time
abstract type ScalVect end
struct ScalVariable <: ScalVect end
struct VectVariable <: ScalVect end
...
struct DOCP{T <: Discretization, X <: ScalVect, U <: ScalVect, V <: ScalVect}
...
function get_OCP_state_at_time_step(xu, docp::DOCP{Trapeze, ScalVariable, <: ScalVect, <: ScalVect}, i)
offset = (i-1) * (docp.dim_NLP_x + docp.dim_NLP_u)
return xu[offset+1]
end
function get_OCP_state_at_time_step(xu, docp::DOCP{Trapeze, VectVariable, <: ScalVect, <: ScalVect}, i)
offset = (i-1) * (docp.dim_NLP_x + docp.dim_NLP_u)
return @view xu[(offset + 1):(offset + docp.dim_OCP_x)]
end
* what about the ocp primitives you mentioned (mayer and so)? * at the moment, we have union of vectors and scalars everywhere in ocp...
See below. Maybe we'll add some post-processing in the OCP to define functions with more fixed types ?
Looking at the mayer cost which is one of the simpler calls, and also the main part of the objective. 'Local Mayer' is the (inplace) locally defined version
function local_mayer(obj, x0, xf, v)
obj[1] = xf[3]
return
end
while 'OCP Mayer' refers to docp.ocp.mayer
from CTBase. The call is done as follows (with $ for btime calls)
xu = CTDirect.DOCP_initial_guess(docp)
x0 = CTDirect.get_OCP_state_at_time_step(xu, docp, 1)
xf = CTDirect.get_OCP_state_at_time_step(xu, docp, N+1)
v = CTDirect.get_OCP_variable(xu, docp)
obj = similar(xu,1)
docp.ocp.mayer(obj, x0, xf, v)
We see that the local Mayer does not allocate, either with hardcoded views / scalar arguments or the new getters (while old getters added some allocations there). However, the OCP Mayer has some allocations, which I'm trying to track.
julia> test_unit(;test_mayer=true, warntype=true, jet=true, profile=true)
Local Mayer: views for x0/xf and scalar v 4.474 ns (0 allocations: 0 bytes)
Local Mayer: param scal/vec getters 2.467 ns (0 allocations: 0 bytes)
OCP Mayer: param scal/vec getters 28.047 ns (3 allocations: 112 bytes)
Code_warntype looks clean except for the F.f! function that is tagged as 'Any'.
MethodInstance for (::Mayer!{NonFixed})(::Vector{Float64}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::Float64)
from (F::Mayer!{NonFixed})(r::Union{Real, AbstractVector{<:Real}}, x0::Union{Real, AbstractVector{<:Real}}, xf::Union{Real, AbstractVector{<:Real}}, v::Union{Real, AbstractVector{<:Real}}) @ CTBase ~/.julia/packages/CTBase/0TgT4/src/functions.jl:179
Arguments
F::Mayer!{NonFixed}
r::Vector{Float64}
x0::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
xf::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}
v::Float64
Locals
@_6::Any
Body::Nothing
1 ─ %1 = CTBase.Nothing::Core.Const(Nothing)
│ %2 = Base.getproperty(F, :f!)::Function
│ %3 = (%2)(r, x0, xf, v)::Any
│ (@_6 = %3)
│ %5 = (@_6 isa %1)::Bool
└── goto #3 if not %5
2 ─ goto #4
3 ─ %8 = Base.convert(%1, @_6)::Core.Const(nothing)
└── (@_6 = Core.typeassert(%8, %1))
4 ┄ return @_6::Nothing
JET indicates 2 runtime dispatches for the call to F.f!, the second one seems related to the return type ?
═════ 2 possible errors found ═════
┌ (::Mayer!{…})(r::Vector{…}, x0::SubArray{…}, xf::SubArray{…}, v::Float64) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
│ runtime dispatch detected: %1::Function(r::Vector{Float64}, x0::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, xf::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, v::Float64)::Any
└────────────────────
┌ (::Mayer!{…})(r::Vector{…}, x0::SubArray{…}, xf::SubArray{…}, v::Float64) @ CTBase /home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl:180
│ runtime dispatch detected: convert(CTBase.Nothing, %2::Any)::Nothing
└────────────────────
Profile locates the 3 allocations at the f! call, and below that the trace looks very low level.
(::Mayer!{NonFixed})(::Vector{Float64}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::SubArray{Float64, 1, Vector{Float64}, Tuple{UnitRange{Int64}}, true}, ::Float64)
/home/pierre/.julia/packages/CTBase/0TgT4/src/functions.jl
179 . . function (F::Mayer!{NonFixed})(r::ctVector, x0::State, xf::State, v::Variable)::Nothing
180 . 3 return F.f!(r, x0, xf, v)
181 . . end
@jbcaillau @ocots Question: State
is currently an union of scalar / vector, would it be possible, when building the ocp, to define State
only after reading the state dimension, so it would be either scalar or vector, but not both ? Iirc the dimensions of the problems are supposed to be declared before the functions when using the def
macro. Same for Control
and Variable
.
The vast majority of the remaining allocations occurs when calling OCP functions: mayer, dynamics, lagrange, and all constraints. JET indicates runtime dispatch at all these calls, and code_warntype flags them as 'Any'. So here we are :D
This confirms what we saw yesterday with @ocots. Your main problem everywhere (here and in CTBase) is fields with abstract types.
For more performance tips, as well as a faster way of profiling (@profview
and @profview_allocs
), check out my Julia & Optimization Days tutorial: https://gdalle.github.io/JuliaOptimizationDays2024-FastJulia/
Todo: check the .mem files thingy (julia --track-allocation=user). Also recheck code_warntype for type unstabilities.