mmesiti opened 10 months ago
(Copied from the last post of the issue in the old repo)
Also, since Julia 1.10 is now based on LLVM 17 (IIRC), it should be possible to tell LLVM to try and avoid gather instructions (see the relevant thread on the Julia discourse).
For reference, here's the post:

> I suppose the only way is to document that users of Intel CPUs should disable gather? I don't know if there can be a foolproof way to do it. Ideally this would be done in LoopVectorization, but that seems unlikely, since gather might not always be the slower option.
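For what it's worth, if the LLVM shipped with a given Julia build knows the x86 tuning feature `prefer-no-gather` (added to LLVM in response to Downfall; whether a particular build exposes it is an assumption that would need verifying against the discourse thread), gathers could be opted out of at startup with something like:

```
julia -C "native,prefer-no-gather"
```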
Btw, what happens if one removes `@turbo` altogether?
> Btw, what happens if one removes `@turbo` altogether?
We might test that, but I think that the Downfall mitigation does not affect all vector instructions, only the vgather ones.
@NilsNiggemann I finally have some results, run again for:

```julia
S = getCubic(18, [1.])
Par = Params(S, N = 50, T = 0.4, accuracy = 1e-9)
SolveFRG(Par, method = VCABM())
```
Running on 1 node, 76 cores:
```
 ────────────────────────────────────────────────────────────────────────────────
                                         Time                    Allocations
                                ───────────────────────   ────────────────────────
       Tot / % measured:            1.31h /  100.0%           22.6GiB /  95.8%

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────
 total solver                1    1.31h  100.0%   1.31h   21.7GiB  100.0%  21.7GiB
   getDeriv!               278    1.02h   77.6%   13.2s   0.99GiB    4.6%  3.65MiB
     getXBubble!           278    1.01h   77.3%   13.1s    928MiB    4.2%  3.34MiB
       partition           278    1.01h   77.3%   13.1s    927MiB    4.2%  3.33MiB
       communication       278   29.2ms    0.0%   105μs   1.19MiB    0.0%  4.38KiB
       get_ranges          278   5.82ms    0.0%  20.9μs    300KiB    0.0%  1.08KiB
     workspace             278    5.82s    0.1%  20.9ms     0.00B    0.0%    0.00B
     addToVertexFro...     278    5.65s    0.1%  20.3ms   12.1MiB    0.1%  44.6KiB
     symmetrizeBubble!     278    2.54s    0.1%  9.13ms   24.2MiB    0.1%  89.2KiB
     get_Self_Energy!      278    371ms    0.0%  1.34ms   33.4MiB    0.2%   123KiB
     symmetrizeVertex!     278    194ms    0.0%   698μs   15.3MiB    0.1%  56.5KiB
     getDFint!             278   1.65ms    0.0%  5.94μs     0.00B    0.0%    0.00B
     setup                 278    841μs    0.0%  3.03μs     0.00B    0.0%    0.00B
   bareOutput              136    23.6s    0.5%   173ms   84.9MiB    0.4%   639KiB
   get_observables         178    5.14s    0.1%  28.9ms   9.25MiB    0.0%  53.2KiB
 ────────────────────────────────────────────────────────────────────────────────
```
Running on 2 nodes, 76 cores each:
```
 ────────────────────────────────────────────────────────────────────────────────
                                         Time                    Allocations
                                ───────────────────────   ────────────────────────
       Tot / % measured:            2973s /  99.9%            22.1GiB /  95.7%

 Section                ncalls     time    %tot     avg     alloc    %tot      avg
 ────────────────────────────────────────────────────────────────────────────────
 total solver                1    2971s  100.0%   2971s   21.2GiB  100.0%  21.2GiB
   getDeriv!               278    1921s   64.6%   6.91s    551MiB    2.5%  1.98MiB
     getXBubble!           278    1906s   64.2%   6.86s    465MiB    2.1%  1.67MiB
       partition           278    1870s   62.9%   6.73s    462MiB    2.1%  1.66MiB
       communication       278    36.7s    1.2%   132ms   2.38MiB    0.0%  8.75KiB
       get_ranges          278   8.06ms    0.0%  29.0μs    917KiB    0.0%  3.30KiB
     addToVertexFro...     278    5.90s    0.2%  21.2ms   12.1MiB    0.1%  44.6KiB
     workspace             278    5.62s    0.2%  20.2ms     0.00B    0.0%    0.00B
     symmetrizeBubble!     278    2.38s    0.1%  8.56ms   24.2MiB    0.1%  89.2KiB
     get_Self_Energy!      278    385ms    0.0%  1.38ms   33.4MiB    0.2%   123KiB
     symmetrizeVertex!     278    190ms    0.0%   683μs   15.3MiB    0.1%  56.5KiB
     getDFint!             278   1.42ms    0.0%  5.10μs     0.00B    0.0%    0.00B
     setup                 278    667μs    0.0%  2.40μs     0.00B    0.0%    0.00B
   bareOutput              136    24.8s    0.8%   183ms   84.9MiB    0.4%   639KiB
   get_observables         178    4.93s    0.2%  27.7ms   9.25MiB    0.0%  53.2KiB
 ────────────────────────────────────────────────────────────────────────────────
```
Some considerations:
- The time spent in communication is (even with `--nodes=2`) negligible.

I will profile this large case to get more understanding.
A clarification on this:

> The problem of the MPI approach is that this part cannot scale.

It could scale if we also split the State object between ranks. But I still need to understand what is happening outside our client code and inside `solve()`, and why it is so large.
I did try to profile `SolveFRG` using `@profile`, but there are some additional complications. For example, the ratio of the number of samples collected for `getXBubble!` to the number of samples for the whole `SolveFRG` does not actually match the ratio of the times measured with `TimerOutputs`, but I believe this is because the multithreaded part for some reason sits somewhere else in the call tree (I guess most of the work will then be done under a function like "run task" or something similar).
I expected that for large `N` the time spent in `getXBubble!` would rise somewhat rapidly, while the time in the solver would rise much more slowly. These are some data for
```julia
System = getSquareLattice(NLen, couplings)
[...]
Par = Params( # create a group of all parameters to pass them to the FRG solver
    System,    # geometry, this is always required
    OneLoop(), # method. OneLoop() is the default
    T = 0.5,   # temperature for the simulation
    N = N,     # number of positive Matsubara frequencies for the four-point vertex
    accuracy = 1e-3, # absolute and relative tolerance of the ODE solver
    # For further optional arguments, see documentation of 'NumericalParams'
    MinimalOutput = true,
)
[...]
@time Solution, saved_values = SolveFRG(
    Par,
    UseMPI(),
    MainFile = mainFile,
    CheckpointDirectory = flowpath,
    method = DP5(thread = OrdinaryDiffEq.True()),
    VertexCheckpoints = [],
    CheckPointSteps = 3,
);
```
at various values of N:
- N=20:
  ```
  total solver     1    11.0s  100.0%  11.0s   237MiB  100.0%   237MiB
    getXBubble!   55    8.06s   73.6%  147ms  29.7MiB   12.5%   552KiB
  ```
- N=30:
  ```
  total solver     1    45.5s  100.0%  45.5s   691MiB  100.0%   691MiB
    getXBubble!   55    36.7s   80.8%  668ms  66.5MiB    9.6%  1.21MiB
  ```
- N=40:
  ```
  total solver     1     119s  100.0%   119s  1.50GiB  100.0%  1.50GiB
    getXBubble!   49    99.2s   83.2%  2.02s   104MiB    6.8%  2.13MiB
  ```
- N=50:
  ```
  total solver     1     337s  100.0%   337s  2.91GiB  100.0%  2.91GiB
    getXBubble!   61     296s   88.0%  4.86s   204MiB    6.8%  3.34MiB
  ```
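As a quick sanity check of the scaling, here is a back-of-the-envelope fit in Julia using only the figures above (nothing here comes from an actual run, it is just arithmetic on the reported numbers):

```julia
# Per-call times from the TimerOutputs summaries above (seconds).
N      = [20, 30, 40, 50]
ncalls = [55, 55, 49, 61]
total  = [11.0, 45.5, 119.0, 337.0]
bubble = [8.06, 36.7, 99.2, 296.0]

per_call_bubble   = bubble ./ ncalls            # ≈ 0.15, 0.67, 2.0, 4.9 s
per_call_residual = (total .- bubble) ./ ncalls # ≈ 0.05, 0.16, 0.40, 0.67 s

# Effective exponent between consecutive N values: log(t2/t1) / log(N2/N1)
exponents(t) = [log(t[i+1] / t[i]) / log(N[i+1] / N[i]) for i in 1:3]
println("getXBubble! per call: ", exponents(per_call_bubble))   # ≈ 3.7–3.9
println("residual per call:    ", exponents(per_call_residual)) # ≈ 2.3–3.2
```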
The time for one application of `getXBubble!` scales approximately as N^4, while the residual time in the solver per application of `getXBubble!` scales approximately as N^3. This is nothing new. We still see, though, that a considerable amount of residual time is spent in the solver. There are two possible ways forward to allow good MPI scalability:
My best guess is still that the culprit is the Hermite interpolation of the solution used to compute observables.
What happens if we change this line to `callback = outputCB,` or even remove it altogether? If the overhead is gone, we have to think about how to make the Hermite interpolation faster.
If I make that change (alone), the code just fails to compile. For some reason I also need to comment out all the other saving functions later on; possibly, if `saved_values` is empty, HDF5 cannot recognize the type (as you may have mentioned elsewhere before). I remember trying to delete that line before, but since I also had other things to do I did not try harder to pursue this idea; this time I managed.
With `N = 25`, `NLen = 14`, and `callbacks = CallbackSet(saveCB, outputCB)`, I get:
```
total solver      1    23.4s  100.0%  23.4s   416MiB  100.0%   416MiB
  getDeriv!      55    18.7s   79.8%  340ms  62.7MiB   15.1%  1.14MiB
    getXBubble!  55    18.3s   78.3%  334ms  45.9MiB   11.0%   855KiB
```
With `callback = outputCB`, this is the output I get:
```
total solver      1    20.9s  100.0%  20.9s   406MiB  100.0%   406MiB
  getDeriv!      55    18.9s   90.4%  344ms  62.7MiB   15.4%  1.14MiB
    getXBubble!  55    18.5s   88.6%  337ms  45.9MiB   11.3%   855KiB
```
Removing `callback = ...` completely does not change the numbers significantly.
So the callbacks make some difference, but that is only half of the remainder; we still have 10% to account for.
In any case:

> Or split the state (and deriv) among ranks, making sure that the integration steps are kept the same on all the ranks.

I suspect this would be an "easy" solution to make the code scale efficiently with more nodes. We would have to add the communication time, but it should be relatively small, possibly around 2 times the time spent in communication in the current MPI implementation.
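A toy sketch of what "splitting the state among ranks" could look like (all names and the flat-vector layout here are hypothetical, not the actual PMFRG.jl structure): each rank computes only its slice of the derivative, and the slices are exchanged once per derivative call so that every rank takes identical integration steps.

```julia
# Toy sketch: data-parallel derivative with one Allgather! per call.
using MPI
MPI.Init()
comm  = MPI.COMM_WORLD
nrank = MPI.Comm_size(comm)
rank  = MPI.Comm_rank(comm)

nlocal  = 1024                    # assume the state splits evenly across ranks
n_total = nlocal * nrank
my_range = (rank * nlocal + 1):((rank + 1) * nlocal)

# Placeholder physics: each rank fills only its slice of the derivative.
function getDeriv_local!(dslice, State, Lam)
    dslice .= -Lam .* view(State, my_range)
end

State  = ones(n_total)
Deriv  = similar(State)
dslice = zeros(nlocal)

getDeriv_local!(dslice, State, 0.1)
# Every rank receives the full derivative, so the ODE steps stay in sync.
MPI.Allgather!(dslice, MPI.UBuffer(Deriv, nlocal), comm)
MPI.Finalize()
```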
One needs to make sure that problems like this are solved first, but there is the PencilArrays.jl package, which, judging from that discussion, seems to have been modified exactly to fix this issue.
Possible issues:
An implementation would require at least to:
Perhaps we can still gain some understanding of what is causing the overhead in the solver.
I found this interesting note. In particular:

> However, for optimality, you should make sure that the number of BLAS threads that you are using matches the number of physical cores and not the number of logical cores. See this issue for more details.
Perhaps we can still gain big improvements by trying the following: putting `Γ.a`, `Γ.b`, `Γ.c` all in one array for better BLAS performance. This doesn't mean we have to give up `ArrayPartition`s yet; maybe we allocate a big array `zeros(Npairs,N,N,N,4)` first and then initialize the partition with views into that big array instead (see the sketch below). If I have some time today I could try this out, or at least make a PR in your benchmarking repo. What do you think?
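A minimal sketch of that idea (the sizes are placeholders; whether broadcasting over an `ArrayPartition` of views actually hits the fast paths is exactly what we would need to benchmark):

```julia
using RecursiveArrayTools # for ArrayPartition

Npairs, N = 30, 24                       # placeholder sizes
parent_array = zeros(Npairs, N, N, N, 4) # one contiguous allocation

# The partition components are views into the big array, so Γ.a etc.
# keep their current shapes while the storage stays contiguous.
State = ArrayPartition(
    view(parent_array, :, :, :, :, 1),  # Γ.a
    view(parent_array, :, :, :, :, 2),  # Γ.b
    view(parent_array, :, :, :, :, 3),  # Γ.c
    view(parent_array, :, :, :, :, 4),  # fourth slot, hypothetically e.g. a bubble
)
```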
For now, I suggest we do any future benchmarking without the callbacks, to simplify things.
I am investigating the interesting note you mention. Unfortunately, this does not seem to work:

```julia
ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())
```

but

```julia
import LinearAlgebra
LinearAlgebra.BLAS.get_num_threads()
```

does work. It reports 1, which seems bad. I can try to change the number of threads with

```julia
LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())
```

but I see no difference in performance at all.
OK, I didn't have quite as much time as I thought, but at least we can be reasonably certain now that the overhead we're seeing is due to the solver.
As a quick check, we can modify our ODE to be something trivial:

```julia
function getDeriv!(Deriv, State, setup, Lam)
    Deriv[1] = 1.
    return
end
```
If we also disable callbacks as mentioned above, we can perform a run for the usual parameters and get an idea of the overhead of the ODE solver. For example, for `N = 40`, `NLen = 18`, I get
```
SciMLBase.DEStats(61, 0, 0, 0, 0, 0, 0, 0, 7, 3, 0.000000000000000)
 26.034321 seconds (1.10 k allocations: 6.402 GiB, 3.03% gc time)
```
which is in the ballpark of the time difference that you found above. (Overall, I think 26 seconds doesn't sound too bad, but given that the code seems to run quite fast overall on your machine, this actually becomes a reasonable contributor.) Here is the profile:
I also tried to see how much there is to gain from using a single contiguous array. Defining

```julia
function InitializeState(Par::PMFRGParams)
    VDims = getVDims(Par)
    V = zeros(VDims..., 3)
    return V
end
```
we get

```
SciMLBase.DEStats(61, 0, 0, 0, 0, 0, 0, 0, 7, 3, 0.000000000000000)
 17.246517 seconds (845 allocations: 6.402 GiB, 9.54% gc time)
```
a little better, though nothing out of this world.
Looking at the profile, most time is spent in broadcasting in RK steps such as

```julia
@.. broadcast=false thread=thread tmp = uprev +
    dt * (a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4)
```
This `@..` macro is supposed to use multithreading if we use, for example, `PMFRG.DP5(thread = PMFRG.OrdinaryDiffEq.True())`, but in the profile that does not seem to happen. Maybe I should make sure of that by `dev`ing OrdinaryDiffEq and seeing whether `thread` is still set to false somehow, and whether setting it to true can improve things.
I have verified that, internally, setting `DP5(thread = PMFRG.OrdinaryDiffEq.True())` indeed makes the `thread` option true. It doesn't really seem to benefit from it, though.
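(A check along these lines, sketched; `thread` is the field name `DP5` stores the option under:)

```julia
using OrdinaryDiffEq

alg = DP5(thread = OrdinaryDiffEq.True())
@show alg.thread   # prints static(true), i.e. the option did get through
```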
I would like to reproduce your results on a single node, with 76 cores.
I was able to generate the textual profile dump, but it is a little unwieldy. Having a look at it, if I understand it correctly, the work done in threads does not show up under the `solve -> getDeriv! -> getXBubblePartition!` call path, but under a completely independent call path, which happens to be much shorter. This makes things both much clearer and also confusing. Which, in itself, is confusing. Anyway.
I remember ending up looking at FastBroadcast before, while trying to understand whether multithreading is used or not. I tried looking at this:

```julia
julia> using FastBroadcast

julia> a = rand(10,10,10);

julia> b = rand(10,10,10);

julia> @macroexpand @.. thread=true c = a+b
quote
    var"##fastbroadcast#228_1" = c
    var"##fastbroadcast#228_2" = Base.broadcasted(+, a, b)
    var"##fastbroadcast#228_3" = (FastBroadcast.fast_materialize!)(static(true), static(false), var"##fastbroadcast#228_1", var"##fastbroadcast#228_2")
end
```
If I understand correctly, `Base.broadcasted` returns a `Base.Broadcast.Broadcasted` object that represents a lazy computation. Then `FastBroadcast.fast_materialize!` is called. Looking at the (hopefully) right method of `FastBroadcast.fast_materialize!`, it looks like `Base.Broadcast.materialize` is used (which does not seem to be threaded, right?) unless `use_fast_broadcast(S) == true`, where `S` is the first type parameter of the `Base.Broadcast.Broadcasted` object.
From what I can see, defining `State` as an `ArrayPartition`:

```julia
julia> bb = Base.broadcasted(+, State, State);

julia> typeof(bb)
Base.Broadcast.Broadcasted{RecursiveArrayTools.ArrayPartitionStyle{Base.Broadcast.DefaultArrayStyle{3}}, Nothing, typeof(+), Tuple{ArrayPartition{F...

julia> FastBroadcast.use_fast_broadcast(typeof(bb).parameters[1])
false
```

while if `State` were a simple `zeros(Float64, (10,10,10))` we would get `true`.
I suspect, then, that the difference you see in performance when using "standard" arrays might be due not only to a (possibly slight) size difference, but also to threading. Notice that I would not expect such a low-arithmetic-intensity operation to benefit from O(100) cores on a single node in the same way as `addX!` (which I suspect is strongly compute-bound) does; in other words, I do not think we can speed this up as much as `addX!` just by throwing more cores at it. But we can possibly improve the performance by throwing more nodes at it, in the sense that if we can perform these operations in a data-parallel way on multiple nodes, and we do not need to transfer the array data more often than once per call of `getDeriv!`, then we would definitely get a speedup by parallelizing this part with MPI as well.
Do `PencilArrays` support fast broadcast, so that we can use both MPI and threading for this part? I will look this up in the next days.
> Do PencilArrays support fast broadcast, so that we can use both MPI and threading for this part? I will look this up in the next days.
```julia
using MPI
using PencilArrays
using FastBroadcast

MPI.Init()

A = PencilArray{Float64}(undef, Pencil((10, 10), MPI.COMM_WORLD))
bb_lazy_pencil = Base.broadcasted(+, A, A)
println("Do PencilArrays support fast broadcast? ",
        FastBroadcast.use_fast_broadcast(typeof(bb_lazy_pencil).parameters[1]))

B = rand(Float64, (10, 10))
bb_lazy_base = Base.broadcasted(+, B, B)
println("Do Base Arrays support fast broadcast? ",
        FastBroadcast.use_fast_broadcast(typeof(bb_lazy_base).parameters[1]))

MPI.Finalize()
```
Output:

```
Do PencilArrays support fast broadcast? false
Do Base Arrays support fast broadcast? true
```
Of course, I wonder if it is intentional for `FastBroadcast` not to support `PencilArrays`. Since `FastBroadcast` uses `Polyester` internally for threading, I think it should be possible to test/benchmark `Polyester` with `PencilArrays` quite easily, to see if it breaks or not.
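A quick experiment along those lines could look like this (a sketch; it assumes `parent` returns the rank-local data of a `PencilArray` as a plain `Array`, which is worth double-checking):

```julia
using MPI, PencilArrays, Polyester

MPI.Init()
A = PencilArray{Float64}(undef, Pencil((64, 64), MPI.COMM_WORLD))
B = similar(A)
fill!(B, 1.0)

# Thread over the rank-local data with Polyester, as FastBroadcast would.
Al, Bl = parent(A), parent(B)
@batch for i in eachindex(Al, Bl)
    Al[i] = 2 * Bl[i]
end
MPI.Finalize()
```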
Do you think it might be a good idea to ask about this on discourse? I am still not sure that the first step should be to go to some distributed array type. This would require a bunch of modifications, in particular in the callbacks, which currently need the full state array. Are you sure that there is no way to improve the performance of the ODE solver otherwise?
There are some considerations to make here:
```julia
function my_callback(state, ::MultiThreaded)
    # <existing logic goes here>
end

function my_callback(state, ::PMFRG.UseMPI, setup = nothing)
    global_state = ...  # either initialize it here or take it from setup
    # Collect the full state on rank 0, and compute only there.
    MPI.Gather!(state, global_state, MPI.COMM_WORLD; root = 0)
    if MPI.Comm_rank(MPI.COMM_WORLD) == 0
        my_callback(global_state, MultiThreaded())
    end
end
```
3. Since it seems that there is a small performance increase from not using `ArrayPartition`s in any case, I would try to change the type of `State` to that of an ordinary array, and then write some functions that create views of it corresponding to the different components. These functions can be tested thoroughly very easily, and unpacking the components can then be done with a single function (see the sketch after this list), so that in the existing code `f_int, gamma, Va, Vb, Vc = State.x` can be changed to `f_int, gamma, Va, Vb, Vc = unpack(State, Par)`, which shouldn't be terribly inconvenient (I am assuming that it should not be a problem that `f_int`, `gamma`, and `V_{a,b,c}` are views; one could just make copies, though, and the price should not be terribly high).
4. Only once point 3 is done would I continue with using PencilArrays. Creating a PencilArray for each component of State separately makes things considerably more complex (one would have to gather multiple times, for example).
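A minimal sketch of what `unpack` could look like (the component sizes, their order in the flat array, and the parameter field names are assumptions for illustration, not the actual PMFRG.jl layout):

```julia
# Hypothetical layout: [f_int; gamma; Va; Vb; Vc] flattened in this order.
function unpack(State::AbstractVector, Par)
    NUnique, Ngamma, Npairs, N = Par.NUnique, Par.Ngamma, Par.Npairs, Par.N
    i = 0
    # Carve out the next `len` entries as a reshaped view, advancing the offset.
    take(len, dims) = (v = reshape(view(State, i+1:i+len), dims); i += len; v)
    f_int = take(NUnique, (NUnique,))
    gamma = take(NUnique * Ngamma, (NUnique, Ngamma))
    Va = take(Npairs * N^3, (Npairs, N, N, N))
    Vb = take(Npairs * N^3, (Npairs, N, N, N))
    Vc = take(Npairs * N^3, (Npairs, N, N, N))
    return f_int, gamma, Va, Vb, Vc
end
```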
What do you think?
> Do you think it might be a good idea to ask about this on discourse?

For sure. I haven't done that yet because I thought it was better to discuss it first with you.
`DP5()` has the `thread` option, but it is possible that `Tsit5()` could be better here. In general it seems to be the recommended solver (although I found that, for this package, it often requires a few more steps to solve the ODE).
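Following the call pattern used earlier in this thread, trying it out should be a one-line change (a sketch; `Tsit5` accepts the same `thread` option):

```julia
Solution, saved_values = SolveFRG(Par, method = Tsit5(thread = OrdinaryDiffEq.True()))
```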
> `f_int, gamma, Va, Vb, Vc = unpack(State,Par)`

Actually, I think it might be time to be even more radical. My next plan was anyway to go to a more generic vertex structure where all vertices are grouped together as one big array. I am now convinced that this makes more sense, and it might even improve performance in other ways in the future. So I think we can do something along the lines of `f_int, gamma, V = unpack(State,Par)` and rewrite the flow equations in terms of explicit views with indices 1, 2, 3. Perhaps this looks slightly less clean in the current Heisenberg case, but it is more logical considering that we will probably have more vertices in future generalizations.
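With the vertices grouped, the views could then be produced on top of the generic `unpack` sketched above, e.g. (hypothetical shape, matching the index convention 1, 2, 3):

```julia
f_int, gamma, V = unpack(State, Par)   # V assumed of shape (Npairs, N, N, N, 3)
Va = view(V, :, :, :, :, 1)
Vb = view(V, :, :, :, :, 2)
Vc = view(V, :, :, :, :, 3)
```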
True, but then we still lose quite a bit on the Hermite interpolation, ultimately.
I have a number of considerations I am more or less certain about. I assume the interpolation is done on each variable in parallel. If this is true:
In any case, even if the performance of the interpolation is suboptimal, we should get scalability anyway by parallelizing the state.
The reason why I was suggesting to ask on discourse is that this seems to be a problem people might have had before, and there is still a possibility that we are missing an easy solution.
I have seen this thread. Maybe explicitly asking a more general question would be appropriate; I am preparing a new thread there, with preliminary title "Sanctioned way to MPI-parallelize ODE solution?".
I will now try to work on point 3, to get an estimate of the performance increase on our machines when using an array type supported by `FastBroadcast`. I should have some results by tomorrow, I think.
For the moment, I will stick to the 5-variable version, but once that is implemented it should be easy to change it into a 3-variable version as you suggest.
Then I will open the thread.
So, running `getDeriv!` takes between 95% and 98.2% of the total time, with some variability that I cannot understand at the moment. The effect of BLAS threading is not clear; it seems that using 1 thread gives the best performance, but I would have to check on a longer run. I have tried with callbacks as well; the performance does not seem really stable, though. I might need to run longer benchmarks, and try this with MPI as well.
To add to our previous discussion:
I think I now remember why it was not quite so easy to improve the performance of `addX!` by arranging the vertices before. In there, we have something like

```julia
Xa_sum += (+Va12[ki] * Va34[kj] + Vb12[ki] * Vb34[kj] * 2) * Ptm
```

where `ki = S_ki[:, Rij]`.
In order to avoid gather instructions, we would have to pre-allocate two arrays for each vertex (`Va12` and `Va34`) and copy the appropriate vertex data into them.
Unfortunately, in addition, the buffers are not completely static, since we also need to use frequency symmetries, which may lead to an exchange `Rji = System.invpairs[Rij]`. This means that, depending on the values of `nwpr + nw2` etc. in the loop, we need to load the buffer with all the inverted pairs instead.
I think I struggled to reduce the cost of all this copying around because the only option I saw was to fill the buffers within the loop over `nwpr`; in the end, I broke even with what I already had in terms of performance.
I think I have also experimented with preallocating four buffer arrays for each vertex, one each for

```julia
Vx[ki, :, :, :]
Vx[kj, :, :, :]
Vx[invpairs[ki], :, :, :]
Vx[invpairs[kj], :, :, :]
```
Now that I think about it, though, it still seems like this might work. Unless I am missing something, we would only have to do this buffering once every ODE step. Depending on the precise values of `ns, nt, nu, nwpr`, we select which buffers to sum over at the start of `addX!`.
Note that this increases memory usage considerably, from `Npairs * N^3` to `4 * maximum(System.Nsum) * N^3`, which can easily be a factor of over 30.
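Since the `N^3` cancels, that factor is just the ratio of the prefactors:

```julia
# Memory blow-up relative to the current per-vertex storage Npairs * N^3:
memory_factor = 4 * maximum(System.Nsum) / Npairs   # "easily over 30"
```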
Once we have implemented the vertex as a single contiguous array it might be not super hard to implement though.
(Unrelated to the last post in this thread, just to record it here.) I have found a minimal working example of using `PencilArrays` (which are MPI-based) and `OrdinaryDiffEq` together in the test suite of `PencilArrays`. I have extended it a little to make a prototype/minimal working example of what (I think) we would need, in this repo.
Continuation of https://github.com/mmesiti/PMFRG-copy.jl/issues/1 for this repository.