NilsNiggemann / PMFRG.jl


Performance Engineering Master plan #8

Open mmesiti opened 10 months ago

mmesiti commented 10 months ago

Continuation of https://github.com/mmesiti/PMFRG-copy.jl/issues/1 for this repository.

mmesiti commented 8 months ago

(Copied from the last post of the issue in the old repo)

mmesiti commented 8 months ago

Also, since Julia 1.10 is now based on LLVM 17 (IIRC), it should be possible to tell LLVM to try and avoid gather instructions (see the relevant thread on the Julia Discourse).

NilsNiggemann commented 8 months ago

For reference, here's the post

I suppose the only way is to document that users of Intel CPUs should disable gather? I don't know if there can be a foolproof way to do it. Ideally this would be done in LoopVectorization but that seems unlikely since gather might not always be the slower option.

Btw what happens if one removes @turbo altogether?

mmesiti commented 8 months ago

Btw what happens if one removes @turbo altogether?

We might test that, but I think that the mitigation of Downfall does not affect all vector instructions, only the vgather ones.

mmesiti commented 8 months ago

@NilsNiggemann I finally have some results for

Run again for S = getCubic(18,[1.]); Par = Params(S, N=50, T=0.4, accuracy=1e-9); SolveFRG(Par, method=VCABM())

I will profile this large case to get more understanding.

mmesiti commented 8 months ago

A clarification on this:

The problem of the MPI approach is that this part cannot scale.

It could scale if we also split the State object between ranks. But I still need to understand what is happening outside our client code, inside solve(), and why that part is so large.

mmesiti commented 8 months ago

I did try to profile SolveFRG using @profile, but there are some additional complications. For example, the ratio of the number of samples collected for getXBubble! to the number of samples for the whole SolveFRG does not match the ratio of the times measured with TimerOutputs. I believe this is because the multithreaded part for some reason sits somewhere else in the call tree (I guess most of the work then ends up under a function like "run task" or something similar).

I expected that for large N the time spent in getXBubble! would rise rather rapidly, while the time in the solver would rise much more slowly. These are some data for

System = getSquareLattice(NLen, couplings)
[...]
Par = Params( #create a group of all parameters to pass them to the FRG Solver
    System, # geometry, this is always required
    OneLoop(), # method. OneLoop() is the default
    T=0.5, # Temperature for the simulation.
    N=N, # Number of positive Matsubara frequencies for the four-point vertex.
    accuracy=1e-3, #absolute and relative tolerance of the ODE solver.
    # For further optional arguments, see documentation of 'NumericalParams'
    MinimalOutput=true,
)
[...] 
@time Solution, saved_values = SolveFRG(
    Par,
    UseMPI(),
    MainFile=mainFile,
    CheckpointDirectory=flowpath,
    method=DP5(thread=OrdinaryDiffEq.True()),
    VertexCheckpoints=[],
    CheckPointSteps=3,
);

at various values of N (columns as printed by TimerOutputs: number of calls, total time, % of total time, time per call, allocations, % of total allocations, allocations per call):

- N=20:
 total solver               1    11.0s  100.0%   11.0s    237MiB  100.0%   237MiB
     getXBubble!           55    8.06s   73.6%   147ms   29.7MiB   12.5%   552KiB
- N=30
 total solver               1    45.5s  100.0%   45.5s    691MiB  100.0%   691MiB
     getXBubble!           55    36.7s   80.8%   668ms   66.5MiB    9.6%  1.21MiB
- N=40:
 total solver               1     119s  100.0%    119s   1.50GiB  100.0%  1.50GiB
     getXBubble!           49    99.2s   83.2%   2.02s    104MiB    6.8%  2.13MiB
- N=50:
 total solver               1     337s  100.0%    337s   2.91GiB  100.0%  2.91GiB
     getXBubble!           61     296s   88.0%   4.86s    204MiB    6.8%  3.34MiB

The time for one application of getXBubble! scales approximately as N^4, while the residual time in the solver for one application of getXBubble! scales approximately as N^3. This is nothing new. We still see, though, that a considerable amount of residual time is spent in the solver. There are two possible ways forward to allow good MPI scalability:

NilsNiggemann commented 8 months ago

My best guess is still that the culprit is the Hermite interpolation of the solution, used to compute observables.

What happens if we change this line

to

        callback = outputCB,

Or even remove it altogether? If the overhead is gone, we have to think about how to make the Hermite interpolation faster.

mmesiti commented 8 months ago

If I make that change alone, the code just fails to compile. For some reason I also need to comment out all the other saving functions later on - possibly because, if saved_values is empty, HDF5 cannot recognize the type (as you may have mentioned elsewhere before). I remember trying to delete that line before, but since I also had other things to do I did not pursue the idea any further - this time I managed.

With N = 25, NLen = 14, and callbacks = CallbackSet(saveCB,outputCB) I get:

 total solver               1    23.4s  100.0%   23.4s    416MiB  100.0%   416MiB
   getDeriv!               55    18.7s   79.8%   340ms   62.7MiB   15.1%  1.14MiB
     getXBubble!           55    18.3s   78.3%   334ms   45.9MiB   11.0%   855KiB

With callback = outputCB, this is the output I get:

 total solver               1    20.9s  100.0%   20.9s    406MiB  100.0%   406MiB
   getDeriv!               55    18.9s   90.4%   344ms   62.7MiB   15.4%  1.14MiB
     getXBubble!           55    18.5s   88.6%   337ms   45.9MiB   11.3%   855KiB

Removing callback = ... completely does not change the numbers significantly.

So the callbacks make some difference, but that is only about half of the remainder; we still have roughly 10% to account for.

mmesiti commented 8 months ago

In any case,

Or split the state (and deriv) among ranks, making sure that the integration steps are kept the same on all the ranks.

I suspect this would be an "easy" solution to make the code scale efficiently to more nodes. We would have to add the communication time, but it should be relatively small, possibly around twice the time spent in communication in the current MPI implementation.

One needs to make sure that problems like this one are solved first, but there is the PencilArrays.jl package which, judging from that discussion, seems to have been modified exactly to fix this issue.

Possible issues:

An implementation would require at least to:

NilsNiggemann commented 8 months ago

Perhaps we can still gain some understanding of what is causing the overhead in the solver.

I found this interesting note

In particular:

However, for optimality, you should make sure that the number of BLAS threads that you are using matches the number of physical cores and not the number of logical cores. See this issue for more details.

Perhaps we can still gain big improvements by trying the following:

  1. Set the number of BLAS threads to the number of cores
  2. Try the MKL backend
  3. Try different solvers, experiment with higher and lower order solvers
  4. Try to put Γ.a, Γ.b, Γ.c all in one array for better BLAS performance. This doesn't mean we have to give up ArrayPartitions yet; maybe we allocate a big array zeros(Npairs,N,N,N,4) first and then initialize the partition with views into that big array instead (a rough sketch is below). If I have some time today I could try this out, or at least make a PR in your benchmarking repo.
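
A rough sketch of that allocation (Npairs and N are placeholder values here; the 4 slices simply follow the zeros(Npairs,N,N,N,4) above):

Npairs, N = 100, 24
Γbig = zeros(Npairs, N, N, N, 4)
# Views along the last dimension; the ArrayPartition would then be initialized
# from these views instead of from freshly allocated arrays.
Γa = view(Γbig, :, :, :, :, 1)
Γb = view(Γbig, :, :, :, :, 2)
Γc = view(Γbig, :, :, :, :, 3)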

What do you think?

NilsNiggemann commented 8 months ago

For now, I suggest we do future benchmarking without the callbacks, to simplify things.

mmesiti commented 8 months ago

I am investigating the interesting note you mentioned. Unfortunately, this does not seem to work:

ccall((:openblas_get_num_threads64_, Base.libblas_name), Cint, ())

but

import LinearAlgebra
LinearAlgebra.BLAS.get_num_threads()

does work instead. It reports 1, which seems bad. I can try to change the number of threads with

LinearAlgebra.BLAS.set_num_threads(Threads.nthreads())

but I see no difference in performance at all.

NilsNiggemann commented 8 months ago

Ok, I didn't have quite as much time as I thought, but at least we can now be reasonably certain that the overhead we're seeing is due to the solver.

As a quick check, we can modify our ODE to be something trivial:

function getDeriv!(Deriv, State, setup, Lam)
    Deriv[1] = 1.
    return
end

If we also disable callbacks as mentioned above, we can perform a run with the usual parameters and get an idea of the overhead of the ODE solver.
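
For reference, the same kind of measurement can be sketched outside of PMFRG (the state size below is an arbitrary placeholder, not the real PMFRG state; trivialDeriv! mirrors the getDeriv! stub above):

using OrdinaryDiffEq

# Trivial in-place derivative: with no real work in the RHS and no callbacks,
# essentially all of the measured time is solver overhead (RK broadcasts,
# error estimation, step bookkeeping).
trivialDeriv!(Deriv, State, p, t) = (Deriv[1] = 1.0; nothing)

u0 = zeros(10^7)   # placeholder size, stands in for the PMFRG state vector
prob = ODEProblem(trivialDeriv!, u0, (0.0, 1.0))
@time solve(prob, DP5(thread = OrdinaryDiffEq.True()); save_everystep = false)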

For example, for N=40, NLen=18, I get

SciMLBase.DEStats(61, 0, 0, 0, 0, 0, 0, 0, 7, 3, 0.000000000000000)
 26.034321 seconds (1.10 k allocations: 6.402 GiB, 3.03% gc time)

which is in the ballpark of the time difference that you found above. (Overall, I think 26 seconds doesn't sound too bad, but given that the code seems to run quite fast overall on your machine, this actually becomes a notable contributor.) Here is the profile: [profile screenshot]

I also tried to see how much there is to gain from using a single contiguous array: defining

function InitializeState(Par::PMFRGParams)

    VDims = getVDims(Par)
    V = zeros(VDims...,3)
    return V
end

we get

SciMLBase.DEStats(61, 0, 0, 0, 0, 0, 0, 0, 7, 3, 0.000000000000000)
 17.246517 seconds (845 allocations: 6.402 GiB, 9.54% gc time)

a little better, though nothing out of this world.

Looking at the profile, most of the time is spent in broadcasting in RK steps such as

    @.. broadcast=false thread=thread tmp=uprev +
                                          dt * (a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4)

This @.. macro is supposed to use multithreading if we use, for example, PMFRG.DP5(thread=PMFRG.OrdinaryDiffEq.True()), but in the profile that does not seem to happen.

Maybe I should make sure of that by developing OrdinaryDiffEq and seeing whether thread is still set to false somehow and whether setting it to true can improve it.

NilsNiggemann commented 8 months ago

I have verified that setting DP5(thread=PMFRG.OrdinaryDiffEq.True()) does indeed make the internal thread option true. It doesn't really seem to benefit from it, though.

mmesiti commented 8 months ago

I would like to reproduce your results on a single node with 76 cores. I was able to generate the textual profile dump, but it is a little unwieldy. Having a look at it, if I understand it correctly, the work done in threads does not show up under the solve -> getDeriv! -> getXBubblePartition! call path, but under a completely independent call path, which happens to be much shorter. This makes things both much clearer and also more confusing. Which, in itself, is confusing.

Anyway.

I remember ending up looking at FastBroadcast before, while trying to understand whether multithreading is used or not. I tried looking at this:

julia> using FastBroadcast
julia> a = rand(10,10,10);
julia> b = rand(10,10,10);
julia> @macroexpand @.. thread=true c = a+b
quote
    var"##fastbroadcast#228_1" = c
    var"##fastbroadcast#228_2" = Base.broadcasted(+, a, b)
    var"##fastbroadcast#228_3" = (FastBroadcast.fast_materialize!)(static(true), static(false), var"##fastbroadcast#228_1", var"##fastbroadcast#228_2")
end

If I understand correctly, Base.broadcasted returns a Base.Broadcast.Broadcasted object that represents a lazy computation, and then FastBroadcast.fast_materialize! is called. Looking at the (hopefully) right method of FastBroadcast.fast_materialize!, it looks like Base.Broadcast.materialize is used (which does not seem to be threaded - right?) unless use_fast_broadcast(S) == true, where S is the first type parameter of the Broadcasted object.

From what I can see, defining State as an ArrayPartition:

julia> bb = Base.broadcasted(+,State,State);
julia> typeof(bb) 
Base.Broadcast.Broadcasted{RecursiveArrayTools.ArrayPartitionStyle{Base.Broadcast.DefaultArrayStyle{3}}, Nothing, typeof(+), Tuple{ArrayPartition{F...
julia> FastBroadcast.use_fast_broadcast(typeof(bb).parameters[1])
false

If State were instead a simple zeros(Float64,(10,10,10)), we would get true. I suspect, then, that the performance difference you see when using "standard" arrays might be due not only to a (possibly slight) size difference, but also to threading.
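
For comparison, the same check with a plain Array (bb_plain is just an illustrative REPL variable):

julia> bb_plain = Base.broadcasted(+, zeros(Float64,(10,10,10)), zeros(Float64,(10,10,10)));

julia> FastBroadcast.use_fast_broadcast(typeof(bb_plain).parameters[1])
true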

Notice that I would not expect such a low-arithmetic-intensity operation to benefit from O(100) cores on a single node in the same way as addX! (which I suspect is strongly compute-bound) does - in other words, I do not think we can speed this part up as much as addX! by just throwing more cores at it.

But we can possibly improve the performance by throwing more nodes at it, in the sense that if we can perform these operations in a data-parallel way on multiple nodes and we do not need to transfer the array data more often than once per call of getDeriv!, then we would definitely get a speedup by parallelizing this part with MPI as well.

Do PencilArrays support fast broadcast, so that we can use both MPI and threading for this part? I will look this up in the next few days.

mmesiti commented 8 months ago

Do PencilArrays support fast broadcast, so that we can use both MPI and threading for this part? I will look this up in the next few days.

using MPI
using PencilArrays
using FastBroadcast

MPI.Init()

A = PencilArray{Float64}(undef,
                          Pencil((10,10),MPI.COMM_WORLD))

bb_lazy_pencil = Base.broadcasted(+,A,A)

println("Do PencilArrays support fast broadcast?",
              FastBroadcast.use_fast_broadcast(typeof(bb_lazy_pencil).parameters[1]))

B = rand(Float64,(10,10))

bb_lazy_base = Base.broadcasted(+,B,B)
println("Do Base Arrays support fast broadcast? ",
              FastBroadcast.use_fast_broadcast(typeof(bb_lazy_base).parameters[1]))

MPI.Finalize()

Output:

Do PencilArrays support fast broadcast? false
Do Base Arrays support fast broadcast? true

mmesiti commented 8 months ago

Of course I wonder if it is intentional for FastBroadcast not to support PencilArrays. Since FastBroadcast uses Polyester internally for threading, I think it should be possible to test/benchmark Polyester with PencilArrays quite easily, to see if it breaks or not.
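
A quick smoke test along those lines might look like the following (only a sketch, assuming that scalar indexing into a PencilArray's local data behaves like any other AbstractArray; not a benchmark):

using MPI, PencilArrays, Polyester

MPI.Init()

A = PencilArray{Float64}(undef, Pencil((64, 64), MPI.COMM_WORLD))
B = similar(A)
fill!(A, 0.0)
fill!(B, 1.0)

# Thread over the local portion of the PencilArray with Polyester's @batch;
# if this errors or gives a wrong result, the FastBroadcast/Polyester path
# cannot be expected to work either.
@batch for i in 1:length(A)
    A[i] = 2 * B[i]
end

println("Polyester @batch over a PencilArray gave the expected result: ",
        all(==(2.0), A))

MPI.Finalize()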

NilsNiggemann commented 8 months ago

Do you think it might be a good idea to ask about this on Discourse? I am still not sure that the first step should be to go to some distributed array type. This would require a bunch of modifications, in particular in the callbacks, which currently need the full state array. Are you sure that there is no way to improve the performance of the ODE solver otherwise?

mmesiti commented 8 months ago

There are some considerations to make here:

  1. I might be obsessed with trying to get this to scale to a number of nodes on machines that you are not interested in. In that case, this could be a waste of time. By parallelizing the solver state between nodes, if we keep running on 2 nodes, we might get, from what I have seen, only a ~10% speedup. For larger numbers of nodes, though, this speedup might be crucial to getting decent performance.
  2. The callbacks can still work on the full array; they "just" need to do a "gather" first. I think one could do something like this (a bit vague, I know):

    function my_callback(state, ::MultiThreaded)
        <existing logic goes here>
    end

In the extension:

    function my_callback(state, ::PMFRG.UseMPI, setup = nothing)
        global_state = .... # either initialize it here or take it from setup
        # (creation of the setup object might need to be dispatched on the
        # parallelization scheme)
        MPI.Gather!(state,
                    global_state,
                    MPI.COMM_WORLD;
                    root = 0) # compute only on rank 0
        if MPI.Comm_rank(MPI.COMM_WORLD) == 0
            my_callback(global_state, MultiThreaded())
        end
    end

  3. Since it seems that there is a small performance increase from not using `ArrayPartition`s in any case, I would try to change the type of State to that of an ordinary array, and then write some functions that create views of it corresponding to the different components. These functions can be tested thoroughly very easily, and unpacking the components can then be done with a single function, so that in the existing code
         f_int, gamma, Va, Vb, Vc = State.x
     can be changed to
         f_int, gamma, Va, Vb, Vc = unpack(State,Par)
     which shouldn't be terribly inconvenient (I am assuming that it is not a problem that f_int, gamma, and V_{a,b,c} are views; one could just make copies, though, and the price should not be terribly high). A rough sketch of such an unpack follows after this list.
  4. Only once point 3 is done would I continue with PencilArrays. Creating a PencilArray for each component of State separately would make things quite a bit more complex (one would have to gather multiple times, for example).
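
A rough sketch of what such an unpack could look like (the sizes are placeholders and would in practice be derived from Par; this is not the actual PMFRG state layout):

function unpack(State::AbstractVector, Par)
    # Placeholder sizes; in practice these would come from Par.
    NUnique, Ngamma, Npairs, N = 2, 32, 100, 24
    i = 0
    # Return a reshaped view of the next prod(sz) entries of State.
    function take(sz...)
        v = reshape(view(State, i+1:i+prod(sz)), sz)
        i += prod(sz)
        return v
    end
    f_int = take(NUnique)
    gamma = take(NUnique, Ngamma)
    Va = take(Npairs, N, N, N)
    Vb = take(Npairs, N, N, N)
    Vc = take(Npairs, N, N, N)
    return f_int, gamma, Va, Vb, Vc
end

The flow equations would then use f_int, gamma, Va, Vb, Vc = unpack(State, Par) exactly as written above, with State being an ordinary array.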

What do you think?

mmesiti commented 8 months ago

Do you think it might be a good idea to ask about this on discourse?

For sure. I haven't done that yet because I thought it was better to discuss it first with you.

NilsNiggemann commented 8 months ago

  1. I agree with you that it would be nice to get the scaling as optimal as possible. The reason why I was suggesting to ask on Discourse is that this seems to be a problem that people might have had before, and there is still a possibility that we are missing an easy solution. Of course, doing several broadcasts for large arrays is not completely trivial, so it might be that we are ultimately forced to go to some sort of distributed array format if we want to get the performance right. I think asking doesn't harm; at the moment I don't have another idea for how to improve things.
  2. True, but then we still lose quite a bit on the Hermite interpolation ultimately. As of now, I am not super sure this can be fully resolved. Actually, we can probably play around more with the solvers. Some get the interpolant for free and some don't. I believe DP5() has it, but it is possible that Tsit5() could be better here. In general it seems to be the recommended solver (although I found that for this package it often requires a few more steps to solve the ODE).
  3. Regarding f_int, gamma, Va, Vb, Vc = unpack(State,Par): actually, I think it might be time to be even more radical. My next plan was anyway to go to a more generic vertex structure where all vertices are grouped together as one big array. I am now convinced that this makes more sense and might even improve performance in other ways in the future. So I think we can do something along the lines of f_int, gamma, V = unpack(State,Par) and rewrite the flow equations in terms of explicit views with indices 1, 2, 3. Perhaps this looks slightly less clean in the current Heisenberg case, but it is more logical considering we will probably have more vertices in future generalizations.

mmesiti commented 8 months ago

True, but then we still lose quite a bit on the hermite interpolation ultimately.

I have a number of considerations I am more or less certain about. I assume that the interpolation is done on each variable in parallel. If this is true:

  1. Parallelizing the state among nodes should give us perfect scalability of the interpolation in the number of nodes
  2. Something similar should happen as well if we use more cores, but according to the quick test you made, that does not seem to be the case. I think this might be because the algorithm, in its current form, could be memory bound, and/or the work to do on each iteration could be so small that overheads cause issues. I need to have a look at the algorithm (finding it first would be necessary), because I can imagine that the optimal implementation of such an interpolation for an array type might differ substantially from the optimal implementation for a scalar (data reuse might be the issue), but without seeing the code it is pretty difficult to say.

In any case, even if the performance of the interpolation is suboptimal, we should get scalability anyway by parallelizing the state.

The reason why I was suggesting to ask on discourse is that this seems to be a problem that people might have had before and there is still a possibility that we are missing an easy solution

I have seen this thread. Maybe explicitly asking a more general question would be appropriate - I am preparing a new thread there, with the preliminary title "Sanctioned way to MPI-parallelize ODE solution?".

I will now try to work on point 3 to get an estimate of the performance increase on our machines when using an array type supported by FastBroadcast. I should have some results by tomorrow, I think. For the moment I will stick to the 5-variable version, but once that is implemented it should be easy to change it into a 3-variable version as you suggest.

Then I will open the thread.

mmesiti commented 7 months ago

So, running

I have tried with callbacks as well; the performance does not seem really stable, though - I might need to run longer benchmarks, and try this with MPI as well.

NilsNiggemann commented 7 months ago

https://docs.julialang.org/en/v1/base/arrays/#Base.Broadcast.AbstractArrayStyle

NilsNiggemann commented 7 months ago

To add to our previous discussion: I think I now remember why it was not quite so easy to improve the performance of addX! by arranging the vertices before. In there, we have something like

Xa_sum += (+Va12[ki] * Va34[kj] + Vb12[ki] * Vb34[kj] * 2) * Ptm

where ki = S_ki[:, Rij]. In order to avoid gather instructions, we would have to pre-allocate two arrays for each vertex (Va12 and Va34) and copy the appropriate vertex data into them. Unfortunately, the buffers are in addition not completely static, since we also need to use frequency symmetries, which may lead to an exchange Rji = System.invpairs[Rij]. This means that, depending on the values of nwpr+nw2 etc. in the loop, we need to load the buffer with all the inverted pairs instead. I think I struggled to reduce the cost of all this copying around because the only option I saw was to fill the buffers within the loop over nwpr - in the end I broke even with what I already had in terms of performance.

I think I have also experimented with preallocating four buffer arrays for each vertex - one each for

Now that I think about it, though, it still seems like this might work. Unless I am missing something, this buffering would only have to be done once every ODE step. Depending on the precise values of ns, nt, nu, nwpr, we select which buffers to sum over at the start of addX!. Note that this increases memory usage considerably, from Npairs*N^3 to 4*maximum(System.Nsum)*N^3, which can easily be a factor of over 30.

Once we have the vertex implemented as a single contiguous array, it might not be super hard to implement, though.
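
Very schematically, and with purely hypothetical names and shapes rather than the actual PMFRG data layout, the once-per-step buffer filling could look like:

# Copy the gathered vertex data into a contiguous buffer once per ODE step,
# so that the hot loop in addX! can read buf[k, Rij] linearly instead of
# issuing a gather for V[S_ki[k, Rij]].
function fill_buffer!(buf::AbstractMatrix, V::AbstractVector, S_ki::AbstractMatrix{<:Integer})
    @inbounds for Rij in axes(S_ki, 2), k in axes(S_ki, 1)
        buf[k, Rij] = V[S_ki[k, Rij]]
    end
    return buf
end

With frequency symmetries, one such buffer per variant (normal and inverted pairs) would be filled, and addX! would pick the appropriate one at its start, as described above.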

mmesiti commented 7 months ago

(Unrelated to the last post in this thread, just to record it here:) I have found a minimal working example of using PencilArrays (which are MPI-based) and OrdinaryDiffEq together in the test suite of PencilArrays. I have extended it a little into a prototype/minimal working example of what (I think) we would need, in this repo.