JuliaMolSim / DFTK.jl

Density-functional toolkit
https://docs.dftk.org
MIT License

HPC #90

Closed · antoine-levitt closed this issue 3 years ago

antoine-levitt commented 4 years ago

Opening this issue to track ideas on that front. I think we are already pretty good at multithreading. Do we want DFTK to work on GPUs and clusters?

mfherbst commented 4 years ago

> Do we want DFTK to work on GPUs and clusters?

Yes, because why not ;)

```julia
using FFTW
using CuArrays

function test_fft(T, N)
    A  = randn(Complex{T}, N)
    Ad = cu(A)

    # plan the FFT on the CPU and on the GPU
    p  = plan_fft(A)
    pd = CuArrays.CUFFT.plan_fft(Ad)

    println(typeof(p))
    println(typeof(A))

    println("CPU")
    d = @time p * A

    println("GPU")
    resd = @time pd * Ad

    # compare CPU and GPU results
    println("  difference:  ", maximum(abs, d - Array(resd)))
end
test_fft(N) = test_fft(Float64, N)
```

I get

```
julia> test_fft(Float32, 50_000_000)
FFTW.cFFTWPlan{Complex{Float32},-1,false,1}
Array{Complex{Float32},1}
CPU
  1.626926 seconds (2 allocations: 381.470 MiB, 5.09% gc time)
GPU
  0.001284 seconds (84 allocations: 2.469 KiB)
  difference:  0.012333482
```

and

```
julia> test_fft(Float64, 50_000_000)
FFTW.cFFTWPlan{Complex{Float64},-1,false,1}
Array{Complex{Float64},1}
CPU
  2.374748 seconds (2 allocations: 762.940 MiB, 0.82% gc time)
GPU
  0.002024 seconds (86 allocations: 2.500 KiB)
  difference:  2.2791871899501835e-11
```


which I find impressive. One should say that the CPU there is not the most recent and Julia is not compiled specifically for the machine (whereas the CUDA build obviously uses all the tricks for the GPU), so one has to be a little careful. But still, I'd say the order of magnitude makes it worth a try. Also: on mixed GPU/CPU clusters you have the GPUs lying around anyway, and being able to use them would be interesting.

- DistributedArrays seems reasonable. I have not played with it a lot, however, and neither have I with Elemental, so I can't really say. Since we do not actually have a lot of data to pass between nodes if we parallelise at the level of k-points, one other option would be to use [ZMQ](https://github.com/JuliaInterop/ZMQ.jl) to pass around only the densities at the `k`-points / the full density whenever the density is updated, and use only threads for the rest.
antoine-levitt commented 4 years ago

Re GPU timings, that looks unlikely. GPUs do things asynchronously, and you might need an explicit synchronize to get correct timings. Could you test doing a bunch of them in sequence with an explicit sync at the end? Also maybe throw in a CPU/GPU transfer to see its impact? For sizes, the ones in https://gist.github.com/antoine-levitt/88086895dd98f746d6c795c99a10fd9f (128x128x128x40) are a reasonable starting point (they should be representative of a moderately expensive computation).
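A rough benchmarking sketch along those lines (a minimal illustration, not part of the discussion above: it uses the modern CUDA.jl API instead of the older CuArrays.jl, and the name `bench_fft` and the repetition count are just placeholders):

```julia
using FFTW, CUDA

function bench_fft(T, N; nrep=10)
    A  = randn(Complex{T}, N)
    Ad = CuArray(A)              # keeps the element type (cu() would downcast Float64 to Float32)
    p  = plan_fft(A)
    pd = plan_fft(Ad)            # dispatches to a CUFFT plan for CuArrays

    println("CPU")
    @time for _ in 1:nrep
        p * A
    end

    println("GPU, with explicit synchronization")
    @time CUDA.@sync for _ in 1:nrep
        pd * Ad                  # kernels are queued asynchronously; @sync waits for them
    end

    println("GPU, including a device-to-host transfer")
    @time Array(pd * Ad)
end
```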

Re distributed, k-points are indeed pretty easy. But that doesn't really help much, because when you want to do large systems (e.g. inhomogeneous systems with defects etc.) you usually don't have k-points at all. To really do these large systems, you have to distribute over both the bands and the G vectors, which is more annoying to code.

mfherbst commented 4 years ago

I just talked to the two guys from Oceananigans.jl at 36c3. They suggested looking at MPIArrays and https://juliagpu.org/, and also keeping track of developments in DistributedTranspose.jl.

antoine-levitt commented 4 years ago

Yeah, MPIArrays looks like the way to go, but I'm kind of afraid to use it because it looks pretty young and abandoned. Elemental seems a bit more mature. I don't know how feasible it is to have DFTK be completely agnostic to the array type. That said, the place where we do need data distribution (at least at the beginning) is just the Psi, which is relatively isolated: just LOBPCG and the Hamiltonian application (so FFTs and nonlocal projectors). It shouldn't actually be that much work in the end. Actually, just the first level of parallelism (distributing the columns of Psi, which means the Hamiltonian applications are completely independent and the only synchronization points are the matrix-matrix products inside the eigensolver) should be pretty easy. I think that alone gets us to a few tens of nodes (so hundreds to thousands of processors) easily, an order of magnitude gain compared to what we have now. Might be worth a try (but I'm too lazy to do it).
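For illustration, a minimal sketch (not DFTK code) of that first level of parallelism with MPI.jl, assuming its allocating `MPI.Allgather` form and using a toy dense Hamiltonian in place of the real one: the Hamiltonian applications on the locally owned columns need no communication, and the subspace matrix of the eigensolver is the synchronization point.

```julia
using MPI, LinearAlgebra

MPI.Init()
comm  = MPI.COMM_WORLD
nproc = MPI.Comm_size(comm)

n, nbands_local = 64, 4                       # toy basis size / bands per rank
H = let A = randn(n, n); Symmetric(A + A') end

psi_local  = randn(n, nbands_local)           # this rank's columns of Psi
Hpsi_local = H * psi_local                    # fully independent, no communication

# Synchronization point: Psi' * (H * Psi) needs all columns on all ranks.
# Gathered data is ordered by rank, so a reshape recovers the column blocks.
gather(M) = reshape(MPI.Allgather(vec(M), comm), n, nbands_local * nproc)
subspace  = gather(psi_local)' * gather(Hpsi_local)   # input to the Rayleigh-Ritz step

MPI.Finalize()
```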

mfherbst commented 4 years ago

Agreed. There seems to be quite some movement with regard to parallelisation and distributed computing at the moment, which should bear fruit soon.

I'd have it more on the list for the next work package (i.e. when the "simple" things we are working on right now are out).

antoine-levitt commented 4 years ago

See https://discourse.julialang.org/t/ann-pencilffts-parallel-ffts-of-mpi-distributed-arrays/33191 for MPI FFTs (although this probably doesn't get us very much for now, as we have rather small FFTs but lots of them, so it's more efficient to parallelize on eigenvectors)

antoine-levitt commented 3 years ago

OK, so regarding multiple nodes we have three possible levels of parallelism: k-points, bands and G vectors. The first is by far the easiest, so we might as well start there. The Fortran way of doing this is to restrict basis.kpoints to the k-points treated by the current processor, with psi only containing these k-points. Most routines only work k-point by k-point (e.g. the Hamiltonian application) and do not need to be modified at all. Others accumulate (e.g. the density construction) and just need an MPI_ALLREDUCE at the end. Some (but not a lot, and none of the core computation) need all the information. For those I think the simplest is to error out if the input is distributed, and provide conversions from a distributed basis/scfres to a centralized one.
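A minimal sketch of that accumulation pattern with MPI.jl (not DFTK internals; the round-robin k-point distribution and the placeholder density contribution are purely illustrative):

```julia
using MPI

MPI.Init()
comm  = MPI.COMM_WORLD
rank  = MPI.Comm_rank(comm)
nproc = MPI.Comm_size(comm)

nkpt_total = 8                              # toy global k-point count
my_kpts    = (rank + 1):nproc:nkpt_total    # this rank's k-points (round robin)

fft_size  = (4, 4, 4)
rho_local = zeros(fft_size)
for ik in my_kpts
    # In DFTK this would be the weighted sum of |psi_nk|^2 on the real-space grid;
    # here a constant stands in for the contribution of k-point ik.
    rho_local .+= fill(1.0 / nkpt_total, fft_size)
end

# The only communication: sum the per-rank partial densities on all ranks.
rho = MPI.Allreduce(rho_local, +, comm)

MPI.Finalize()
```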

mfherbst commented 3 years ago

Sounds reasonable. Any reason for not using something more Julian like DistributedArrays.jl?

antoine-levitt commented 3 years ago

I don't have any experience with it, but 1) it doesn't seem to gain us much compared to MPI, 2) it doesn't seem to be very actively used, and 3) it's not backed by MPI, so I wouldn't expect good performance on supercomputers. It would mean that postprocessing routines should work without modification (although slowly), though.

antoine-levitt commented 3 years ago

New contenders from https://discourse.julialang.org/t/ann-mpi-jl-v0-10-0-new-build-process-and-cuda-aware-support/27601/15 https://github.com/jipolanco/PencilArrays.jl and https://github.com/kose-y/DistStat.jl.

mfherbst commented 3 years ago

Awesome. Yes, PencilArrays looks like the best shot at the moment.

antoine-levitt commented 3 years ago

Doesn't look flexible enough for what we want to do... straight MPI seems the simplest for now.

mfherbst commented 3 years ago

Why? Because you would get all band data in one big array? In combination with https://github.com/SciML/RecursiveArrayTools.jl for the non-MPI case I don't think that's an issue.

antoine-levitt commented 3 years ago

No, because there are a number of limitations on the distribution of the array, and it won't actually save us from having to make a number of MPI calls manually (e.g. MPI_SUM reductions) anyway. So it's just not a good enough abstraction for our use case.

mfherbst commented 3 years ago

Too bad ... maybe one day.

bhalonen commented 3 years ago

Guys, I'm about to order a big desktop to run DFTK and I'm wondering how many threads can be efficiently used by DFTK on a larger simulation.

mfherbst commented 3 years ago

Awesome! Great to see people going into production with DFTK!

Right now not yet so many (we've used around 8 cores for most calculations, but in my experience the improvements start to stagnate at 4). I am running a systematic study at the moment on a few representative bigger systems, also to calibrate the threading a bit better. If you are patient for another week or so I can give you a nicer answer with some data to accompany it :smile: ... look out for changes in the docs.

Also, we are implementing MPI-based parallelism over k-points at the moment, which we expect to scale a bit better. What sort of magnitude do you have in mind?

antoine-levitt commented 3 years ago

Well, your timing is impeccable :-) https://github.com/JuliaMolSim/DFTK.jl/pull/347 What kind of systems do you want to run?

bhalonen commented 3 years ago

I've discussed this with @mfherbst over email, but we are simulating the surface of molten glass against ceramics and metals. So fairly large systems, but we don't need exact numbers, just a rough (but consistent) measure of surface interactions, so we can reduce the system size as long as we keep that level of accuracy.

mfherbst commented 3 years ago

So that means you probably use few k-points but many bands? That's unfortunately the level where #347 does not help so much, I suppose. But of course you can always run multiple simulations in parallel if that's an option.

By the way: if your systems converge nicely with Ecut, you can always cut that down a bit, and you can also try committing variational crimes (by using determine_fft_size with a supersampling less than 2, like 1.5 or so). That lowers the cost, and if the systems are similar enough my guess is that the errors are systematic (but check this).

bhalonen commented 3 years ago

We can definitely run simulations in parallel; I'm just trying to determine how many we need to run at once to keep the computer busy.

mfherbst commented 3 years ago

Preliminary results from my study are that using 4 Julia threads and 4 BLAS threads for a calculation cuts the time by a good factor of 2 compared to using no threading. But since these two kinds of threading are not multiplicative, that does not mean keeping 16 cores busy; it's more like a load of 6 to 8 if I remember correctly, though that is system dependent as well. From there on the gains really start to diminish ... so clearly DFTK is not yet HPC at the current stage :smile:.
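For reference, a sketch of how such a 4 + 4 setup would be selected; these are standard Julia mechanisms, nothing DFTK-specific:

```julia
# Launch Julia with 4 Julia threads, e.g.
#   JULIA_NUM_THREADS=4 julia script.jl
#   # or, on Julia >= 1.5:  julia -t 4 script.jl

using LinearAlgebra

BLAS.set_num_threads(4)      # BLAS threads are set independently of Julia threads
@show Threads.nthreads()     # number of Julia threads (4 if launched as above)
```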

antoine-levitt commented 3 years ago

If you've got k-points (which you should have for a surface) you can go higher for sure once we're done with MPI. Also there's a nonzero chance I get carried away and MPI-ify the bands as well. It all depends on the size of the system, but if you've got a computation that takes, say, an hour in serial, you should be able to get very good scaling on 30 processes very easily.

bhalonen commented 3 years ago

That's a good enough indicator for what I'm looking for, thanks.

Really great package, and fun to see your development arc play out in real time. You guys are blindingly fast.

mfherbst commented 3 years ago

Thanks. Great to see we're attracting users to test our code in real life.

antoine-levitt commented 3 years ago

Closing in favor of more specific issues: https://github.com/JuliaMolSim/DFTK.jl/issues/350 https://github.com/JuliaMolSim/DFTK.jl/issues/349 https://github.com/JuliaMolSim/DFTK.jl/issues/348