CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License

Multi GPU scaling is very poor #1882

Closed · hennyg888 closed this issue 1 year ago

hennyg888 commented 3 years ago

I recently ran the weak scaling shallow water model benchmark with the MultiGPU architecture on Satori, thanks to @christophernhill. Here are the results:

size | ranks | min | median | mean | max | memory | allocs | samples
-- | -- | -- | -- | -- | -- | -- | -- | --
(4096, 256) | (1, 1) | 2.765 ms | 2.786 ms | 2.849 ms | 3.374 ms | 2.03 MiB | 5535 | 10
(4096, 512) | (1, 2) | 6.932 ms | 7.081 ms | 8.037 ms | 26.174 ms | 2.03 MiB | 5859 | 20
(4096, 1024) | (1, 4) | 12.592 ms | 14.603 ms | 16.417 ms | 31.468 ms | 2.03 MiB | 5859 | 40

size | ranks | slowdown | efficiency | memory | allocs
-- | -- | -- | -- | -- | --
(4096, 256) | (1, 1) | 1.0 | 1.0 | 1.0 | 1.0
(4096, 512) | (1, 2) | 2.54127 | 0.393505 | 1.00271 | 1.05854
(4096, 1024) | (1, 4) | 5.24053 | 0.19082 | 1.00271 | 1.05854

The results are not good but at least we can benchmark multi-GPU performance now.

glwagner commented 3 years ago

Perhaps we can name this issue "Multi GPU scaling is very poor" so that we can resolve when the scaling gets better :-D

francispoulin commented 3 years ago

@hennyg888 , could you tell us exactly what branch and script you used to produce this result?

glwagner commented 3 years ago

Just for clarification, how is "efficiency" defined?

francispoulin commented 3 years ago

The total time for the serial job divided by the product of the number of cores and the time for that run, say

N_1 / (p * N_p)

where N_1 is the time for 1 core, p is the number of cores, and N_p is the time for p cores.

glwagner commented 3 years ago

Ah nice thanks. Makes sense, between 0 and 1.

francispoulin commented 3 years ago

The fact that the efficiency goes down to 40% for 2 GPUs says that it's actually running slower than on one core. Certainly suboptimal. I'm sure we can do better, and we will.

glwagner commented 3 years ago

Is the problem being parallelized in y? Would it be better to use a problem that is relatively wide in the direction being parallelized? E.g. layouts like (256, 512) with (1, 1); (256, 1024) with (1, 2), etc.

hennyg888 commented 3 years ago

The total time for the serial job divided by the product of the number of cores and the time for that run, say

N_1 / (p * N_p)

where N_1 is the time for 1 core, p is the number of cores, and N_p is the time for p cores.

This is actually weak scaling, so the efficiency is just N_1 / N_p, and median times are used, not means.
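For concreteness, a minimal sketch of the two definitions in plain Julia (illustrative only):

```julia
# Strong scaling: fixed total problem size, so the ideal time on p cores is t1 / p.
strong_scaling_efficiency(t1, tp, p) = t1 / (p * tp)

# Weak scaling: the problem grows with p, so the ideal time is t1 itself.
weak_scaling_efficiency(t1, tp) = t1 / tp

# Using the median times from the table above:
weak_scaling_efficiency(2.786, 7.081)  # ≈ 0.39, matching the reported 2-rank efficiency
```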

@hennyg888 , could you tell us exactly what branch and script you used to produce this result?

I used the latest master branch with weak_scaling_shallow_water_model.jl and weak_scaling_shallow_water_model_single.jl, except that the architecture was changed from MultiCPU to MultiGPU.

glwagner commented 3 years ago

This is actually weak scaling, so the efficiency is just N_1 / N_p, and median times are used, not means.

Right, that makes sense. What I said was wrong; 1 would not be an upper bound unless magic happened. An efficiency of 0.5 means that the problem takes roughly the same amount of time it would take if one simply continued to use a single core rather than parallelizing.

The layout issue I point out above still holds --- I think these problems have a large "surface area" compared to computation, so they may not be the best target for parallelization. Unless I'm missing something.

Another thing is that I'm not sure these problems are big enough. We can run problems with ~30 million dof (sometimes more). But 4096x256 has just ~1 million dof. Do we know how much GPU utilization we are getting with 1 million dof?

vchuravy commented 3 years ago

It would be important to capture the environment used. Could you share your SLURM script and setup? Which modules did you use, etc.?

Secondly, we should do some profiling to see where the time goes. (Does Oceananigans have something like that? Either based on CLIMA's TicToc or TimerOutputs.jl.)

francispoulin commented 3 years ago

I will let @hennyg888 share the SLURM and module information but I can say that we are keen to do some profiling of this, and other runs.

I have not heard of Oceananigans having any profiling but would love to hear what people suggest we use. We were considering nvprof as that seems easy to start using but we are open to suggestions.

francispoulin commented 3 years ago

I have tried running the library ImplicitGlobalGrid.jl on 1, 2, and 4 GPUs and have actually found rather bad scaling as well: 57 and 35 percent.

I have created an issue and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans?

vchuravy commented 3 years ago

I have tried running the library ImplicitGlobalGrid.jl on 1, 2, and 4 GPUs and have actually found rather bad scaling as well: 57 and 35 percent.

I have created an issue and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans?

This then sounds to me like you don't have a working CUDA-aware MPI. IGG should show >90% efficiency
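If it helps narrow that down: assuming the benchmark drives MPI through MPI.jl, MPI.jl can report whether the underlying MPI library was built with CUDA support, e.g.

```julia
using MPI

MPI.Init()
# true only if the MPI library that MPI.jl is linked against was built with CUDA support
@show MPI.has_cuda()
```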

vchuravy commented 3 years ago

As I said, please post your SLURM script and other environment options. It is impossible to debug otherwise.

I have an annotated SLURM script here https://github.com/CliMA/ClimateMachine.jl/wiki/Satori-Cluster which is what I used a while back for GPU scaling tests. A misconfigured MPI can easily manifest itself as scaling this poor.

christophernhill commented 3 years ago

@hennyg888 thanks for posting this. A few thoughts:

I assume what @hennyg888 is running is based on this https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi ?

There are quite a few things to double (triple) check:

  1. Are you running on multiple GPUs? There is some obscure foo for that here ( https://github.com/christophernhill/ImplicitGlobalGrid.jl/blob/5e4fd0698b7087467d9314bfa253d6bc9a09a40a/diffusion3D_multigpu_CuArrays_novis.jl#L21 ) that is not in Oceananigans or ImplicitGlobalGrid as downloaded. It's not really documented anywhere either (except in the code for this blog post https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/CUDA_Aware_MPI.c as far as I can tell)! Without this bit you may end up running all ranks on the same GPU. The blog post here https://developer.nvidia.com/blog/benchmarking-cuda-aware-mpi/ gives a bit of background.

  2. Is there anything else running on the node when you test? When I looked earlier in the week, Satori had become annoyingly busy. You need to request an exclusive node, and then unfortunately wait because of everyone else using it. If you skip asking for exclusive you may end up sharing a node, which is OK for getting work done but confusing for benchmarking.

  3. As @vchuravy mentions, you may or may not be using messaging that goes directly GPU to GPU. There is an issue with recent CUDA.jl that makes that hard (possibly not even possible). We are working to resolve that. @vchuravy has a suggested fix, but I found that it caused other problems. The ImplicitGlobalGrid team found @vchuravy's fix to work, but with a very recent version of CUDA.jl where I think it isn't supposed to work, so they may have been mistaken.

I was planning to look at this a bit more after having coffee tomorrow with an Nvidia colleague who is involved in all this.

The ImplicitGlobalGrid stuff should get reasonable behavior with the select_device() addition, but I think Oceananigans.jl may have some other problem too, related to passing @view indexing of arrays directly into the MPI calls. So good results for Oceananigans may require some other work too, which @glwagner is looking at.

Lots of details here! Perhaps some of us could zoom tomorrow after I have seen Barton? It might be good to do a little single rank profiling too. That would be useful and would help once we have CUDA meets OpenMPI meets Nvidia drivers back under control.

christophernhill commented 3 years ago

I have tried running the library ImplicitGlobalGrid.jl on 1, 2, and 4 GPUs and have actually found rather bad scaling as well: 57 and 35 percent.

I have created an issue and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans?

@francispoulin (see my above comment). I think ImplicitGlobalGrid.jl as downloaded is not configured to run across multiple GPUs. I added a line in a fork here ( https://github.com/christophernhill/ImplicitGlobalGrid.jl/blob/5e4fd0698b7087467d9314bfa253d6bc9a09a40a/diffusion3D_multigpu_CuArrays_novis.jl#L21 ) that is needed. With that I saw reasonable weak scaling - even with broken CUDA-aware MPI support. Oceananigans.jl has some other things going on.

I agree profiling with nvprof/nsight would be great. This link https://github.com/mit-satori/getting-started/blob/master/tutorial-examples/nvprof-profiling/Satori_NVProf_Intro.pdf and this https://mit-satori.github.io/tutorial-examples/nvprof-profiling/index.html?highlight=profiling might be helpful to get started. The slides also have links to various NVidia bits of documentation.

francispoulin commented 3 years ago

I have tried running the library ImplicitGlobalGrid.jl on 1, 2, and 4 GPUs and have actually found rather bad scaling as well: 57 and 35 percent. I have created an issue and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans?

This then sounds to me like you don't have a working CUDA-aware MPI. IGG should show >90% efficiency

Thanks @vchuravy . The runs for IGG were on a server that has CUDA-aware MPI, so that's not the problem. As @christophernhill points out, there are a lot of other possibilities though.

francispoulin commented 3 years ago

As I said, please post your SLURM script and other environment options. It is impossible to debug otherwise.

I have an annotated SLURM script here https://github.com/CliMA/ClimateMachine.jl/wiki/Satori-Cluster which is what I used a while back for GPU scaling tests. A misconfigured MPI can easily manifest itself as scaling this poor.

@hennyg888 has been very busy this week so hasn't had a chance to respond. The SLURM script that he used was passed down from @christophernhill, and I will let him share that with you, but it might not happen until Monday.

But I suppose I should learn to start running stuff on Satori, since that is something everyone else can use and whose configuration people understand. I'll try to do that on Monday.

francispoulin commented 3 years ago

I have tried running the library ImplicitGlobalGrid.jl on 1, 2, and 4 GPUs and have actually found rather bad scaling as well: 57 and 35 percent. I have created an issue and hope they might have some suggestions as to how to improve the results. Maybe what I learn there might be transferable to Oceananigans?

@francispoulin (see my above comment). I think ImplicitGlobalGrid.jl as downloaded is not configured to run across multiple GPUs. I added a line in a fork here ( https://github.com/christophernhill/ImplicitGlobalGrid.jl/blob/5e4fd0698b7087467d9314bfa253d6bc9a09a40a/diffusion3D_multigpu_CuArrays_novis.jl#L21 ) that is needed. With that I saw reasonable weak scaling - even with broken CUDA-aware MPI support. Oceananigans.jl has some other things going on.

I agree profiling with nvprof/nsight would be great. This link https://github.com/mit-satori/getting-started/blob/master/tutorial-examples/nvprof-profiling/Satori_NVProf_Intro.pdf and this https://mit-satori.github.io/tutorial-examples/nvprof-profiling/index.html?highlight=profiling might be helpful to get started. The slides also have links to various NVidia bits of documentation.

Thanks @christophernhill for all this information. This will be most helpful. Unfortunately, tomorrow I am busy from 9am to 5pm so I don't think I can zoom, but maybe on Monday? I'll try to look into these resources beforehand.

hennyg888 commented 3 years ago

Thank you very much @christophernhill! What I'm running is indeed based off of https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi . Here's the salloc command used:

$ salloc  --mem=16G -n 5 --gres=gpu:3 -t 01:00:00
$ {ROOTDIR}/julia-1.6.2/bin/julia --project=. weak_scaling_shallow_water_model.jl

I also changed the line in weak_scaling_shallow_water_model.jl that launches weak_scaling_shallow_water_model_single.jl to this:

run(`srun --pty -n $R $julia --project=. weak_scaling_shallow_water_model_single.jl $(typeof(decomposition)) $Nx $Ny $Rx $Ry`)

glwagner commented 3 years ago

Might make sense to figure out how to @assert that the benchmark is configured correctly?
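Even something simple like the sketch below (hypothetical, not existing Oceananigans code) would make a misconfigured run visible in the logs before any timing happens:

```julia
using MPI, CUDA, Sockets

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# Each rank reports which GPU it is bound to; if every rank on a node prints device 0,
# the benchmark is not actually spreading work across multiple GPUs.
@info "rank $rank on $(gethostname()) is using $(CUDA.device()) of $(length(CUDA.devices())) visible devices"
```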

francispoulin commented 3 years ago

@christophernhill: I wanted to confirm that I took your clever idea of using select_device() and added it to my code. When I ran it on 1, 2, and 4 GPUs I was able to get efficiencies of 97 percent. So the code is performing very well, and the server can be efficient on multiple GPUs.

The link to where the function is defined is copied below. Is this something that is done automatically in Oceananigans through KernelAbstractions.jl or something else?

https://github.com/christophernhill/ImplicitGlobalGrid.jl/blob/5e4fd0698b7087467d9314bfa253d6bc9a09a40a/src/select_device.jl

In chatting with the developers of ImplicitGlobalGrid.jl, they mentioned that to get good efficiency I should use something called @hide_communication from ParallelStencil.jl's ParallelKernel module. Again, I don't pretend to understand what this does but wanted to share the information I was given.

https://github.com/omlins/ParallelStencil.jl/blob/main/src/ParallelKernel/hide_communication.jl
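For reference, the usage pattern in their examples looks roughly like the following (copied in spirit from the ParallelStencil.jl documentation; the kernel and variable names are from their diffusion example, not from Oceananigans):

```julia
# Compute a boundary region of the given width first, start its halo exchange,
# and overlap the remaining communication with computation of the interior.
@hide_communication (16, 2, 2) begin
    @parallel diffusion3D_step!(T2, T, Ci, lam, dt, dx, dy, dz)
    update_halo!(T2)
end
```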

francispoulin commented 3 years ago

Another thought for @christophernhill

At the talk today on ImplicitGlobalGrid.jl, they were using @view in the simplest code but they dropped it as soon as they started to optimize the code. I believe they started using LazyArrays.jl. I don't know what it is but I suspect it doesn't have the problems that @view might have.

glwagner commented 3 years ago

Another thought for @christophernhill

At the talk today on ImplicitGlobalGrid.jl, they were using @view in the simplest code but they dropped it as soon as they started to optimize the code. I believe they started using LazyArrays.jl. I don't know what it is but I suspect it doesn't have the problems that @view might have.

We think that we cannot send non-contiguous data over MPI between GPUs (only CPUs). Thus certain views will not work. Possibly in this case the data is transferred to CPU, sent over MPI, and then copied back to the GPU (slow).

christophernhill commented 3 years ago

Another thought for @christophernhill

At the talk today on ImplicitGlobalGrid.jl, they were using @view in the simplest code but they dropped it as soon as they started to optimize the code. I believe they started using LazyArrays.jl. I don't know what it is but I suspect it doesn't have the problems that @view might have.

@francispoulin thanks. I think we probably just want to do some buffering. I looked at LazyArrays.jl and I could imagine how that could maybe also be included, but I suspect the main thing is having a buffer (which https://github.com/eth-cscs/ImplicitGlobalGrid.jl has). I don't see any sign of LazyArrays in the https://github.com/eth-cscs/ImplicitGlobalGrid.jl code! We can check with Ludovic though.

francispoulin commented 3 years ago

We think that we cannot send non-contiguous data over MPI between GPUs (only CPUs). Thus certain views will not work. Possibly in this case the data is transferred to CPU, sent over MPI, and then copied back to the GPU (slow).

Interesting. This means that we can't really use CUDA-aware MPI, since that is basically there to allow GPUs to communicate directly. This puts a limit on the efficiency, but I think we can still get something decent up and running.

Can you give me any details as to why this is?

What would be required to fix this in the long term?

francispoulin commented 3 years ago

@francispoulin thanks. I think we probably just want to do some buffering. I looked at LazyArrays.jl and I could imagine how that could maybe also be included, but I suspect the main thing is having a buffer (which https://github.com/eth-cscs/ImplicitGlobalGrid.jl has). I don't see any sign of LazyArrays in the https://github.com/eth-cscs/ImplicitGlobalGrid.jl code! We can check with Ludovic though.

Thanks for looking at this @christophernhill and sorry that I misquoted. At the JuliaCon talk yesterday, they started off talking about a simple repo and then ended up talking about ImplicitGlobalGrid.jl. The link I should have given was this.

If you think that buffering is the way to go then I'm certainly happy to give that a try. Maybe we can have a zoom meeting this week to discuss in more detail?

glwagner commented 3 years ago

We think that we cannot send non-contiguous data over MPI between GPUs (only CPUs). Thus certain views will not work. Possibly in this case the data is transferred to CPU, sent over MPI, and then copied back to the GPU (slow).

Interesting. This means that we can't really use CUDA-aware MPI, since that is basically there to allow GPUs to communicate directly. This puts a limit on the efficiency, but I think we can still get something decent up and running.

Can you give me any details as to why this is?

What would be required to fix this in the long term?

There's no limitation; we just have to send contiguous data over MPI rather than non-contiguous data. We can do this by creating contiguous "buffer" arrays. The algorithm is: 1. copy data from the halos to a buffer; 2. send the buffer; 3. copy the buffer into the halo regions at the receiving end.
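A rough sketch of that three-step pattern for a y-decomposed 2D field, written directly against MPI.jl and CUDA.jl (neighbor ranks, tags, and the halo width H are illustrative, exact MPI.jl signatures vary a bit between versions, and in practice the buffers would be preallocated; this is not the Oceananigans implementation):

```julia
using MPI, CUDA

function exchange_y_halos!(data::CuArray, H, comm, south_rank, north_rank)
    Nx, Ny = size(data)

    # 1. Copy the boundary-adjacent interior slabs into contiguous device buffers.
    send_south = data[:, H+1:2H]
    send_north = data[:, Ny-2H+1:Ny-H]
    recv_south = similar(send_south)
    recv_north = similar(send_north)

    # 2. Exchange the contiguous buffers; CUDA-aware MPI can take CuArrays directly.
    reqs = [MPI.Irecv!(recv_south, south_rank, 0, comm),
            MPI.Irecv!(recv_north, north_rank, 1, comm),
            MPI.Isend(send_south, south_rank, 1, comm),
            MPI.Isend(send_north, north_rank, 0, comm)]
    MPI.Waitall!(reqs)

    # 3. Copy the received buffers into the halo regions.
    data[:, 1:H]       .= recv_south
    data[:, Ny-H+1:Ny] .= recv_north

    return nothing
end
```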

francispoulin commented 3 years ago

Ah, that makes a lot of sense and sounds very doable. I am happy to help with this where I can but don't know the MPI stuff nearly as well as @christophernhill .

glwagner commented 3 years ago

All the MPI stuff is in the Distributed module:

https://github.com/CliMA/Oceananigans.jl/tree/master/src/Distributed

hennyg888 commented 3 years ago

Vastly increased multi-GPU efficiency by designating 1 GPU per process with CUDA.device!(local_rank) in the single-rank case code, right after setting up MPI (https://cuda.juliagpu.org/stable/api/essentials/#CUDA.device!-Tuple{CuDevice}). This was a page taken out of ImplicitGlobalGrid.jl's book, more specifically https://github.com/eth-cscs/ImplicitGlobalGrid.jl/blob/master/src/select_device.jl. Much better results:

size | ranks | min | median | mean | max | memory | allocs | samples
-- | -- | -- | -- | -- | -- | -- | -- | --
(4096, 256) | (1, 1) | 2.702 ms | 2.728 ms | 2.801 ms | 3.446 ms | 2.03 MiB | 5535 | 10
(4096, 512) | (1, 2) | 3.510 ms | 3.612 ms | 4.287 ms | 16.546 ms | 2.03 MiB | 5859 | 20
(4096, 768) | (1, 3) | 3.553 ms | 3.653 ms | 5.195 ms | 39.152 ms | 2.03 MiB | 5859 | 30

size | ranks | slowdown | efficiency | memory | allocs
-- | -- | -- | -- | -- | --
(4096, 256) | (1, 1) | 1.0 | 1.0 | 1.0 | 1.0
(4096, 512) | (1, 2) | 1.32399 | 0.755293 | 1.00271 | 1.05854
(4096, 768) | (1, 3) | 1.33901 | 0.746818 | 1.00271 | 1.05854

I could only get up to 3 GPUs because I'm still doing this on only one node. I will try more ranks and GPUs once changes more significant than my single line of code are added. This was done on Satori with the setup instructions shown here:

https://github.com/christophernhill/onan-jcon2021-bits/blob/main/run/satori/run-on-bench-on-rhel7-satori-with-mpi

The better efficiency is not caused by a slowdown in the non-MPI case either. Both this result and the original one posted above had median one-rank times of around 2.7 ms.
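For anyone reproducing this, the change amounts to something like the following right after MPI is initialized (a sketch; the modulo is only needed when there are more ranks than GPUs per node, and multi-node runs would want a node-local rank instead of the global one):

```julia
using MPI, CUDA

MPI.Init()
local_rank = MPI.Comm_rank(MPI.COMM_WORLD)

# Bind this rank to its own GPU; without this, every rank defaults to device 0.
CUDA.device!(local_rank % length(CUDA.devices()))
```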

System info:

Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
Environment:
  JULIA_MPI_PATH = /home/software/spack/openmpi/3.1.4-nhjzelonyovxks5ydtrxehceqxsbf7ik
  JULIA_CUDA_USE_BINARYBUILDER = false
  JULIA_DEPOT_PATH = /nobackup/users/henryguo/projects/henry-test/Oceananigans.jl/benchmark/.julia
  GPU: Tesla V100-SXM2-32GB

christophernhill commented 3 years ago

@hennyg888 good to see that helped.

I think there is a CUDA.versioninfo() ( https://github.com/JuliaGPU/CUDA.jl/blob/4985b0d5827f776683edb702ff296dcb59ba1097/src/utilities.jl#L42 ) function that would be useful to log alongside the system info.

francispoulin commented 3 years ago

That is a huge leap forward @hennyg888 and great to see! Before we were at roughly 40% efficiency and now we are at 75%, nearly double, which is pretty huge all things considered.

I like @christophernhill's suggestion of adding the version info.

Yesterday when we talked, the consensus was that one major problem was how we do buffering. As a silly experiment, what if we redo this without updating any halos, ever? Physically it's going to be wrong, but do we get another huge increase in efficiency? If the efficiency gets close to 100%, then in my mind that validates the hypothesis. If not, that would signify there is another bottleneck we need to hunt down.

ali-ramadhan commented 3 years ago

Apologies for not participating in this issue and for possibly being the cause of the issue via sending/receiving views...

If we have to send contiguous data we could just modify the underlying_*_boundary functions to convert the view into a contiguous array.

Receiving is done straight into the halo view (a trick(?) that seems to work nicely on the CPU), so we would probably need to create a new buffer of the right size to receive into and then copy it into the halo: https://github.com/CliMA/Oceananigans.jl/blob/master/src/Distributed/halo_communication.jl#L162-L166

Also not sure if relevant but I remember @hennyg888 and @francispoulin suggesting that placing an MPI.Barrier() at the end of each time step helped with a certain scaling benchmark?

glwagner commented 3 years ago

@ali-ramadhan I'm planning to pursue an abstraction wherein contiguous buffers are preallocated. It'd be great to discuss this!
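Just to make the idea concrete, a strawman sketch (names and layout are illustrative, not a settled design):

```julia
using CUDA

# Hypothetical container of preallocated, contiguous halo-exchange buffers for a
# y-decomposed field of width Nx with halo width H, built once and reused every exchange.
struct YHaloBuffers{B}
    send_south :: B
    recv_south :: B
    send_north :: B
    recv_north :: B
end

YHaloBuffers(FT, Nx, H) = YHaloBuffers((CUDA.zeros(FT, Nx, H) for _ in 1:4)...)
```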

ali-ramadhan commented 3 years ago

That would definitely be nice. Are you thinking of putting them inside the Multi{C,G}PU architectures?

hennyg888 commented 3 years ago

As suggested by @francispoulin, the lines at https://github.com/CliMA/Oceananigans.jl/blob/master/src/Models/ShallowWaterModels/update_shallow_water_state.jl#L19-L22 were commented out to remove the halo filling and communication between ranks. This gave perfect efficiency up to 3 ranks. This was mainly done to locate possible bottlenecks and is not a legitimate change to the code. It was expected that the halo communication is what's causing the efficiency decreases, and this confirms that there are no other undetected causes for the efficiency drops.

size | ranks | slowdown | efficiency | memory | allocs
-- | -- | -- | -- | -- | --
(4096, 256) | (1, 1) | 1.0 | 1.0 | 1.0 | 1.0
(4096, 512) | (1, 2) | 0.988079 | 1.01206 | 1.06328 | 1.0406
(4096, 768) | (1, 3) | 0.992832 | 1.00722 | 1.06328 | 1.0406

system environment and CUDA.versioninfo():

Oceananigans v0.60.0
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: Linux (powerpc64le-unknown-linux-gnu)
  CPU: unknown
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, pwr9)
Environment:
  JULIA_MPI_PATH = /home/software/spack/openmpi/3.1.4-nhjzelonyovxks5ydtrxehceqxsbf7ik
  JULIA_CUDA_USE_BINARYBUILDER = false
  JULIA_DEPOT_PATH = /nobackup/users/henryguo/projects/henry-test/.julia
  GPU: Tesla V100-SXM2-32GB

CUDA toolkit 10.1.243, local installation
CUDA driver 10.2.0
NVIDIA driver 440.64.0
Libraries: 
- CUBLAS: 10.2.2
- CURAND: 10.1.1
- CUFFT: 10.1.1
- CUSOLVER: 10.2.0
- CUSPARSE: 10.3.0
- CUPTI: 12.0.0
- NVML: 10.0.0+440.64.0
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.6.2
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device capability support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
3 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 31.738 GiB / 31.749 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 31.738 GiB / 31.749 GiB available)
  2: Tesla V100-SXM2-32GB (sm_70, 31.738 GiB / 31.749 GiB available)

francispoulin commented 3 years ago

Thanks @hennyg888 for confirming this. The result is as it should be, and I think you have confirmed that when we get the buffering working for MPI, that should drastically improve the scaling on multiple GPUs.

glwagner commented 3 years ago

Thanks @hennyg888 for confirming this. The result is as it should be, and I think you have confirmed that when we get the buffering working for MPI, that should drastically improve the scaling on multiple GPUs.

I guess I'd say that we have confirmed it's the MPI communication / halo filling that causes the drop in efficiency. Next we have to figure out whether we can design a communication system that's efficient! Contiguous buffers are promising but not guaranteed, I think.

glwagner commented 2 years ago

@kpamnany might be interested in this issue.

glwagner commented 1 year ago

Since we use buffered communication, this is solved.