Question about performance (mainly cuda code)

afernandezody commented 4 years ago

Hello, I'm performing some tests (problem 1) with the MPI versions but the results are not as expected. Increasing the number of MPI ranks does not change the simulation parameters (e.g. dofs or cg iterations) as far as I can tell, but the job is split into the number of ranks. Then, the computational times multiply (in some cases roughly by the number of ranks) and the rates decrease. I was wondering if this is the expected behavior or if the problem size should increase according to the number of ranks in which case I'd be missing something.
Thanks.

vladotomov commented 4 years ago

Hi @afernandezody,

Just increasing the number of MPI ranks should not change the problem size or its parameters. Generally, if you have not crossed over a certain the parallel performance limit of your machine, the computational times should decrease, and the rates should increase.

Let us know if we can help further.

Vladimir

afernandezody commented 4 years ago

P.S. You might have noticed that the rates for L2 sometimes come back as negative (CPU-only version).

CG (H1) total time: 689.9712326320
CG (H1) rate (megadofs x cg_iterations / second): 131.1659556425

CG (L2) total time: 46.6173648610
CG (L2) rate (megadofs x cg_iterations / second): -13.9148862218

Forces total time: 42.9381951400
Forces rate (megadofs x timesteps / second): 185.9144871360

UpdateQuadData total time: 180.6026046180
UpdateQuadData rate (megaquads x timesteps / second): 58.6854441353

Major kernels total time (seconds): 909.4371813490
Major kernels total rate (megadofs x time steps / second): 119.9448570205

Energy  diff: 5.21e-02

afernandezody commented 4 years ago

Hi Vladimir, I've edited my previous comments to reflect further testing and more concise language. The CPU-only version seems to be working fine (with the minor hiccup of the occasional negative rate), and better scalability is a function of choosing the right parameters (see the attachment for results at low levels of refinement). However, the cuda version is still not behaving as expected: i) The scalability is just not there as using more GPUs always increase computational times and decrease rates; playing around with the refinement levels did not help. ii) Activating Unified Memory increases computational times
iii) Running the problem (1) with the parameters -rs 2 & -rp 5 results in the code crashing. After 19,510 time-steps, the code begins to repeat the steps over and over until it crashes with the error:

MFEM abort: The time step crashed!
 ... in function: int main(int, char**)
 ... in file: laghos.cpp:457

Thank you, Arturo laghos_CPU_scalability_low_refinement.txt

vladotomov commented 4 years ago

Hi Arturo,

Could you give me the command line that produces the negative rate on CPU?

To get a sense what GPU tests we perform for Laghos, take a look at pages 12 and 13 of https://ceed.exascaleproject.org/docs/ceed-ms32-report.pdf

Could you also send me the command line that causes the time step crash?

Thank you!

afernandezody commented 4 years ago

Hello Vladimir, I got the negative rates running on just 16 ranks (using the SGE scheduler) with the options: [mpirun] -p 1 -m $HOME/###/Laghos-cts2/data/square01_quad.mesh -rs 4 -rp 2 -o -q -ra [mpirun] -p 1 -m $HOME/###/Laghos-cts2/data/square01_quad.mesh -rs 2 -rp 5 -o -q -ra The code crashed on a single K-80 with: ./laghos -aware -p 1 -m $HOME/###/Laghos/data/square01_quad.mesh -rs 2 -rp 5 -tf 0.8 -pa -cfl 0.05 Going down to rp 4 fixed the issue. Thank you for the report!

afernandezody commented 4 years ago

Hi Vladimir, I'm a bit concerned about the orders of magnitude and that that might be causing the issues when using several GPUs. Looking at Fig. 13 in the report, two things stand out: (1) the rate is given in gigadofs x CG(H1-iter)/s whereas my output rate is in megadofs x CG(H1-iter)/s; (2) the legend indicates 8M and 32M, which I'm interpreting as around 8 and 32 millions dofs, whereas the size of the problem for 'mpirun -np 4 laghos -p 1 -m ../data/square01_quad.mesh -rs 3 -tf 0.8 -pa -cfl 0.05' is 2,178 (number of kinematic dofs) and 1,024 (number of specific internal energy dofs). Furthermore, the mesh shown in the report (Fig. 11) is substantially finer than the one shown in GitHub. My question boils down to how to carry out the simulations shown in the report (mainly the ones marked as 8M and 32M) because that is precisely what I'm trying to accomplish (measuring times and rate at increasing number of GPUs). Would I need to modify any file or is it possible to simply provide further parameters from the command line?
Thanks, Arturo

camierjs commented 4 years ago

Hi Arturo,

The runs in the report have not been done with the version found in the cuda subdir of Laghos.

The cts2 version of Laghos is based on MFEM 4.0 and the runs in the CEED report also follow this effort to migrate and use the abstraction of the library.

We used the following command line: ./laghos -p 6 -d cuda -o -q -ok 4 -ot 3 -m data/square01_quad.mesh -rs 4 -rp 7 -tf 0.3 on 256 nodes, and used the okina branch of Laghos and the laghos branch of MFEM.

We can help you with these branches as they are going to be merged soon.

Jean-Sylvain

afernandezody commented 4 years ago

Hi Jean-Sylvain, It seems like I'm going to need a little bit of help to compile this version as my attempts keep failing. I haven't touched the branch laghos-v2.0 (assuming this is the right fit for cts2) for MFEM so the main change is to use the 'okina' branch for Laghos. The branches okina-mmu and okina-nvidia are also in the GitHub tree but I didn't touch them. The compilation output results in almost 500 lines with several errors so the box just shows the ones at the top and bottom:

In file included from gridFuncToQuadS.cpp:16:0:
../cuda.hpp:56:1: error: ‘__device__’ does not name a type; did you mean ‘CUdevice’?
 __device__ __forceinline__ void mallocBuf(void** ptr, void** buf_ptr, int size)
 ^~~~~~~~~~
 CUdevice
In file included from qDataUpdateS.cpp:16:0:
../cuda.hpp:56:1: error: ‘__device__’ does not name a type; did you mean ‘CUdevice’?
 __device__ __forceinline__ void mallocBuf(void** ptr, void** buf_ptr, int size)
 ^~~~~~~~~~
 CUdevice
In file included from ../cuda.hpp:43:0,
                 from gridFuncToQuadS.cpp:16:
../include/forall.hpp:26:16: error: ‘__global__’ does not name a type; did you mean ‘__locale_t’?
 #define kernel __global__
                ^
gridFuncToQuadS.cpp:21:33: note: in expansion of macro ‘kernel’
          const int NUM_QUAD_1D> kernel
                                 ^~~~~~
In file included from ../cuda.hpp:43:0,
                 from qDataUpdateS.cpp:16:
../include/forall.hpp:26:16: error: ‘__global__’ does not name a type; did you mean ‘__locale_t’?
 #define kernel __global__
                ^
qDataUpdateS.cpp:22:33: note: in expansion of macro ‘kernel’
          const int NUM_DOFS_1D> kernel
                                 ^~~~~~
../include/forall.hpp:26:16: error: ‘__global__’ does not name a type; did you mean ‘__locale_t’?
 #define kernel __global__

...

../include/forall.hpp:34:57: note: in definition of macro ‘call0’
 #define call0(id,grid,blck,...) call[id]<<<grid,blck>>>(__VA_ARGS__)
                                                         ^~~~~~~~~~~
qDataUpdateS.cpp:670:14: warning: unused variable ‘b1d’ [-Wunused-variable]
    const int b1d = (NUM_QUAD_1D<NUM_DOFS_1D)?NUM_DOFS_1D:NUM_QUAD_1D;
              ^~~
makefile:197: recipe for target 'cuda/kernels/share/gridFuncToQuadS.o' failed
make: *** [cuda/kernels/share/gridFuncToQuadS.o] Error 1
make: *** Waiting for unfinished jobs....
makefile:197: recipe for target 'cuda/kernels/share/qDataUpdateS.o' failed
make: *** [cuda/kernels/share/qDataUpdateS.o] Error 1

I was also wondering if this version requires any other package such as OCCA or PETSc. Thanks, Arturo

camierjs commented 4 years ago

Hi Arturo,

Here is the link to build and test for the CPU side: CTS2-Laghos-Benchmark-190507.pdf.

Now for the GPU side, which is still in development in order to sync both MFEM and Laghos, you need to use: the okina branch of Laghos with the laghos branch of MFEM.

You can do a make distclean && make pcuda -j in MFEM, make sure that Laghos is using this version of MFEM, and do a make clean && make -j in the root directory of Laghos.

One quick test can be made by typing: make 1, the output should be like:

OK: laghos1-p0-square01_quad-o-q
OK: laghos1-p5-square01_quad-o-q
OK: laghos1-p4-square01_quad-o-q
OK: laghos1-p3-square01_quad-o-q
OK: laghos1-p6-square01_quad-o-q
OK: laghos1-p2-square01_quad-o-q
OK: laghos1-p1-square01_quad-o-q

Now you'll have a version of Laghos with the right options, like in the pdf file, but will also have access to the cuda device by using the -o -q -d cuda option: ./laghos -p 1 -m data/square01_quad.mesh -rs 3 -o -q -f -d cuda.

Tell me if that works for you. Jean-Sylvain

afernandezody commented 4 years ago

Hi Jean-Sylvain, It's still not clicking for me. Firstly, I'm not sure what 'pcuda' does or how it differs from cuda. Anyway, this is a summary of the attempts for building MFEM: 1) If I use the 'laghos-v2.0' branch for MFEM, the error message reads

______@ip-172-31-13-234:~/GGG/mfem$ make pcuda -j 4 CUDA_ARCH=sm_37
make: *** No rule to make target 'pcuda'.  Stop.

Cleaning up everything and switching to the 'laghos' branch for MFEM results in the following errors: 2) If compiling with $make pcuda -j 4 CUDA_ARCH=sm_37

______@ip-172-31-13-234:~/GGG/mfem$ make pcuda -j 4 CUDA_ARCH=sm_37
make -f makefile config MFEM_USE_MPI=YES MFEM_DEBUG=NO \
   MFEM_USE_CUDA=YES CUDA_ARCH=sm_37
make[1]: Entering directory '/home/______/GGG/mfem'
make -C config all
make[2]: Entering directory '/home/______/GGG/mfem/config'
nvcc -I../../hypre/src/hypre/include get_hypre_version.cpp -o get_hypre_version
get_hypre_version.cpp:12:10: fatal error: HYPRE_config.h: No such file or directory
 #include "HYPRE_config.h"
          ^~~~~~~~~~~~~~~~
compilation terminated.
makefile:47: recipe for target 'get_hypre_version' failed
make[2]: *** [get_hypre_version] Error 1
make[2]: Leaving directory '/home/______/GGG/mfem/config'
makefile:576: recipe for target 'local-config' failed
make[1]: *** [local-config] Error 2
make[1]: Leaving directory '/home/______/GGG/mfem'
makefile:462: recipe for target 'pcuda' failed
make: *** [pcuda] Error 2

The status is:

make status
MFEM_VERSION           = 40001 [v4.0.1]
MFEM_GIT_STRING        = heads/laghos-0-ge5baf04465172b7c4c3e038bcd2eee23926c54e6
MFEM_USE_MPI           = YES
MFEM_USE_METIS         = YES
MFEM_USE_METIS_5       = YES
MFEM_DEBUG             = NO
MFEM_USE_EXCEPTIONS    = NO
MFEM_USE_GZSTREAM      = NO
MFEM_USE_LIBUNWIND     = NO
MFEM_USE_LAPACK        = NO
MFEM_THREAD_SAFE       = NO
MFEM_USE_OPENMP        = NO
MFEM_USE_LEGACY_OPENMP = NO
MFEM_USE_MEMALLOC      = YES
MFEM_TIMER_TYPE        = 2
MFEM_USE_SUNDIALS      = NO
MFEM_USE_MESQUITE      = NO
MFEM_USE_SUITESPARSE   = NO
MFEM_USE_SUPERLU       = NO
MFEM_USE_STRUMPACK     = NO
MFEM_USE_GECKO         = NO
MFEM_USE_GNUTLS        = NO
MFEM_USE_NETCDF        = NO
MFEM_USE_PETSC         = NO
MFEM_USE_MPFR          = NO
MFEM_USE_SIDRE         = NO
MFEM_USE_CONDUIT       = NO
MFEM_USE_PUMI          = NO
MFEM_USE_CUDA          = NO
MFEM_USE_HIP           = NO
MFEM_USE_RAJA          = NO
MFEM_USE_OCCA          = NO
MFEM_CXX               = mpicxx
MFEM_CPPFLAGS          =
MFEM_CXXFLAGS          = -O3 -std=c++11
MFEM_TPLFLAGS          = -I/home/______/GGG/mfem/../hypre-2.11.2/src/hypre/include -I/home/______/GGG/mfem/../metis-5.1.0/include
MFEM_INCFLAGS          = -I$(MFEM_INC_DIR) $(MFEM_TPLFLAGS)
MFEM_FLAGS             = $(MFEM_CPPFLAGS) $(MFEM_CXXFLAGS) $(MFEM_INCFLAGS)
MFEM_LINK_FLAGS        = -O3 -std=c++11 -I. -I/home/______/GGG/mfem/../hypre-2.11.2/src/hypre/include -I/home/______/GGG/mfem/../metis-5.1.0/include
MFEM_EXT_LIBS          = -L/home/______/GGG/mfem/../hypre-2.11.2/src/hypre/lib -lHYPRE -L/home/______/GGG/mfem/../metis-5.1.0/lib -lmetis -lrt
MFEM_LIBS              = -L$(MFEM_LIB_DIR) -lmfem $(MFEM_EXT_LIBS)
MFEM_LIB_FILE          = $(MFEM_LIB_DIR)/libmfem.a
MFEM_BUILD_TAG         = Linux ip-172-31-13-234 x86_64
MFEM_PREFIX            = ./mfem
MFEM_INC_DIR           = $(MFEM_DIR)
MFEM_LIB_DIR           = $(MFEM_DIR)
MFEM_STATIC            = YES
MFEM_SHARED            = NO
MFEM_BUILD_DIR         = .
MFEM_MPIEXEC           = mpirun
MFEM_MPIEXEC_NP        = -np
MFEM_MPI_NP            = 4

I think that the 2nd option is the closest to what you suggested, but the 'pcuda' seems to not recognize hypre, not sure why. Thanks, Arturo

camierjs commented 4 years ago

pcuda is the shortcut for parallel cuda.

You should follow the Readme in order to compile Metis, Hypre and MFEM at the same level in some directory: this should allow you to be able to compile the parallel version of MFEM.

Here is a script that will help you to start: build_okina.txt. You might need to change the CUDA_ARCH in it, but should end up with a good mfem/hypre/metis/laghos setup.

Jean-Sylvain

afernandezody commented 4 years ago

Hello Jean-Sylvain, It seems to be finally compiling and the last step is to check out how the code scales up, which was the origin of the thread. Two things were required to compile the okina branch of laghos-cuda: 1) move the hypre.version directory to hypre; 2) use v4.0.3 for metis (v5.1.0 did not work). The thing with the readme files is that they do not always agree. For example, the readme files at GitHub (Laghos/cuda) and the download from the okina download both suggest the use of the laghos-v2.0 branch rather simply laghos, which didn't work for the okina branch. The pointers to the metis version look like fine (v5.1.0 for the master branch and v4.0.3 for the okina branch). On a minor note, I didn't understand some of the suggested execution flags (-d cuda -o -q). Thanks, Arturo

afernandezody commented 4 years ago

Hi Jean-Sylvain, I thought that this was under control as it was compiling and started the simulation. However, it's stopping with the following error:

MFEM abort: Bad number given for problem id!
 ... in function: void mfem::hydrodynamics::v0(const mfem::Vector&, mfem::Vector&)
 ... in file: laghos.cpp:696

I couldn't find any thread of someone reporting a similar issue, but it looks like something related to problem 6 or meshing error. It happens while trying to run on 8 K80s with the command: [mpirun -np 8 laghos] -aware -p 6 -ok 4 -ot 3 -m ../data/square01_quad.mesh -rs 4 -rp 7 -tf 0.3, which is slightly different from the previously suggested one.
Thanks, Arturo

camierjs commented 4 years ago

Hi Arturo,

Please make sure to stay at the root directory of Laghos: when I see the ../data/square01_quad.mesh, that means you're one level too deep and the aware option is only for the cuda subdirectory.

If you stay at the top level, you should be able to run: mpirun -n 2 ./laghos -p 6 -ok 4 -ot 3 -m ./data/square01_quad.mesh -rs 1 -rp 1 -o -q -tf 0.3 for a cpu run and mpirun -n 2 ./laghos -p 6 -ok 4 -ot 3 -m ./data/square01_quad.mesh -rs 1 -rp 1 -o -q -d cuda -gam -tf 0.3 for a cuda run.

Jean-Sylvain

afernandezody commented 4 years ago

Jean-Sylvain, I'm submitting the job via scheduler and the script includes the whole path (I simply shortened it for brevity in the post, sorry for the confusion). The software took the mesh properties w/o any issue. The output file reads:

       __                __                 
      / /   ____  ____  / /_  ____  _____   
     / /   / __ `/ __ `/ __ \/ __ \/ ___/ 
    / /___/ /_/ / /_/ / / / / /_/ (__  )    
   /_____/\__,_/\__, /_/ /_/\____/____/  
               /____/                       

Options used:
   --mesh /home/ubuntu/GGG/Laghos/data/square01_quad.mesh
   --refine-serial 4
   --refine-parallel 7
   --problem 6
   --order-kinematic 4
   --order-thermo 3
   --ode-solver 4
   --t-final 0.3
   --cfl 0.5
   --cg-tol 1e-08
   --cg-max-steps 300
   --max-steps -1
   --partial-assembly
   --no-impose-viscosity
   --no-visualization
   --visualization-steps 5
   --no-visit
   --no-print
   --outputfilename results/Laghos
   --no-uvm
   --aware
   --no-hcpo
   --no-sync
   --no-share
   --no-checks
[32m[laghos] MPI [1mIS CUDA aware[m
[32m[laghos] CUDA device count: 1[m
[32m[laghos] Rank_0 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_7 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_1 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_6 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_4 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_5 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_3 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Rank_2 => Device_0 (Tesla K80:sm_3.7)[m
[32m[laghos] Non-Cartesian partitioning through METIS will be used[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
[32m[laghos] pmesh->GetNE()=125[m
Number of kinematic (position, velocity) dofs: 536936450
Number of specific internal energy dofs: 268435456

After completing these operations, it's when it stops and outputs the above error message.

camierjs commented 4 years ago

This is again the cuda version of Laghos in the cuda subdir.

With this branch, you don't need to use it: it will be removed in the next version.

The idea is to have the 'top' laghos executable able to target all the different backends from the same source code.

That's why you should have access to the -o -q -d cuda options that let you use your CUDA device.

afernandezody commented 4 years ago

Hi Jean-Sylvain, Yesterday, and in the hurry to get the code running and close the ticket, I couldn't compile in the Laghos subdirectory (with the error "makefile:202: *** non-numeric first argument to 'word' function: '{1..1}'. Stop." that I had never seen before, so I switched to the cuda subdirectory and was able to compile (the system was telling me that it was still on the 'okina' branch). My mistake was probably not to troubleshoot that error. Thanks, Arturo

afernandezody commented 4 years ago

Hello Jean-Sylvain, While trying to recompile the code twice earlier (one of them from a clean sheet by reinstalling the software components), the compilation from the laghos directory is still failing with the error: makefile:202: *** non-numeric first argument to 'word' function: '{1..1}'. Stop. This line seems to refer to the definition of the mfem_test_example (maybe a wrong type of argument or parameter). Could you confirm this error is not happening at your system?
Thanks, Arturo

camierjs commented 4 years ago

Hi Arturo,

I don't have this error, it might be specific to the make version.

I've added another option to the options by default: please update and tell me if it helps.

afernandezody commented 4 years ago

Hi Jean-Sylvain, Still getting the error albeit with a slight change: makefile:202: *** non-numeric first argument to 'word' function: '{1..3}'. Stop.

Also, some info about my make:

make --version
GNU Make 4.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

camierjs commented 4 years ago

Hi Arturo,

I changed the makefile options so you can make a new try.

afernandezody commented 4 years ago

Hi Jean-Sylvain, I was able to compile yesterday night and run a serial job in one GPU w/o any issue. However, I'm running into problems when trying to run MPI jobs this morning. The code keeps running into segmentation faults (never a good thing!) while generating the partitions and mapping. Also, a 4 MPI ranks job seems to generate a problem with a size of about 190M dofs (it feels almost if the problem size kept growing and growing for some reason). I need to troubleshoot this in a systematic way, but it probably has to wait until Monday. I'll post an update once it's a bit more clear what is going on.
Thanks, Arturo P.S. The GPU version is still generating many more DOFs than I was anticipating. On Monday, I'll build the CPU-only version and try to isolate whether the errors that I'm seeing are caused by the gpu-aware-mpi or if there is something else.

afernandezody commented 4 years ago

Hello Jean-Sylvain, I apologize for posting this so late. I had some trouble compiling this morning -probably related to one of the compilers- but was able to build the OKINA (non-CUDA) MPI version for CPU-based clusters, which works great and is able to run problem 6 (the cts2 version cannot but you're probably aware of that). Anyhow, as I was suspecting on Saturday, there is something funny with the pcuda version and the DOFs. I'm listing the DOFs for the CPU-only version: laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 1 -rp 1 -o -q -tf 0.3-> 2178/2178 1024/1024 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 1 -o -q -tf 0.3-> 4306/8450 2048/4096 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 3 -o -q -tf 0.3-> 66370/132098 32768/65536 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 4 -o -q -tf 0.3-> 1051906/2101250 524288/1048576 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 6 -o -q -tf 0.3-> 2101250/8396802 1048576/41964304 Not sure about the last one but let me move on to the DOFs with the GPU: laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 1 -rp 1 -o -q -d cuda -gam -tf 0.3 -> 2178/2178 1024/1024 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 1 -o -q -d cuda -gam -tf 0.3 -> 8450/8450 4096/4096 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 3 -o -q -d cuda -gam -tf 0.3 -> 132098/132098 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 4 -o -q -d cuda -gam -tf 0.3 -> 526378/526378 262144/262144 laghos -p 6 -ok 4 -ot 3 -m $LaghosHomeDirectory/data/square01_quad.mesh -rs 2 -rp 6 -o -q -d cuda -gam -tf 0.3 -> 8396902/8396902 4194304/4194304 I'm unsure why of the same number of DOFs for position and velocity (not a 1D problem) and, when I tried -rs 4 -rp 7, the number of DOF soared to hundreds of millions. I hope not to be missing something but this is what I got. Merry Christmas (if you celebrate it) or Happy Holidays, Arturo

afernandezody commented 4 years ago

Hello Jean Sylvain, I tried recompiling Laghos with the changes made in mfem/mfem (laghos branch) but got several compilation errors. I was wondering if the plan is still for the Laghos-PCUDA version to run based on the Okina branch with mfem/laghos. Thanks. Arturo

camierjs commented 4 years ago

Hi Arturo, Yes, you are right: I am waiting on a couple of PR to go into MFEM master for the okina branch of Laghos to be merged. Best, Jean-Sylvain

afernandezody commented 4 years ago

Hi Jean-Sylvain, Thanks. Looking forward to trying it. Arturo

afernandezody commented 4 years ago

Hi Jean Sylvain, I just wanted to ask about the status of the pcuda version. My last attempt was about 3 weeks ago, when most everything was working except for simulations running on several GPUs. I noticed that you made several changes to the mfem/laghos branch and was wondering if the final changes have been completed. Thanks. Arturo

camierjs commented 4 years ago

Hi Arturo,

We are working at finishing the integration of the okina branch into master.

You should be able to reproduce these kind of results I ran yesterday on some V100:

- (1 MPI, 1 GPU): ./laghos -f -mb -ms 4 -o -q -d cuda -rs 3 -rp 2 -c '1 1 1'
- (2 MPI, 2 GPU): ./laghos -f -mb -ms 4 -o -q -d cuda -rs 3 -rp 2 -c '1 1 2'
- (4 MPI, 4 GPU): ./laghos -f -mb -ms 4 -o -q -d cuda -rs 3 -rp 2 -c '1 2 2'
- (8 MPI, 8 GPU): ./laghos -f -mb -ms 4 -o -q -d cuda -rs 3 -rp 2 -c '2 2 2'

| Ranks |  Zones | H1 dofs | L2 dofs | QP |   N dofs |     FOM1 |    T1 |      FOM2 |    T2 |     FOM3 |    T3 |      FOM |    TT |
|-------+--------+---------+---------+----+----------+----------+-------+-----------+-------+----------+-------+----------+-------|
|     1 | 262144 | 6440067 | 2097152 | 64 | 31754502 | 1174.845 | 2.653 | 11235.559 | 0.015 |  497.438 | 0.708 | 1078.032 | 3.377 |
|     2 | 262144 | 6440067 | 2097152 | 64 | 31754502 | 1907.814 | 1.634 | 13186.858 | 0.013 |  884.474 | 0.398 | 1808.965 | 2.012 |
|     4 | 262144 | 6440067 | 2097152 | 64 | 31754502 | 3168.475 | 0.984 | 11110.117 | 0.015 | 1343.503 | 0.262 | 2936.045 | 1.240 |
|     8 | 262144 | 6440067 | 2097152 | 64 | 31754502 | 4745.724 | 0.657 | 13119.221 | 0.013 | 2375.375 | 0.148 | 4560.312 | 0.798 |

I'm using a MPI launcher that binds one GPU per MPI rank.

afernandezody commented 4 years ago

Hello Jean Sylvain, Your timing is excellent as I was planning on configuring/testing some systems tomorrow and Saturday. I'll let you know if everything works fine (the objective is to try running it in up to 128 GPUs). Hope that current events are not affecting you. Best, Arturo

afernandezody commented 4 years ago

Hi, I just finished building Laghos but I'm getting the error:

 mpirun -np 1 ./laghos -f -mb -ms 4 -o -q -d cuda -rs 3 -rp 2 -c '1 1 1'

       __                __
      / /   ____  ____  / /_  ____  _____
     / /   / __ `/ __ `/ __ \/ __ \/ ___/
    / /___/ /_/ / /_/ / / / / /_/ (__  )
   /_____/\__,_/\__, /_/ /_/\____/____/
               /____/

Wrong option format: cuda -rs

Obviously, I'm not using any launcher yet but the command line. I probably won't have time to troubleshoot until tomorrow, but I'm posting the error in case you have any comment. Thanks, Arturo

camierjs commented 4 years ago

My bad, I added the dim option that got mixed with the device one.

camierjs commented 4 years ago

Hi Arturo,

With the new version, you should remove the -o and -q options we were using so far.

Here are few command lines used to create the table below:

  - bsub -N  1 lrun -T1 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '1 1 1'
  - bsub -N  1 lrun -T2 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '1 1 2'
  - bsub -N  1 lrun -T4 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '1 2 2'
  - bsub -N  2 lrun -T4 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '2 2 2'
  - bsub -N  4 lrun -T4 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '4 2 2'
  - bsub -N  8 lrun -T4 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '4 4 2'
  - bsub -N 16 lrun -T4 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '4 4 4'

I kept the bsub/lrun to show that you must use your job launcher to bind one MPI rank to one different GPU.

| Ranks |  Zones | H1 dofs | L2 dofs |   N dofs |     FOM1 |    T1 |      FOM2 |    T2 |      FOM3 |    T3 |      FOM |    TT | 
|     1 | 262144 | 6440067 | 2097152 | 31754502 | 1172.960 | 2.657 | 11120.021 | 0.015 |   496.348 | 0.710 | 1076.127 | 3.383 |
|     2 | 262144 | 6440067 | 2097152 | 31754502 | 1909.689 | 1.632 | 12971.943 | 0.013 |   847.317 | 0.416 | 1793.159 | 2.030 | 
|     4 | 262144 | 6440067 | 2097152 | 31754502 | 3153.185 | 0.989 | 10602.253 | 0.016 |  1391.092 | 0.253 | 2930.214 | 1.242 |
|     8 | 262144 | 6440067 | 2097152 | 31754502 | 4672.727 | 0.667 | 10950.386 | 0.016 |  2318.458 | 0.152 | 4493.076 | 0.810 |
|    16 | 262144 | 6440067 | 2097152 | 31754502 | 6317.932 | 0.493 | 19868.627 | 0.009 |  4316.859 | 0.082 | 6368.615 | 0.572 |
|    32 | 262144 | 6440067 | 2097152 | 31754502 | 7095.817 | 0.439 | 21718.299 | 0.008 |  7612.594 | 0.046 | 7481.201 | 0.487 |
|    64 | 262144 | 6440067 | 2097152 | 31754502 | 7807.347 | 0.399 | 27751.121 | 0.006 | 12695.731 | 0.028 | 8479.957 | 0.429 |

Hope that will help you!

Jean-Sylvain

afernandezody commented 4 years ago

Hi Jean-Sylvain, I probably won't have time to begin testing until tomorrow but the okina branch has been absent from the Github tree for a few days. Does the installation require this branch or is it supposed to be now done from the master? Thanks, Arturo

camierjs commented 4 years ago

It has been merged, yes: we should be able to reproduce these results from it directly now.

afernandezody commented 4 years ago

Hello Jean-Sylvain, The new compilation went through but the code is breaking with a segmentation fault error (posted below). Let me explain the steps that I followed before asking a couple of specific questions. The installation includes hypre v2.11.2 & metis v4.0.3, mfem (main branch) and Laghos (main branch). As far as mfem, I'm still compiling with make pcuda -j 4 CUDA_ARCH=sm_37 as this first system uses K80s (the deployment of the system with V100s takes some time and will probably not be ready until the weekend). Building Laghos follows standard procedure make -j 4. I have tried several commands (e.g. mpirun -np 1 ./laghos -f -mb -ms 4 -d cuda -rs 3 -rp 2 -c '1 1 1'), and decreasing the number of parallel & serial refinements, but all of them result in similar errors. It's a bit unclear to me the significance of 'lrun -T1' in your script (as the compiler is not able to interpret it) so are these parameters telling the app which problem to solve or are they something else ('help' doesn't include them)? Thanks. Arturo

       __                __
      / /   ____  ____  / /_  ____  _____
     / /   / __ `/ __ `/ __ \/ __ \/ ___/
    / /___/ /_/ / /_/ / / / / /_/ (__  )
   /_____/\__,_/\__, /_/ /_/\____/____/
               /____/

Options used:
   --dimension 3
   --mesh default
   --refine-serial 3
   --refine-parallel 2
   --cartesian-partitioning '1 1 1'
   --problem 1
   --order-kinematic 2
   --order-thermo 1
   --order-intrule -1
   --ode-solver 4
   --t-final 0.6
   --cfl 0.5
   --cg-tol 1e-08
   --ftz-tol 0
   --cg-max-steps 300
   --max-steps 4
   --partial-assembly
   --no-impose-viscosity
   --no-visualization
   --visualization-steps 5
   --no-visit
   --no-print
   --outputfilename results/Laghos
   --partition 0
   --device cuda
   --no-checks
   --mem
   --fom
   --no-gpu-aware-mpi
   --dev 0
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-47-188 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

camierjs commented 4 years ago

Hi Arturo,

I don't see the error with the same setup (hypre, metis and sm_37).

Here are the compiler versions I'm using: cuda 10.1, V10.1.168 and gcc (GCC) 7.5.0.

How many memory do you have on the K80, 24GB? It should be fine, as the run takes ~11GB.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro GV100        Off  | 00000000:17:00.0 Off |                  Off |
| 31%   45C    P2   126W / 250W |  11378MiB / 32508MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 00000000:B3:00.0  On |                  N/A |
|  0%   44C    P0    44W / 151W |    180MiB /  8094MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    229586      C   ./laghos                                   11367MiB |
|    1     50738      G   /usr/bin/X                                    63MiB |
|    1     50798      G   /usr/bin/gnome-shell                         104MiB |
+-----------------------------------------------------------------------------+

The lrun was to show our launcher to bind each GPU to each MPI rank: something specific to the cluster.

Jean-Sylvain

afernandezody commented 4 years ago

Hi, My CUDA is the latest release (10.2.89_440_33) but I doubt this is the origin of the error (gcc is also 7.5.0). I believe that AWS splits the memory with their K80 instances so the app probably only has access to 12 GB (the test with lower refinement was meant to use less memory although I don't know how much memory this saves if any at all). However, the error type is segmentation fault (11) and my recollection is that the error is (12) when an app cannot allocate enough memory. It could be that the code is trying to write/read just outside the available memory and returns error (11), but this is just a speculation that might not be factual. Other than using a different GPU, I cannot think what else to try. Best, Arturo

afernandezody commented 4 years ago

Hi Jean-Sylvain, I'm still trying to figure out which is the problem. Quick question: Are you using GDRcopy or GPUDirect? If the former, is it the master or a specific branch? Thanks, Arturo

camierjs commented 4 years ago

Hi Arturo, no GDRcopy and by default no GPUDirect too. There is a flag to enable GPUDirect, but it isn't set.

afernandezody commented 4 years ago

Hi Jean-Sylvain, My best guess at this junction is that the different response between your system and mine might be either the presence of GDRcopy or the CUDA version. I still think that the latter is unlikely to be the culprit. Figuring this out is going to require some reconfiguration tests so I'll get back to you when I have anything specific. Best, Arturo

afernandezody commented 4 years ago

Hi Jean-Sylvain, I've tried several combinations modifying or removing (e.g. gdrcopy) but the app always breaks with the segmentation fault error. My guess is still that something in the environment might be incompatible. Could you confirm that you are not using UCX? Thanks. Arturo

camierjs commented 4 years ago

Hi Arturo, I'm sorry I can't reproduce the error you are seeing. I'm not using UCX, no. Do you know any service/machine I could try to find to reproduce the error?

afernandezody commented 4 years ago

Hi Jean-Sylvain, I'm using only cloud-based hardware and all major providers offer some GPU capabilities. It takes a while to open an account (if you don't have one yet) and get familiar with the environment (not sure if you would be even interested in following this route). As a last resource, I could probably give you access to a Google Cloud Platform instance although I obviously wouldn't post the keypair/IP address within the thread but email them privately. That is something that you can consider and reply a bit later if you were interested. Anyhow, looking again at the different components, I still see that there might be some conflict with UCX. What I intend on doing is to open a ticket with the UCX team in case they have any insight on this issue. Best, Arturo

afernandezody commented 4 years ago

Jean-Sylvain, This is one of those situations where I just don't know whether the issue is caused by laghos triggering some undesired mapping, UCX or something else. I just opened a new thread with the UCX team (https://github.com/openucx/ucx/issues/#4988). They're usually very responsive.
Best, Arturo

afernandezody commented 4 years ago

Hi Jean-Sylvain, The UCX team is asking for a copy of the app so they can troubleshoot what's causing the error. I'll point them to the Github copy, but you're welcome to also chime in or if you have any other suggestion. Thanks, Arturo

afernandezody commented 4 years ago

Hi Jean-Sylvain, It's finally working in a CentOS environment. I thought about doing this 10 days ago but using CentOS has some complications for me because of unrelated (to Laghos) circumstances. Best, Arturo

afernandezody commented 4 years ago

Hi Jean-Sylvain, Thank you for everything. Scalability is a challenge (which is not necessarily a bad thing). I have a twofold question before closing the ticket: Are the results posted in one of your previous comments from computations on Lassen, and are these results in any document that can be cited? I have the ECP Milestone report but the results are for different parameters.

camierjs commented 4 years ago

Hi Arturo, I'm glad you were able to find an environment that allows you to do your testing. I'll keep a look at the UCX issue you opened. Yes the runs I made was on Lassen and I don't think that the results were used yet in something that could be cited.

afernandezody commented 4 years ago

Hi Jean-Sylvain, Sorry that my questions are becoming almost endless. I'm getting much better FOMs for higher refinement levels (e.g. - rs 4 -rp 2 or - rs 3 -rp 3 with many GPUs). By any chance, do you happen to have results for other refinement levels that you could share? Thanks, Arturo

camierjs commented 4 years ago

Hi Arturo,

The ECP MS32 milestone report is the best we have to show the weak and strong scaling of Laghos for different orders on different MPI runs (the options should be quite similar).

I agree that the focus is more on one of the kernels (FOM1) and less on the overall FOM you are now looking at and that we didn't had at that time.

What kind of orders/runs are you targeting?

Best, Jean-Sylvain

CEED / Laghos

Question about performance (mainly cuda code) #53