ddemidov / amgcl

C++ library for solving large sparse linear systems with algebraic multigrid method
http://amgcl.readthedocs.org/
MIT License

Problem running MPI example using block values with GPU compilation #119

Closed: davidherreroperez closed this issue 5 years ago

davidherreroperez commented 5 years ago

Hi Denis,

I'm running the example located at examples/mpi/mpi_amg.cpp, and I get the following problem with the GPGPU build for OpenCL using VexCL:

$ ./mpi_amg_vexcl_cl -b2
World size: 1

  1. Quadro P6000 (NVIDIA CUDA)

terminate called after throwing an instance of 'cl::Error'
  what():  clGetMemObjectInfo
Aborted (core dumped)

The problem also occurs with the CUDA build using VexCL:

$ ./mpi_amg_vexcl_cuda -b2
World size: 1

  1. Quadro P6000

Type:             BiCGStab
Unknowns:         1048576
Memory footprint: 112.00 M

Number of levels:    4
Operator complexity: 2.65
Grid complexity:     1.20

level     unknowns       nonzeros
    0      1048576        3645440 (37.71%) [1]
    1       176128        4592300 (47.50%) [1]
    2        30518        1355860 (14.03%) [1]
    3         1273          73445 ( 0.76%) [1]

./mpi_amg_vexcl_cuda(_ZN3vex6detail15print_backtraceEv+0x32) [0x55e7e7dae5c2]
./mpi_amg_vexcl_cuda(+0xc6c24) [0x55e7e7d90c24]
./mpi_amg_vexcl_cuda(_ZNK3vex8ReductorIdNS_9SUM_KahanEEclINS_17vector_expressionIN5boost5proto7exprns_10basic_exprINS6_6tagns_3tag10multipliesENS6_7argsns_5list2IRKNS_6vectorIdEESH_EELl2EEEEEEENSt9enable_ifIXsrNS6_7matchesIT_NS_19vector_expr_grammarEEE5valueEdE4typeERKSN+0x53d) [0x55e7e7e0aadd]
./mpi_amg_vexcl_cuda(_ZNK5amgcl3mpi13inner_productclIN3vex6vectorINS_13static_matrixIdLi2ELi1EEEEES7_EENS_4math18inner_product_implINS_7backend10value_typeIT_vE4typeEvE11return_typeERKSCRKT0+0x86) [0x55e7e7e0b9e6]
./mpi_amg_vexcl_cuda(_ZNK5amgcl6solver8bicgstabINS_7backend5vexclINS_13static_matrixIdLi2ELi2EEENS0_16vexcl_skyline_luIS5_EEEENS_3mpi13inner_productEEclINS9_18distributed_matrixIS8_EENS_7runtime3mpi14preconditionerIS8_EEN3vex6vectorINS4_IdLi2ELi1EEEEERSM_EESt5tupleIJmdEERKT_RKT0_RKT1OT2+0x206) [0x55e7e7e834f6]
./mpi_amg_vexcl_cuda(_Z11solve_blockILi2EEvN5amgcl3mpi12communicatorElRKSt6vectorIlSaIlEES7_RKS3_IdSaIdEERKN5boost13property_tree11basic_ptreeINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESK_St4lessISK_EEESB_NS0_7runtime3mpi9partition4typeE+0x11f5) [0x55e7e7f12cb5]
./mpi_amg_vexcl_cuda(main+0x14d0) [0x55e7e7d8ace0]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fddfed2bb97]
./mpi_amg_vexcl_cuda(_start+0x2a) [0x55e7e7d8b77a]

terminate called after throwing an instance of 'vex::backend::cuda::error'
  what():  /home/dherrero/bin/vexcl/vexcl/vexcl/backend/cuda/device_vector.hpp:142
           CUDA Driver API Error (Unknown error 700)
Aborted (core dumped)

The examples run properly without block values. I would appreciate it if you could tell me whether I am doing something wrong, or whether you also see this problem.

ddemidov commented 5 years ago

Looks like insufficient memory on the GPU (although a P6000 should have 24 GB, which is more than enough).

Try running with -n 32 (the default n is 128; the problem size is n^3).
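For example (reusing the flags from your failing run; this exact invocation is just a suggestion):

$ ./mpi_amg_vexcl_cl -n 32 -b2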

ddemidov commented 5 years ago

Sorry, did not catch the part about block values. I'll try to reproduce that when I get to my workstation, but the default problem is Poisson, which does not have block structure.
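For context, "block values" here means the backend is parametrized with small static matrices as its value type instead of scalars. A minimal sketch of that composition (illustrative only; this is not the actual code in examples/mpi/mpi_amg.cpp, and the coarsening/relaxation/solver choices below are arbitrary examples, since the example program selects its components at runtime):

    #include <amgcl/backend/vexcl.hpp>
    #include <amgcl/value_type/static_matrix.hpp>
    #include <amgcl/make_solver.hpp>
    #include <amgcl/amg.hpp>
    #include <amgcl/coarsening/smoothed_aggregation.hpp>
    #include <amgcl/relaxation/spai0.hpp>
    #include <amgcl/solver/bicgstab.hpp>

    // -b2 corresponds to 2x2 static matrices as the value type of the backend:
    typedef amgcl::static_matrix<double, 2, 2> value_type;
    typedef amgcl::backend::vexcl<value_type>  Backend;

    // A solver composed on top of that backend (component choices are illustrative):
    typedef amgcl::make_solver<
        amgcl::amg<
            Backend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::spai0
            >,
        amgcl::solver::bicgstab<Backend>
        > Solver;

    // The VexCL backend also needs a compute context in its parameters, e.g.
    //   vex::Context ctx(vex::Filter::Env);
    //   Backend::params bprm;
    //   bprm.q = ctx;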

davidherreroperez commented 5 years ago

> Sorry, did not catch the part about block values. I'll try to reproduce that when I get to my workstation, but the default problem is Poisson, which does not have block structure.

I have tested this using the problem discussed in

https://github.com/ddemidov/amgcl/issues/114

with the files

https://www.dropbox.com/sh/mf38r0mew9eloli/AACumScoRpVqjo1VDks98wgIa?dl=0

and the problem persists, as follows:

OK using the CPU build with block structure (-b2)

$ ./mpi_amg -A ordered_by_dim/A_37442.mtx -f ordered_by_dim/b_37442.mtx -p solver.type=cg solver.tol=1e-8 solver.maxiter=5000 precond.relax.type=ilu0 -b2

World size: 1
Type:             CG
Unknowns:         18721
Memory footprint: 1.14 M

Number of levels:    2
Operator complexity: 1.06
Grid complexity:     1.07

level     unknowns       nonzeros
    0        18721         166753 (94.03%) [1]
    1         1225          10585 ( 5.97%) [1]

Iterations: 77
Error:      7.52487e-09

[Profile:       0.759 s] (100.00%)
[  self:        0.034 s] (  4.53%)
[  read:        0.543 s] ( 71.59%)
[  setup:       0.025 s] (  3.27%)
[  solve:       0.156 s] ( 20.61%)

OK using VexCL with the OpenCL backend (block size passed via precond.coarsening.aggr.block_size=2 instead of -b2)

$ ./mpi_amg_vexcl_cl -A ordered_by_dim/A_37442.mtx -f ordered_by_dim/b_37442.mtx -p solver.type=cg solver.tol=1e-8 solver.maxiter=5000 precond.relax.type=ilu0 precond.coarsening.aggr.block_size=2

World size: 1

  1. Quadro P6000 (NVIDIA CUDA)

Type:             CG
Unknowns:         37442
Memory footprint: 1.14 M

Number of levels:    2
Operator complexity: 1.06
Grid complexity:     1.07

level     unknowns       nonzeros
    0        37442         667012 (94.03%) [1]
    1         2450          42340 ( 5.97%) [1]

Iterations: 52
Error:      6.99069e-09

[Profile:       0.738 s] (100.00%)
[  self:        0.117 s] ( 15.91%)
[  read:        0.509 s] ( 68.92%)
[  setup:       0.058 s] (  7.87%)
[  solve:       0.054 s] (  7.31%)

Fails using block structure (-b2) with VexCL and the OpenCL backend

$ ./mpi_amg_vexcl_cl -A ordered_by_dim/A_37442.mtx -f ordered_by_dim/b_37442.mtx -p solver.type=cg solver.tol=1e-8 solver.maxiter=5000 precond.relax.type=ilu0 -b2

World size: 1

  1. Quadro P6000 (NVIDIA CUDA)

terminate called after throwing an instance of 'cl::Error'
  what():  clGetMemObjectInfo
Aborted (core dumped)

ddemidov commented 5 years ago

Does it work with solver_vexcl_cl (non-MPI)?
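For instance, something like this (I am assuming solver_vexcl_cl takes the same -A/-f/-p/-b options as the MPI example; adjust as needed):

$ ./solver_vexcl_cl -A ordered_by_dim/A_37442.mtx -f ordered_by_dim/b_37442.mtx -p solver.type=cg solver.tol=1e-8 solver.maxiter=5000 precond.relax.type=ilu0 -b2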

ddemidov commented 5 years ago

Ok, I can reproduce the problem with mpi_amg_vexcl_cl. Will look into it.

ddemidov commented 5 years ago

Should be fixed now, thank you for reporting!

There are related commits in vexcl (ddemidov/vexcl@5778e23f8b00b3151031ee28fec83972b01a6d11 and ddemidov/vexcl@9b24cacb28c39cd3412211b44c98108f800fec8c), and also here (df925065295893952a2b64f89e46f47cb4d5726d), but these are only required for debug builds.

The main fix comes from 7b95a8e0d11f52308dfd3d490dd47f7166f0ee88.

I did not try this with mpi_amg_vexcl_cuda (I do not have a usable CUDA configuration at the moment); please check whether the fix also works there.

EDIT: mpi_amg_vexcl_cuda works for me as well.

davidherreroperez commented 5 years ago

Hi Denis, thanks for your prompt response. Both versions (vexcl_cl and vexcl_cuda) are also working for me!