Open pelyakim opened 1 year ago
Hello Pierre,
Yes, you can use amgcl with AMD cards using OpenCL via vexcl backend. There is an example (of using MPI/vexcl) here: https://amgcl.readthedocs.io/en/latest/tutorial/SerenaMPI.html#id2
Thanks for your answer. My partitioning is already done, and I stored it in a vector the size of my domain (nx·ny·nz in 3D). Is it possible to use it in this state? Also, I would like to use, for example, PCG as the solver with AMG as the preconditioner; is that possible? Thanks for your answers. Pierre
It should be possible. See more details on partitioning and MPI here: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html. There is an example of using PCG here: https://amgcl.readthedocs.io/en/latest/tutorial/NullspaceMPI.html
Thank you. Can you explain what the make_shared function does and where its sources are? Thanks
Sorry, I meant the distributed_matrix function. Thanks
Hello, I could not compile AMGCL with VexCL; it seems that it does not find VexCL. I also don't know how to install VexCL. Could you help me with this installation? Then I would like to run the tutorials, especially the Poisson problem in parallel with VexCL on AMD graphics cards, but I can't find the matrix poisson3Db.bin. Could you tell me where to find it? Thanks for all these indications. Sincerely, Pierre
Try the following in a separate folder:
git clone https://github.com/ddemidov/vexcl
cmake -Bvexcl_build vexcl
After this, try to reconfigure amgcl. It should find vexcl now.
Thanks, it found vexcl when running cmake. Also, I can't find the poisson3Db.bin file to test the poisson3Db_mpi_vexcl_cl executable from the MPI Poisson problem tutorial; could you tell me where I can find it? Thanks, Pierre
You can convert the mtx file to bin file using examples/mm2bin utility, search for 'mm2bin' on this page: https://amgcl.readthedocs.io/en/latest/tutorial/Serena.html?highlight=mm2bin#structural-problem.
There is a link to download the Poisson3Db matrix here: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3Db.html
Hi, thanks, but I have a problem with the AMGCL build: I can't get AMGCL to link against my scotch installation. When I run the command to generate the Makefile: cmake -DCMAKE_INSTALL_PREFIX=/lus/home/pelyakime/AMGCL/scotch-v7.0.1/install -DCMAKE_BUILD_TYPE=Release .. it seems that it does not find the scotch library. What can I do? Thanks for your help
Thank you very much for your help; I don't have any problem with libraries anymore, and I can test the executables of the poisson3Db tutorial: poisson3Db and poisson3Db_mpi work without problems. But with poisson3Db_mpi_vexcl I get a segmentation fault. I'm on an AMD architecture with AMD graphics cards, and I hope to use OpenCL to compute on AMD GPUs (I compute on the Adastra machine in France). I managed to recompile the sources (poisson3Db_mpi_vexcl.cpp), and the problem seems to appear as soon as I reach this part of the code:
for(int i = 0; i < world.size; ++i) {
    // unclutter the output:
    if (i == world.rank)
        std::cout << world.rank << ":" << ctx.queue(0) << std::endl;
    MPI_Barrier(world);
}
I'm not sure where the problem could be coming from; do you have any idea? Thanks a lot for your help. Pierre
Looks like you don't have any GPUs in the context. What does vexcl/examples/devlist output on your system?
Hello, when I run this command from my Slurm script I get:
Currently Loaded Modules:
1) craype-network-ofi 9) cray-mpich/8.1.24
2) craype-x86-trento 10) craype/2.7.19
3) craype-accel-amd-gfx90a 11) perftools-base/23.02.0
4) libfabric/1.15.2.0 12) rocm/5.2.0
5) PrgEnv-cray/8.3.3 13) cpe/23.02
6) cce/15.0.1 14) CPE-23.02-cce-15.0.1-GPU-softs
7) cray-dsmml/0.2.2 15) scotch/6.1.3-mpi
8) cray-libsci/23.02.1.1 16) boost/1.81.0-mpi-python3
linux-vdso.so.1 (0x00007ffe9ab92000)
libOpenCL.so.1 => /opt/rocm-5.2.0/lib/libOpenCL.so.1 (0x000015321c880000)
libboost_filesystem-mt-x64.so.1.81.0 => /opt/software/gaia/dev/1.0.3-e7de077a/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_p/boost-1.81.0-cce-15.0.1-cn4o/lib/libboost_filesystem-mt-x64.so.1.81.0 (0x000015321cc86000)
libamdhip64.so.5 => /opt/rocm-5.2.0/hip/lib/libamdhip64.so.5 (0x000015321b98b000)
libmpi_cray.so.12 => /opt/cray/pe/lib64/libmpi_cray.so.12 (0x0000153218ffd000)
libmpi_gtl_hsa.so.0 => /opt/cray/pe/lib64/libmpi_gtl_hsa.so.0 (0x0000153218d9a000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000153218b96000)
libstdc++.so.6 => /opt/cray/pe/gcc-libs/libstdc++.so.6 (0x0000153218770000)
libfi.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libfi.so.1 (0x00001532181cb000)
libquadmath.so.0 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libquadmath.so.0 (0x0000153217f84000)
libmodules.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libmodules.so.1 (0x000015321cc5d000)
libcraymath.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libcraymath.so.1 (0x000015321cb74000)
libf.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libf.so.1 (0x000015321cae0000)
libu.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libu.so.1 (0x0000153217e7b000)
libcsup.so.1 => /opt/cray/pe/cce/15.0.1/cce/x86_64/lib/libcsup.so.1 (0x000015321cad7000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000153217c5b000)
libm.so.6 => /lib64/libm.so.6 (0x00001532178d9000)
libunwind.so.1 => /opt/cray/pe/cce/15.0.1/cce-clang/x86_64/lib/libunwind.so.1 (0x000015321cac0000)
libc.so.6 => /lib64/libc.so.6 (0x0000153217514000)
librt.so.1 => /lib64/librt.so.1 (0x000015321730c000)
libamd_comgr.so.2 => /opt/rocm-5.2.0/lib/libamd_comgr.so.2 (0x000015320fc5c000)
libhsa-runtime64.so.1 => /opt/rocm-5.2.0/lib/libhsa-runtime64.so.1 (0x000015320f80f000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x000015320f603000)
libgcc_s.so.1 => /opt/cray/pe/gcc-libs/libgcc_s.so.1 (0x000015320f3e4000)
/lib64/ld-linux-x86-64.so.2 (0x000015321ca88000)
libfabric.so.1 => /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1 (0x000015320f0f1000)
libpmi.so.0 => /opt/cray/pe/lib64/libpmi.so.0 (0x000015320eeef000)
libpmi2.so.0 => /opt/cray/pe/lib64/libpmi2.so.0 (0x000015320ecce000)
libgfortran.so.5 => /opt/cray/pe/gcc-libs/libgfortran.so.5 (0x000015320e801000)
libz.so.1 => /lib64/libz.so.1 (0x000015320e5e9000)
libtinfo.so.6 => /lib64/libtinfo.so.6 (0x000015320e3bc000)
libelf.so.1 => /lib64/libelf.so.1 (0x000015320e1a3000)
libdrm.so.2 => /opt/amdgpu/lib64/libdrm.so.2 (0x000015320df8f000)
libdrm_amdgpu.so.1 => /opt/amdgpu/lib64/libdrm_amdgpu.so.1 (0x000015320dd83000)
libcxi.so.1 => /lib64/libcxi.so.1 (0x000015320db5e000)
libcurl.so.4 => /lib64/libcurl.so.4 (0x000015320d8d0000)
libjson-c.so.4 => /lib64/libjson-c.so.4 (0x000015320d6c0000)
libatomic.so.1 => /opt/cray/pe/gcc-libs/libatomic.so.1 (0x000015320d4b7000)
libpals.so.0 => /opt/cray/pe/lib64/libpals.so.0 (0x000015320d2af000)
libnghttp2.so.14 => /lib64/libnghttp2.so.14 (0x000015320d088000)
libidn2.so.0 => /lib64/libidn2.so.0 (0x000015320ce6a000)
libssh.so.4 => /lib64/libssh.so.4 (0x000015320cbfb000)
libpsl.so.5 => /lib64/libpsl.so.5 (0x000015320c9ea000)
libssl.so.1.1 => /lib64/libssl.so.1.1 (0x000015320c754000)
libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x000015320c26b000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x000015320c016000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x000015320bd2c000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x000015320bb15000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x000015320b911000)
libldap-2.4.so.2 => /lib64/libldap-2.4.so.2 (0x000015320b6c0000)
liblber-2.4.so.2 => /lib64/liblber-2.4.so.2 (0x000015320b4b0000)
libbrotlidec.so.1 => /lib64/libbrotlidec.so.1 (0x000015320b2a3000)
libjansson.so.4 => /lib64/libjansson.so.4 (0x000015320b095000)
libunistring.so.2 => /lib64/libunistring.so.2 (0x000015320ad14000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x000015320ab01000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x000015320a8fd000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000015320a6e6000)
libsasl2.so.3 => /lib64/libsasl2.so.3 (0x000015320a4c8000)
libbrotlicommon.so.1 => /lib64/libbrotlicommon.so.1 (0x000015320a2a7000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x000015320a07b000)
libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000153209e52000)
libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x0000153209bce000)
OpenCL devices:
gfx90a:sramecc+:xnack-
CL_PLATFORM_NAME = AMD Accelerated Parallel Processing
CL_DEVICE_TYPE = 4
CL_DEVICE_VENDOR = Advanced Micro Devices, Inc.
CL_DEVICE_VERSION = OpenCL 2.0
CL_DEVICE_MAX_COMPUTE_UNITS = 110
CL_DEVICE_HOST_UNIFIED_MEMORY = 0
CL_DEVICE_GLOBAL_MEM_SIZE = 68702699520
CL_DEVICE_LOCAL_MEM_SIZE = 65536
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 58397294592
CL_DEVICE_ADDRESS_BITS = 64
CL_DEVICE_MAX_CLOCK_FREQUENCY = 1700
CL_DEVICE_EXTENSIONS = cl_amd_assembly_program
cl_amd_copy_buffer_p2p cl_amd_device_attribute_query cl_amd_media_ops
cl_amd_media_ops2 cl_khr_3d_image_writes cl_khr_byte_addressable_store
cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_gl_sharing
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics
cl_khr_image2d_from_buffer cl_khr_int64_base_atomics
cl_khr_int64_extended_atomics cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics cl_khr_subgroups
(The same gfx90a:sramecc+:xnack- listing is repeated for seven more identical devices; eight GPUs in total.)
cpu-bind=MASK - g1245, task 0 0 [4127093]: mask 0xffffffff00000000ffffffff set
cpu-bind=MASK - g1245, task 1 1 [4127094]: mask 0xffffffff00000000ffffffff00000000 set
srun: error: g1245: task 1: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=116457.0
slurmstepd: error: *** STEP 116457.0 ON g1245 CANCELLED AT 2023-04-07T10:30:51 ***
srun: error: g1245: task 0: Terminated
srun: Force Terminated StepId=116457.0
I don't know if this is related, but when building vexcl it does not find OPENCL_HPP:
/vexcl_cce$ cmake -Bvexcl_build -DVEXCL_BUILD_TESTS=ON -DVEXCL_BUILD_EXAMPLES=ON
-- No build type selected, default to RelWithDebInfo
-- The C compiler identification is Clang 15.0.6
-- The CXX compiler identification is Clang 15.0.6
-- Cray Programming Environment 2.7.19 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/cray/pe/craype/2.7.19/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.19 CXX
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/cray/pe/craype/2.7.19/bin/CC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /opt/software/gaia/dev/1.0.3-e7de077a/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_p/boost-1.81.0-cce-15.0.1-cn4o/lib/cmake/Boost-1.81.0/BoostConfig.cmake (found version "1.81.0") found components: chrono date_time filesystem program_options system thread unit_test_framework
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /opt/rocm-5.2.0/lib/libOpenCL.so (found version "2.2")
-- -- OPENCL_HPP-NOTFOUND --
-- Found VexCL::OpenCL
-- Found VexCL::Compute
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY)
-- Found OpenMP_C: -fopenmp (found version "5.0")
-- Found OpenMP_CXX: -fopenmp (found version "5.0")
-- Found OpenMP: TRUE (found version "5.0")
-- Found VexCL::JIT
-- Selected backend: OpenCL
-- Configuring done
-- Generating done
-- Build files have been written to:
And when I build the tests and examples there is one error:
vexcl_build$ make -j
[ 1%] Building CXX object tests/CMakeFiles/fft.dir/fft.cpp.o
[ 4%] Building CXX object tests/CMakeFiles/context.dir/context.cpp.o
[ 5%] Building CXX object tests/CMakeFiles/scan.dir/scan.cpp.o
[ 5%] Building CXX object tests/CMakeFiles/vector_io.dir/vector_io.cpp.o
[ 7%] Building CXX object tests/CMakeFiles/vector_pointer.dir/vector_pointer.cpp.o
[ 10%] Building CXX object tests/CMakeFiles/multi_array.dir/multi_array.cpp.o
[ 10%] Building CXX object tests/CMakeFiles/vector_view.dir/vector_view.cpp.o
[ 13%] Building CXX object tests/CMakeFiles/temporary.dir/temporary.cpp.o
[ 13%] Building CXX object tests/CMakeFiles/image.dir/image.cpp.o
[ 13%] Building CXX object tests/CMakeFiles/stencil.dir/stencil.cpp.o
[ 17%] Building CXX object tests/CMakeFiles/cast.dir/cast.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/reduce_by_key.dir/reduce_by_key.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/spmv.dir/spmv.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/tensordot.dir/tensordot.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/multivector_create.dir/multivector_create.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/sparse_matrices.dir/sparse_matrices.cpp.o
[ 20%] Building CXX object tests/CMakeFiles/threads.dir/threads.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/multivector_arithmetics.dir/multivector_arithmetics.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/vector_arithmetics.dir/vector_arithmetics.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/events.dir/events.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/logical.dir/logical.cpp.o
[ 25%] Building CXX object tests/CMakeFiles/reinterpret.dir/reinterpret.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/vector_create.dir/vector_create.cpp.o
[ 28%] Building CXX object tests/CMakeFiles/boost_version.dir/boost_version.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/random.dir/random.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/custom_kernel.dir/custom_kernel.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/tagged_terminal.dir/tagged_terminal.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/deduce.dir/deduce.cpp.o
[ 30%] Building CXX object tests/CMakeFiles/types.dir/types.cpp.o
[ 32%] Building CXX object tests/CMakeFiles/vector_copy.dir/vector_copy.cpp.o
[ 33%] Building CXX object tests/CMakeFiles/multiple_objects.dir/dummy1.cpp.o
[ 33%] Building CXX object tests/CMakeFiles/multiple_objects.dir/dummy2.cpp.o
[ 35%] Building CXX object tests/CMakeFiles/generator.dir/generator.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/mba.dir/mba.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/scan_by_key.dir/scan_by_key.cpp.o
[ 40%] Building CXX object tests/CMakeFiles/sort.dir/sort.cpp.o
[ 43%] Building CXX object tests/CMakeFiles/constants.dir/constants.cpp.o
[ 43%] Building CXX object examples/CMakeFiles/fft_benchmark.dir/fft_benchmark.cpp.o
[ 45%] Building CXX object tests/CMakeFiles/eval.dir/eval.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/fft_profile.dir/fft_profile.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/exclusive.dir/exclusive.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/mba_benchmark.dir/mba_benchmark.cpp.o
[ 45%] Building CXX object examples/CMakeFiles/benchmark.dir/benchmark.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/devlist.dir/devlist.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/complex_simple.dir/complex_simple.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/symbolic.dir/symbolic.cpp.o
[ 49%] Building CXX object examples/CMakeFiles/complex_spmv.dir/complex_spmv.cpp.o
[ 50%] Building CXX object tests/CMakeFiles/svm.dir/svm.cpp.o
[ 51%] Linking CXX executable boost_version
[ 51%] Built target boost_version
/lus/home/pelyakime/AMGCL/vexcl_cce/tests/vector_copy.cpp:81:40: warning: lambda capture 'n' is not required to be captured for this use [-Wunused-lambda-capture]
std::generate(i.begin(), i.end(), [n](){ return rand() % n; });
^
/lus/home/pelyakime/AMGCL/vexcl_cce/tests/threads.cpp:13:17: warning: lambda capture 'n' is not required to be captured for this use [-Wunused-lambda-capture]
auto run = [n](vex::backend::command_queue queue, cl_long *s) {
^
In file included from /lus/home/pelyakime/AMGCL/vexcl_cce/tests/vector_create.cpp:3:
In file included from /lus/home/pelyakime/AMGCL/vexcl_cce/vexcl/vector.hpp:51:
/lus/home/pelyakime/AMGCL/vexcl_cce/vexcl/operations.hpp:755:42: error: call to implicitly-deleted default constructor of 'boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>'
vector_expression(const Expr &expr = Expr())
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:901:30: note: in instantiation of default function argument expression for 'vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>' required here
: public __bool_constant<__is_constructible(_Tp, _Args...)>
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:139:26: note: in instantiation of template class 'std::__is_constructible_impl<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>' requested here
: public conditional<_B1::value, _B2, _B1>::type
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:1224:14: note: in instantiation of template class 'std::__and_<std::__is_constructible_impl<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>, std::__is_implicitly_default_constructible_safe<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>>' requested here
: public __and_<__is_constructible_impl<_Tp>,
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/type_traits:139:26: note: in instantiation of template class 'std::__is_implicitly_default_constructible<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>' requested here
: public conditional<_B1::value, _B2, _B1>::type
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:491:9: note: in instantiation of template class 'std::__and_<std::__is_implicitly_default_constructible<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>>, std::__is_implicitly_default_constructible<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>>>' requested here
return __and_<std::__is_implicitly_default_constructible<_Types>...
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:899:6: note: in instantiation of member function 'std::_TupleConstraints<true, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>>::__is_implicitly_default_constructible' requested here
__is_implicitly_default_constructible(),
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:1063:66: note: in instantiation of template type alias '_ImplicitDefaultCtor' requested here
_ImplicitDefaultCtor<is_object<_Alloc>::value, _T1, _T2> = true>
^
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:1065:2: note: while substituting prior template arguments into non-type template parameter [with _Alloc = vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>]
tuple(allocator_arg_t __tag, const _Alloc& __a)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/cray/pe/gcc/10.3.0/snos/lib/gcc/x86_64-centos-linux/10.3.0/../../../../include/g++/tuple:1482:14: note: while substituting deduced template arguments into function template 'tuple' [with _Alloc = vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>, $1 = (no value)]
return __result_type(std::forward<_Elements>(__args)...);
^
/lus/home/pelyakime/AMGCL/vexcl_cce/tests/vector_create.cpp:208:54: note: in instantiation of function template specialization 'std::make_tuple<const vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>>, const vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::minus, boost::proto::argsns_::list2<vex::vector<int> &, vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>>, 2>>>' requested here
std::tie(q, s) = vex::expression_properties(std::make_tuple(2 * x, x - 1));
^
/opt/software/gaia/dev/1.0.3-e7de077a/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_p/boost-1.81.0-cce-15.0.1-cn4o/include/boost/proto/detail/preprocessed/basic_expr.hpp:212:97: note: default constructor of 'basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>' is implicitly deleted because field 'child1' of reference type 'boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::multiplies, boost::proto::argsns_::list2<vex::vector_expression<boost::proto::exprns_::basic_expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<int>, 0>>, vex::vector<int> &>, 2>::proto_child1' (aka 'vex::vector<int> &') would not be initialized
typedef Arg0 proto_child0; proto_child0 child0; typedef Arg1 proto_child1; proto_child1 child1;
^
1 error generated.
make[2]: *** [tests/CMakeFiles/vector_create.dir/build.make:76: tests/CMakeFiles/vector_create.dir/vector_create.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:918: tests/CMakeFiles/vector_create.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[ 52%] Linking CXX executable devlist
[ 52%] Built target devlist
[ 53%] Linking CXX executable svm
[ 53%] Built target svm
[ 54%] Linking CXX executable types
[ 54%] Built target types
[ 55%] Linking CXX executable exclusive
[ 55%] Built target exclusive
[ 56%] Linking CXX executable multiple_objects
[ 56%] Built target multiple_objects
[ 57%] Linking CXX executable custom_kernel
[ 57%] Built target custom_kernel
[ 58%] Linking CXX executable vector_io
[ 58%] Built target vector_io
[ 60%] Linking CXX executable constants
[ 60%] Built target constants
[ 61%] Linking CXX executable mba_benchmark
[ 61%] Built target mba_benchmark
[ 62%] Linking CXX executable reinterpret
[ 62%] Built target reinterpret
[ 63%] Linking CXX executable context
[ 63%] Built target context
[ 64%] Linking CXX executable complex_simple
[ 64%] Built target complex_simple
[ 65%] Linking CXX executable eval
[ 65%] Built target eval
[ 66%] Linking CXX executable cast
[ 66%] Built target cast
[ 67%] Linking CXX executable image
[ 67%] Built target image
1 warning generated.
[ 68%] Linking CXX executable vector_copy
[ 68%] Built target vector_copy
[ 69%] Linking CXX executable multivector_create
[ 69%] Built target multivector_create
1 warning generated.
[ 70%] Linking CXX executable threads
[ 70%] Built target threads
[ 71%] Linking CXX executable logical
[ 71%] Built target logical
[ 72%] Linking CXX executable symbolic
[ 72%] Built target symbolic
[ 73%] Linking CXX executable mba
[ 73%] Built target mba
[ 74%] Linking CXX executable deduce
[ 74%] Built target deduce
[ 75%] Linking CXX executable events
[ 75%] Built target events
[ 76%] Linking CXX executable scan
[ 76%] Built target scan
[ 77%] Linking CXX executable multi_array
[ 77%] Built target multi_array
[ 78%] Linking CXX executable stencil
[ 78%] Built target stencil
[ 80%] Linking CXX executable reduce_by_key
[ 80%] Built target reduce_by_key
[ 81%] Linking CXX executable complex_spmv
[ 81%] Built target complex_spmv
[ 82%] Linking CXX executable tensordot
[ 82%] Built target tensordot
[ 83%] Linking CXX executable scan_by_key
[ 83%] Built target scan_by_key
[ 84%] Linking CXX executable tagged_terminal
[ 84%] Built target tagged_terminal
[ 85%] Linking CXX executable vector_pointer
[ 85%] Built target vector_pointer
[ 86%] Linking CXX executable generator
[ 86%] Built target generator
[ 87%] Linking CXX executable temporary
[ 87%] Built target temporary
[ 88%] Linking CXX executable random
[ 88%] Built target random
[ 89%] Linking CXX executable fft_profile
[ 89%] Built target fft_profile
[ 90%] Linking CXX executable fft_benchmark
[ 90%] Built target fft_benchmark
[ 91%] Linking CXX executable sparse_matrices
[ 91%] Built target sparse_matrices
[ 92%] Linking CXX executable vector_view
[ 92%] Built target vector_view
[ 93%] Linking CXX executable spmv
[ 93%] Built target spmv
[ 94%] Linking CXX executable multivector_arithmetics
[ 94%] Built target multivector_arithmetics
[ 95%] Linking CXX executable vector_arithmetics
[ 95%] Built target vector_arithmetics
[ 96%] Linking CXX executable fft
[ 96%] Built target fft
[ 97%] Linking CXX executable sort
[ 97%] Built target sort
[ 98%] Linking CXX executable benchmark
[ 98%] Built target benchmark
make: *** [Makefile:146: all] Error 2
Looks like you do have some AMD GPUs. Try replacing these lines
with
std::cout << world.rank << ": " << ctx << std::endl;
After replacing these lines, I get a new error:
cpu-bind=MASK - g1245, task 0 0 [1121470]: mask 0xffffffff00000000ffffffff set
cpu-bind=MASK - g1245, task 1 1 [1121472]: mask 0xffffffff00000000ffffffff00000000 set
terminate called after throwing an instance of 'std::runtime_error'
what(): Empty VexCL context!
srun: error: g1245: task 1: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=116528.0
slurmstepd: error: *** STEP 116528.0 ON g1245 CANCELLED AT 2023-04-07T11:42:43 ***
srun: error: g1245: task 0: Terminated
srun: Force Terminated StepId=116528.0
And my output file is:
1:
0: 1. gfx90a:sramecc+:xnack- (AMD Accelerated Parallel Processing)
World size: 2
Matrix poisson3Db.bin: 85623x85623
RHS poisson3Db_b.bin: 85623x1
So only one of your MPI processes got a GPU (do you only have one?). The context is created in Exclusive mode here:
you can replace it with
vex::Context ctx(vex::Filter::Count(1));
but then each of your MPI processes will use the same GPU, which would not be efficient (but should work). In general, it is better to start a single MPI process per GPU.
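When several ranks do share a node, one hedged sketch (not from this thread; `device_for_rank` is a hypothetical helper) is to round-robin the node-local MPI rank over the devices visible on that node:

```cpp
#include <cassert>

// Hypothetical helper (not part of amgcl or VexCL): round-robin the
// node-local MPI rank over the devices visible on that node, so that
// several ranks can share a node without all piling onto device 0.
// `local_rank` could come from MPI_Comm_split_type(...,
// MPI_COMM_TYPE_SHARED, ...); `num_devices` from the OpenCL device list.
inline int device_for_rank(int local_rank, int num_devices) {
    return num_devices > 0 ? local_rank % num_devices : 0;
}
```

The resulting index could then be used with a positional device filter, e.g. `vex::Context ctx(vex::Filter::Position(device_for_rank(local_rank, n)));` (assuming VexCL's `Position` filter); `Exclusive` mode remains the better choice when running exactly one rank per GPU.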
Ok, thanks a lot for your explanations, it works now. Indeed, I had reserved only 1 GPU for the test in my Slurm script; I can of course reserve more. For the simulation I want to run later, if everything works well, I plan to use several MPI processes per GPU. My code is hybrid MPI-GPU: only the pseudo-Poisson equation for the pressure in the Navier-Stokes equations is solved on the GPUs, which reduces the MPI part of the code while keeping the efficiency of the GPUs. Now I have to test with the matrix and right-hand side that my code builds. Thanks for your help, I will keep you informed of my progress. Have a nice day
Hello, I tested the poisson3Db_mpi_vexcl example with a matrix produced by my code for one of my test cases, together with its right-hand side, and I obtained a solution that seems correct. I would now like to integrate this solution of the linear system Ax=b into my JADIM code (IMFT, France), but I am having some difficulties. The partitioning is already done in the code (a partitioning array gives, for each cell of the mesh, the rank of the owning process) and the matrix is in CSR format. Each MPI process holds its part of the partitioned CSR matrix: for example, with an MPI partitioning of 2 in X, 1 in Y and 1 in Z, rank 0 owns the first half of the matrix and rank 1 the other half. So I started from the poisson3Db_mpi_vexcl example and replaced the reading of the CSR matrix and the right-hand side with the parts of the matrix A (in fact pointers to the arrays row_offset, ia and val) and of the right-hand side that each MPI process holds. I copied all these arrays into vector<>. I checked that I get the same values as in my first experiments on the poisson3Db_mpi_vexcl example with the same (global) matrix, and my input vectors look correct. However, when I reach Solver solve(world, A, prm, bprm); I get an "out of range" error. I am not sure where it could come from; could you give me an idea?
Also, my second question: could I easily reuse my own partitioning without having to recode the MPI distribution of the matrix?
Thank you very much for your help, which is very precious to me.
Here is the function where I integrate the solution of Ax=b in my code:
#include <vector>
#include <iostream>
#include <ctime>
#include <amgcl/backend/vexcl.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/mpi/distributed_matrix.hpp>
#include <amgcl/mpi/make_solver.hpp>
#include <amgcl/mpi/amg.hpp>
#include <amgcl/mpi/coarsening/smoothed_aggregation.hpp>
#include <amgcl/mpi/relaxation/spai0.hpp>
#include <amgcl/mpi/solver/bicgstab.hpp>
#include <amgcl/io/binary.hpp>
#include <amgcl/profiler.hpp>
// #if defined(AMGCL_HAVE_PARMETIS)
// # include <amgcl/mpi/partition/parmetis.hpp>
// #elif defined(AMGCL_HAVE_SCOTCH)
#include <amgcl/mpi/partition/ptscotch.hpp>
// #endif
using namespace std;
extern "C" {
void AMGCL_cg_amg_mpi( double *matval, int *ia, int *ja,
double *rhs_jadim, double *sol, int &nip,
int &njp, int &nkp, int &nnz, int &npt, int &npt0, int &irovar, int &nloc, int &t_p, int &maxit )
/* void AMGCL_cg_amg_mpi( double *matval, int *ia, int *ja,
double *rhs_jadim, double *sol, int &nip,
int &njp, int &nkp, int &nnz, int &npt, int &npt0, int &irovar, int &nloc, int &t_p, MPI_Comm *comm_c_AMGCL, int &maxit ) */ //double norm )
{
FILE *f1, *f2, *f3, *f4;
int nijkp = nip*njp*nkp;
int num_procs;
clock_t c_start, c_end;
cout << "Check input params: " << nip << ", " << njp << ", " << nkp << ", " << nnz << ", " << npt << ", " << npt0 << ", " << irovar << ", " << nloc << ", " << t_p << ", " << maxit << endl; // num_procs << endl;
// MPI_Comm_size(*comm_c_AMGCL, &num_procs);
amgcl::mpi::communicator world(MPI_COMM_WORLD);
// Wait for all processes
MPI_Barrier(world);
// Create VexCL context. Use vex::Filter::Exclusive so that different MPI
// processes get different GPUs. Each process gets a single GPU:
vex::Context ctx(vex::Filter::Exclusive(vex::Filter::Count(1)));
std::cout << world.rank << ": " << ctx << std::endl;
// The profiler:
amgcl::profiler<> prof("JADIM MPI(VexCL)");
// Read the system matrix and the RHS:
prof.tic("read");
// Get the global size of the matrix:
ptrdiff_t rows_global = nijkp; //amgcl::io::crs_size<ptrdiff_t>(argv[1]);
ptrdiff_t cols = 1;
ptrdiff_t rows = nloc;
ptrdiff_t chunk = nloc;
cout << world.rank << " - rows_global :" << rows_global << "rows : " << rows << " cols :" << cols << " chunk: " << chunk << endl;
// // Split the matrix into approximately equal chunks of rows_global
// ptrdiff_t chunk = (rows_global + world.size - 1) / world.size;
// ptrdiff_t row_beg = std::min(rows_global, chunk * t_p);
// ptrdiff_t row_end = std::min(rows_global, row_beg + chunk);
// chunk = row_end - row_beg;
//
// cout << world.rank << ": chunk : " << chunk << " row_beg: " << row_beg << " row_end: " << row_end << endl;
// amgcl::io::read_crs(argv[1], rows, row_offset, col, val, row_beg, row_end);
// amgcl::io::read_dense(argv[2], rows, cols, rhs, row_beg, row_end);
// ---------- 1 - Copy matval, ia, ja, rhs_jadim and sol into temporary buffers ----------
c_start = clock();
// Read our part of the system matrix and the RHS.
vector<ptrdiff_t> row_offset(nloc+1), col(nnz);
vector<double> val(nnz), rhs(nloc), in_x(nloc);
// Copy ia, ja, matval, rhs_jadim and sol into temporary arrays
// cout << "Copy ia, ja and matval" << endl;
for (int i=0; i<nloc+1; ++i) {
row_offset[i]=ia[i];
// if (t_p == 0 ) cout << row_offset[i] << " " << ia[i] << endl;
}
for (int i=0; i<nnz; ++i) {
col[i] = ja[i];
val[i] = matval[i];
// if (t_p == 0 ) cout << col[i] << " " << val[i] << endl;
}
// cout << "Copy rhs_jadim and sol" << endl;
for (int i=0; i<nloc; ++i) {
rhs[i] = rhs_jadim[i];
in_x[i] = sol[i];
// if (t_p == 0) cout << rhs[i] << " " << in_x[i] << endl;
}
// Stop time measurement
if (t_p == 0) cout << "Time to copy buffer : " << (clock() - c_start) / 1e6 << endl;
prof.toc("read");
// Copy the RHS vector to the backend:
vex::vector<double> f(ctx, rhs);
if (t_p == 0)
std::cout
<< "World size: " << world.size << std::endl
<< "Matrix " << ": " << rows << "x" << rows << std::endl
<< "RHS " << ": " << rows << "x" << cols << std::endl;
// Compose the solver type
typedef amgcl::backend::vexcl<double> DBackend;
typedef amgcl::backend::vexcl<float> FBackend;
typedef amgcl::mpi::make_solver<
amgcl::mpi::amg<
FBackend,
amgcl::mpi::coarsening::smoothed_aggregation<FBackend>,
amgcl::mpi::relaxation::spai0<FBackend>
>,
amgcl::mpi::solver::bicgstab<DBackend>
> Solver;
cout << world.rank << " - Before make_shared" << endl;
// Create the distributed matrix from the local parts.
auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
world, std::tie(chunk, row_offset, col, val));
// auto A = std::make_shared<amgcl::mpi::distributed_matrix<DBackend>>(
// *comm_c_AMGCL, std::tie(chunk, row_offset, col, val));
cout << world.rank << " - After make_shared" << endl;
// Wait for all processes
MPI_Barrier(world);
typedef amgcl::mpi::partition::ptscotch<DBackend> Partition;
if (world.size > 1) {
prof.tic("partition");
Partition part;
// part(A) returns the distributed permutation matrix:
auto P = part(*A);
auto R = transpose(*P);
// Reorder the matrix:
A = product(*R, *product(*A, *P));
// and the RHS vector:
vex::vector<double> new_rhs(ctx, R->loc_rows());
R->move_to_backend(typename DBackend::params());
amgcl::backend::spmv(1, *R, f, 0, new_rhs);
f.swap(new_rhs);
// Update the number of the local rows
// (it may have changed as a result of permutation):
chunk = A->loc_rows();
prof.toc("partition");
}
// Wait for all processes
MPI_Barrier(world);
cout << world.rank << " - After partition" << endl;
// Initialize the solver:
Solver::params prm;
DBackend::params bprm;
bprm.q = ctx;
prof.tic("setup");
// Solver solve(*comm_c_AMGCL, A, prm, bprm);
Solver solve(world, A, prm, bprm);
prof.toc("setup");
cout << world.rank << " - After solver setup" << endl;
// Show the mini-report on the constructed solver:
if (t_p == 0)
std::cout << solve << std::endl;
// Solve the system with the zero initial approximation:
int iters;
double error;
vex::vector<double> x(ctx, chunk);
x = 0.0;
prof.tic("solve");
std::tie(iters, error) = solve(*A, f, x);
prof.toc("solve");
// Output the number of iterations, the relative error,
// and the profiling data:
if (t_p == 0)
std::cout
<< "Iters: " << iters << std::endl
<< "Error: " << error << std::endl
<< prof << std::endl;
}
}
This is my error output:
The following have been reloaded with a version change:
1) cray-libsci/22.11.1.2 => cray-libsci/23.02.1.1
2) cray-mpich/8.1.21 => cray-mpich/8.1.24
3) perftools-base/22.09.0 => perftools-base/23.02.0
4) rocm/5.2.3 => rocm/5.2.0
Currently Loaded Modules:
1) craype-network-ofi 9) cray-mpich/8.1.24
2) craype-x86-trento 10) craype/2.7.19
3) craype-accel-amd-gfx90a 11) perftools-base/23.02.0
4) libfabric/1.15.2.0 12) rocm/5.2.0
5) PrgEnv-cray/8.3.3 13) cpe/23.02
6) cce/15.0.1 14) CPE-23.02-cce-15.0.1-GPU-softs
7) cray-dsmml/0.2.2 15) scotch/6.1.3-mpi
8) cray-libsci/23.02.1.1 16) boost/1.81.0-mpi-python3
cpu-bind=MASK - g1235, task 0 0 [3713689]: mask 0xffffffff00000000ffffffff set
cpu-bind=MASK - g1235, task 1 1 [3713690]: mask 0xffffffff00000000ffffffff00000000 set
terminate called after throwing an instance of 'std::out_of_range'
what(): _Map_base::at
srun: error: g1235: task 1: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=176864.0
slurmstepd: error: *** STEP 176864.0 ON g1235 CANCELLED AT 2023-04-13T16:44:18 ***
srun: error: g1235: task 0: Terminated
srun: Force Terminated StepId=176864.0
Try reading this page: https://amgcl.readthedocs.io/en/latest/tutorial/poisson3DbMPI.html
There I tried to explain what amgcl expects from the partitioned matrix. In short, each MPI process should hold a consecutive row-wise chunk of the matrix, and the columns should use global numbering.
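One frequent cause of an out-of-range error at this point, worth checking since the arrays arrive through an extern "C" interface, is Fortran-style 1-based CSR indices: amgcl expects 0-based row offsets and 0-based global column numbers. A minimal sketch of the shift (`to_zero_based` is a hypothetical helper; it assumes the column indices already use global numbering, so only the 1-based offset is corrected):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: shift Fortran-style 1-based CSR arrays to the
// 0-based form amgcl expects. Column indices are assumed to already be
// global; only the off-by-one from Fortran numbering is corrected.
inline void to_zero_based(std::vector<std::ptrdiff_t> &ptr,
                          std::vector<std::ptrdiff_t> &col) {
    for (auto &p : ptr) --p; // row offsets: 1-based -> 0-based
    for (auto &c : col) --c; // column indices: 1-based -> 0-based
}
```

Such a shift could be applied right after the copy loops, before the arrays are handed to `amgcl::mpi::distributed_matrix`.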
Hello, I would like to know whether you have already used your AMGCL library on AMD graphics cards. It would be for solving a pseudo-Poisson equation in a fluid mechanics code (finite volume) with curvilinear structured meshes, running in parallel on CPU (MPI) and on GPU (for the Poisson solver). Thank you very much for your answer, Pierre