LLNL / AMG2023

Algebraic multigrid solver
https://github.com/LLNL/AMG2023
Apache License 2.0

segmentation fault with >= 64 nodes on Frontier #13

Open BenWibking opened 3 months ago

BenWibking commented 3 months ago

I can run problem 1 successfully on Frontier with fewer than 64 nodes, but I get a segmentation fault with 64 or more nodes:

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000006 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (1600, 1600, 1600)
    (Px, Py, Pz) = (8, 8, 8)

srun: error: frontier04522: tasks 282-287: Segmentation fault
srun: Terminating StepId=2131722.0

with Segmentation fault errors reported for all of the other MPI ranks as well.

I built Hypre v2.31.0 with:

./configure --with-hip --with-gpu-arch=gfx90a --with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" --with-MPI-include="${MPICH_DIR}/include" --enable-mixedint

with cce/17.0.0, rocm/5.7.1, and cray-mpich/8.1.28.

I'm running the problem with:

#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH -N 64

export LD_LIBRARY_PATH=${CRAY_LD_LIBRARY_PATH}:${LD_LIBRARY_PATH}
export MPICH_GPU_SUPPORT_ENABLED=1

srun ./amg -problem 1 -n 200 200 200 -P 8 8 8
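
For readers checking the job geometry, here is a minimal sketch that reproduces the sizes shown in the driver output from the job script and command line above. It assumes, based only on the output in this report, that `-n` gives the per-rank grid and `-P` the process grid; the variable names are illustrative.

```python
# Values copied from the job script and srun line in this report.
nodes = 64             # #SBATCH -N 64
tasks_per_node = 8     # #SBATCH --ntasks-per-node=8
n = (200, 200, 200)    # -n: grid points per rank (assumed meaning)
P = (8, 8, 8)          # -P: process grid (assumed meaning)

ranks_requested = nodes * tasks_per_node
ranks_needed = P[0] * P[1] * P[2]
global_grid = tuple(ni * pi for ni, pi in zip(n, P))

print(ranks_requested, ranks_needed)  # 512 512
print(global_grid)                    # (1600, 1600, 1600)
```

The rank count matches the process grid, and the global grid matches the `(Nx, Ny, Nz)` line printed before the crash, so a mismatch between the job layout and `-P` does not appear to be the cause.
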

ulrikeyang commented 3 months ago

Did you configure hypre with --enable-mixedint? If not, the problem will be too big for 32-bit integers.
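
The arithmetic behind the 32-bit concern can be sketched as follows. This assumes the smaller runs also used `-n 200 200 200` with 8 ranks per node, which the thread does not state; `global_rows` is a hypothetical helper, not part of AMG2023 or hypre.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer: 2147483647

def global_rows(n_per_rank: int, ranks: int) -> int:
    """Global unknowns when every rank owns an n x n x n block."""
    return n_per_rank**3 * ranks

for nodes in (32, 64):
    ranks = nodes * 8  # 8 ranks per node, as in the job script
    rows = global_rows(200, ranks)
    print(nodes, rows, rows > INT32_MAX)
# 32 2048000000 False
# 64 4096000000 True
```

Under that assumption, 32 nodes stay just below the signed 32-bit limit, while 64 nodes (the 1600 x 1600 x 1600 grid) exceed it, which lines up with the reported threshold. Each rank's local block of 8,000,000 rows fits comfortably in 32 bits, so only global indices need the wider type.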


BenWibking commented 3 months ago

I configured it with --enable-mixedint:

./configure --with-hip --with-gpu-arch=gfx90a --with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" --with-MPI-include="${MPICH_DIR}/include" --enable-mixedint

liruipeng commented 3 months ago

Thank you for reporting this issue. I will take a look and get back to you soon.

BenWibking commented 3 months ago

Is there an update on this? I am still seeing this issue on Frontier.