I'm running a case on ALCF's Polaris machine with v24.0.1 (sha:a869ca69): 2.5M elements, 64 nodes, 4 ranks per node, polynomial order 9. I get the following error:
pack/unpack host + hostBuffer MPI using pw: 5.7752e-03s
pack/unpack device + hostBuffer MPI using pw: 2.1447e-03s
pack/unpack device + hostBuffer MPI using nbc: 2.1390e-03s
pack/unpack device + deviceBuffer MPI using pw: 1.1417e-03s
MPI min/max/avg: 3.14e-05s 7.60e-04s 3.87e-04s / avg bi-bw: 14.5GB/s/rank
autotuning gs for wordSize=8 nFields=1
local: 1.8678e-04s (556.3GB/s)
pack/unpack host + hostBuffer MPI using pw: 1.8038e-03s
pack/unpack device + hostBuffer MPI using pw: 9.2357e-04s
pack/unpack device + hostBuffer MPI using nbc: 8.2887e-04s
pack/unpack device + deviceBuffer MPI using pw: 3.8060e-04
MPI min/max/avg: 2.43e-05s 2.49e-04s 1.40e-04s / avg bi-bw: 14.7GB/s/rank
Checking restart options: reCyc_LM0.fld INT TIME=0
Reading checkpoint data
call gfldr reCyc_LM0.fld
Error in crystal_router: rank = 49 send_n = 2197376280 (> INT_MAX)
MPICH ERROR [Rank 49] [job id 26d6648b-e715-408b-baee-48ecdfca6968] [Mon Sep 23 02:49:15 2024] [x3102c0s7b0n0] - Abort(1) (rank 49 in comm 848): application called MPI_Abort(comm=0xC4000025, 1) - process 49
I'm doing a restart in which I interpolate a velocity field from another simulation, run on a different, smaller mesh (37K elements, N=9), onto the larger mesh. Is this error related to the restart?
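For reference, here is a quick check of how far the reported send_n is over the limit. This is only a sketch in Python: it assumes INT_MAX is the usual 32-bit signed maximum (2^31 - 1), and the variable names are mine, not taken from the code.

```python
# Value reported by rank 49 in the crystal_router error message above.
send_n = 2197376280

# Assumption: INT_MAX is the 32-bit signed maximum, as on typical platforms.
INT_MAX = 2**31 - 1

# How far the requested send count exceeds the 32-bit limit.
overflow = send_n - INT_MAX

print(f"INT_MAX  = {INT_MAX}")
print(f"send_n   = {send_n}")
print(f"exceeds limit: {send_n > INT_MAX} (by {overflow})")
```

So the count on that rank is only about 2% over the limit, which is why I suspect the problem is tied to how the restart data is distributed.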