Nek5000 / nekRS

our next-generation fast and scalable CFD code
https://nek5000.mcs.anl.gov/

Error in crystal_router: #588

Closed: spatel81 closed this issue 2 months ago

spatel81 commented 2 months ago

I'm running a case on ALCF's Polaris machine with v24.0.1 (sha: a869ca69): 2.5M elements, 64 nodes, 4 ranks per node, polynomial order 9. I get the following error:

pack/unpack host + hostBuffer MPI using pw: 5.7752e-03s 
pack/unpack device + hostBuffer MPI using pw: 2.1447e-03s 
pack/unpack device + hostBuffer MPI using nbc: 2.1390e-03s 
pack/unpack device + deviceBuffer MPI using pw: 1.1417e-03s 
MPI min/max/avg: 3.14e-05s 7.60e-04s 3.87e-04s / avg bi-bw: 14.5GB/s/rank
autotuning gs for wordSize=8 nFields=1 
local: 1.8678e-04s (556.3GB/s)
pack/unpack host + hostBuffer MPI using pw: 1.8038e-03s 
pack/unpack device + hostBuffer MPI using pw: 9.2357e-04s 
pack/unpack device + hostBuffer MPI using nbc: 8.2887e-04s 
pack/unpack device + deviceBuffer MPI using pw: 3.8060e-04s 
MPI min/max/avg: 2.43e-05s 2.49e-04s 1.40e-04s / avg bi-bw: 14.7GB/s/rank
 Checking restart options: reCyc_LM0.fld  INT TIME=0                                                                
 Reading checkpoint data 
 call gfldr reCyc_LM0.fld
Error in crystal_router: rank = 49 send_n = 2197376280 (> INT_MAX)
MPICH ERROR [Rank 49] [job id 26d6648b-e715-408b-baee-48ecdfca6968] [Mon Sep 23 02:49:15 2024] [x3102c0s7b0n0] - Abort(1) (rank 49 in comm 848): application called MPI_Abort(comm=0xC4000025, 1) - process 49

I'm doing a restart where I interpolate a velocity field from another simulation on a different, smaller mesh (37K elements, N=9) onto the larger mesh. Is this error related to the restart?
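For context: at polynomial order 9 each element carries (9+1)^3 = 1000 grid points, so the 2.5M-element target mesh holds roughly 2.5e9 values per field, and INT_MAX is 2,147,483,647. A single crystal-router message carrying even one field's worth of data (or the point tuples for the interpolation) can therefore exceed a 32-bit count, which is consistent with the reported send_n = 2,197,376,280. The sketch below is a minimal illustration of the kind of guard that emits this message, assuming a gslib-style crystal router whose MPI calls take 32-bit int counts; checked_send is a hypothetical wrapper, not the actual gslib source.

```c
/* Minimal sketch (hypothetical, not the gslib code) of a guard that
 * emits the error above: MPI's classic send/recv calls take an int
 * count, so a message with more than INT_MAX entries cannot be
 * expressed and has to be rejected. */
#include <limits.h>
#include <stdio.h>
#include <mpi.h>

static void checked_send(const void *buf, size_t send_n, MPI_Datatype type,
                         int dest, int tag, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  if (send_n > (size_t)INT_MAX) {  /* would overflow the 32-bit count */
    fprintf(stderr,
            "Error in crystal_router: rank = %d send_n = %zu (> INT_MAX)\n",
            rank, send_n);
    MPI_Abort(comm, 1);            /* produces the MPICH abort seen in the log */
  }
  MPI_Send(buf, (int)send_n, type, dest, tag, comm);
}
```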

spatel81 commented 2 months ago

For what it's worth, the error does not appear when I am not doing a restart.
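That is consistent with the overflow above: gfldr routes the restart field through the crystal router only when a checkpoint is read, so the oversized single message never arises otherwise. As a general pattern (shown only as an illustration, not as a nekRS/gslib patch), a transfer that exceeds the 32-bit count can be split into chunks that each fit an int:

```c
/* Hypothetical workaround sketch: split a transfer larger than INT_MAX
 * bytes into chunks that each fit MPI's 32-bit count. Illustrative of
 * the general large-message pattern only. */
#include <limits.h>
#include <mpi.h>

static void chunked_send(const char *buf, size_t nbytes, int dest, int tag,
                         MPI_Comm comm)
{
  size_t off = 0;
  do {
    size_t n = nbytes - off;
    if (n > (size_t)INT_MAX) n = (size_t)INT_MAX; /* largest legal int count */
    MPI_Send(buf + off, (int)n, MPI_BYTE, dest, tag, comm);
    off += n;
  } while (off < nbytes); /* receiver loops with matching MPI_Recv sizes */
}
```

MPI-4 alternatively provides large-count variants (e.g. MPI_Send_c, which takes an MPI_Count instead of an int), removing the 32-bit limit at the API level where the MPI implementation supports it.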