StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Nondeterministic hang in S3D on Frontier #1696

Open elliottslaughter opened 1 month ago

elliottslaughter commented 1 month ago

I am tracking a nondeterministic hang in S3D on Frontier that happens roughly 25% of the time at 256 nodes and above.

This is notable because I thought I had resolved #1657, and in fact I had successfully run that version of the code up to 8,192 nodes on Frontier multiple times. It is possible that my configuration is not as rock solid as I thought it was. However, Seshu has recently been running S3D on a slightly older version of Legion, and so far he has not seen any hangs, even up to 8,192 nodes. He's done enough runs at this point that he probably would have seen the hang given the repro frequencies I'm observing.

Here's what I know about Seshu's vs. my configuration:

Seshu:

My configuration:

At this point the two most likely directions for debugging are bisecting the variables I have set, and bisecting the Legion commit ID.
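
Because the hang is nondeterministic, each bisection step needs several clean runs before a configuration can be called good. A minimal sketch of that arithmetic, assuming runs are independent and that the ~25% hang rate quoted above holds at the scale being tested (`runs_needed` is a hypothetical helper, not part of any tool used here):

```python
import math

def runs_needed(hang_rate, confidence):
    """Smallest n such that a bad configuration would hang at least once
    in n runs with probability >= confidence.

    Assumes independent runs, so P(n clean runs | bad) = (1 - hang_rate) ** n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_rate))

if __name__ == "__main__":
    n = runs_needed(0.25, 0.95)
    # Chance that a bad configuration still survives n runs without hanging.
    print(n, round(0.75 ** n, 4))  # prints: 11 0.0422
```

Under those assumptions, about 11 consecutive clean runs are needed per bisection step before a configuration can be marked good with 95% confidence. A single hang is enough to mark it bad.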

elliottslaughter commented 1 month ago

For posterity:

Seshu's variables:

export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=8M
export MPICH_OFI_NIC_POLICY=BLOCK
# export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
# export FI_CXI_DEFAULT_CQ_SIZE=13107200
# export FI_CXI_REQ_BUF_MIN_POSTED=10
# export FI_CXI_REQ_BUF_SIZE=25165824
# export MPICH_MAX_THREAD_SAFETY=multiple
# export MPICH_OFI_NIC_POLICY=NUMA

My variables:

export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
# export FI_CXI_RDZV_THRESHOLD=256
# export FI_CXI_RDZV_GET_MIN=256
export FI_CXI_OFLOW_BUF_SIZE=16777216
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi3
export GASNET_OFI_DEVICE_TYPE=HRank
# export GASNET_OFI_LIST_DEVICES=1
# export GASNET_SPAWN_VERBOSE=1
# export GASNET_OFI_RECEIVE_BUFF_SIZE=recv
# export GASNET_OFI_NUM_RECEIVE_BUFFS=50000
export GASNET_OFI_NUM_RECEIVE_BUFFS=32M
export MPICH_OFI_NIC_POLICY=USER
export MPICH_OFI_NIC_MAPPING=1:0-1
# export MPICH_OFI_VERBOSE=1
# export MPICH_OFI_NIC_VERBOSE=2
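
To narrow the variable bisection, it may help to list exactly which active settings differ between the two lists above. A rough sketch, not a definitive tool: `parse_exports` and `diff_env` are hypothetical helper names, and commented-out `# export` lines are treated as unset.

```python
def parse_exports(text):
    """Return {name: value} for each active (uncommented) `export NAME=VALUE` line."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("export ") and "=" in line:
            name, value = line[len("export "):].split("=", 1)
            env[name.strip()] = value.strip()
    return env

def diff_env(a, b):
    """Return sorted (name, value_in_a, value_in_b) tuples for variables that differ.

    A value of None means the variable is unset in that configuration.
    """
    return [(k, a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)]

# Active settings copied from the two lists above.
seshu = parse_exports("""
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi1
export GASNET_OFI_DEVICE_2=cxi3
export GASNET_OFI_DEVICE_3=cxi0
export GASNET_OFI_DEVICE_TYPE=Node
export GASNET_OFI_RECEIVE_BUFF_SIZE=8M
export MPICH_OFI_NIC_POLICY=BLOCK
""")
mine = parse_exports("""
export FI_MR_CACHE_MONITOR=memhooks
export FI_CXI_RX_MATCH_MODE=software
export FI_CXI_OFLOW_BUF_SIZE=16777216
export GASNET_OFI_DEVICE_0=cxi2
export GASNET_OFI_DEVICE_1=cxi3
export GASNET_OFI_DEVICE_TYPE=HRank
export GASNET_OFI_NUM_RECEIVE_BUFFS=32M
export MPICH_OFI_NIC_POLICY=USER
export MPICH_OFI_NIC_MAPPING=1:0-1
""")

if __name__ == "__main__":
    for name, theirs, ours in diff_env(seshu, mine):
        print(f"{name}: seshu={theirs} mine={ours}")
```

Each differing variable is a candidate for the bisection; the three settings that match (`FI_MR_CACHE_MONITOR`, `FI_CXI_RX_MATCH_MODE`, `GASNET_OFI_DEVICE_0`) can be ruled out immediately.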

elliottslaughter commented 1 month ago

Backtraces for a frozen 1024-node job, in my configuration. Note that this run did not use -ll:force_kthreads, so the usefulness of these backtraces is somewhat dubious, but at least I do not see frozen network threads here:

http://sapling2.stanford.edu/~eslaught/bug1696/s3d-2024-05-retest-try2-ammonia-scaling-gpu-kinds-32M/bt1024-1/