KennethEJansen opened 8 months ago
@jedbrown @jrwrigh feel free to describe anything that @brtnfld and others developing CGNS might need to know that I missed in the above description
Do only a few ranks print that message? Some ranks may not have anything to read at that scale, and we may not have the correct bail-out for that situation.
I've run CGNS with 43k ranks with no issue, but that was 43k ranks reading a file created with 43k ranks. Compiling CGNS with -DADFH_DEBUG_ON, or uncommenting #define ADFH_DEBUG_ON in ADFH.c, might help with the diagnostics. However, that will produce a ton of output at that number of ranks. It might be helpful to determine the smallest rank count at which the problem occurs.
If you can provide me access to the file on Aurora, I can look into it. If you have a simple reproducer, that would also help.
Let me know when you get to the DAOS phase, as CGNS will need the fixes mentioned in #613. I will try to get the fixes in branch CGNS218 into develop. If you continue with Lustre, you will likely want to consider using HDF5 subfiling.
Thanks for the advice. I did not realize that CGNS did anything differently when reading with m processes a file that was written by n processes, with n not equal to m. I thought there was no concept of a prior partition.
Answering your first question:
```
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "CGNS error 1 mismatch" JZ1536Nodes0515_240108.o618823 |wc
48 672 3984
```
suggests that only 48 of the 18432 processes I expected to participate in reading the file (that is at least how I chunked it out on each read line) are reporting this error, though this also relies on PETSc error reporting, e.g.
```
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "14163" JZ1536Nodes0515_240108.o618823 |grep -v ": -" |grep -v ":-"
[14163]PETSC ERROR: Error in external library
[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14163]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[14163]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[14163]PETSC ERROR: Petsc Development GIT revision: v3.19.5-1858-g581ad989054 GIT Date: 2024-02-12 14:59:06 -0700
[14163]PETSC ERROR: Configure options --with-debugging=0 --with-mpiexec-tail=gpu_tile_compact.sh --with-64-bit-indices --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --COPTFLAGS=-O2 --CXXOPTFLAGS=-O2 --FOPTFLAGS=-O2 --SYCLPPFLAGS=-Wno-tautological-constant-compare --SYCLOPTFLAGS=-O2 --download-kokkos --download-kokkos-kernels --download-kokkos-commit=origin/develop --download-kokkos-kernels-commit=origin/develop --download-hdf5 --download-cgns --download-metis --download-parmetis --download-ptscotch=../scotch_7.0.4beta3.tar.gz --with-sycl --with-syclc=icpx --with-sycl-arch=pvc --PETSC_ARCH=05-15_RB240108_B_JZ
[14163]PETSC ERROR: #1 DMPlexCreateCGNSFromFile_Internal() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/cgns/plexcgns2.c:187
[14163]PETSC ERROR: #2 DMPlexCreateCGNSFromFile() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcgns.c:29
[14163]PETSC ERROR: #3 DMPlexCreateFromFile() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:5921
[14163]PETSC ERROR: #4 DMPlexCreateFromOptions_Internal() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:3943
[14163]PETSC ERROR: #5 DMSetFromOptions_Plex() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:4465
[14163]PETSC ERROR: #6 DMSetFromOptions() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/interface/dm.c:905
[14163]PETSC ERROR: #7 CreateDM() at /lus/gecko/projects/PHASTA_aesp_CNDA/libCEED_0515006_240108_JZ/examples/fluids/src/setupdm.c:36
[14163]PETSC ERROR: #8 main() at /lus/gecko/projects/PHASTA_aesp_CNDA/libCEED_0515006_240108_JZ/examples/fluids/navierstokes.c:159
[14163]PETSC ERROR: PETSc Option Table entries:
Abort(76) on node 14163 (rank 0 in comm 16): application called MPI_Abort(MPI_COMM_SELF, 76) - process 0
```
So there are 48 processes that report the CGNS error:
```
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "CGNS error 1 mismatch" JZ1536Nodes0515_240108.o618823
[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14174]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14164]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14175]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14167]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14179]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14168]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14180]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14182]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14171]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14172]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14160]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14173]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14161]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14162]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14176]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14177]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14178]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14165]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14181]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14166]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14183]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14169]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14170]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14184]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14185]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14186]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14196]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14187]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14197]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14188]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14198]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14189]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14199]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14190]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14200]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14191]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14201]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14192]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14202]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14193]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14203]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14194]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14204]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14195]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14205]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14206]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14207]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
```
I have not sorted these, but the tight rank range (14160-14207) suggests it might be just 4 nodes whose 12 processes each are not happy? @jedbrown or @jrwrigh will know better, but, for example,
```
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "13163" JZ1536Nodes0515_240108.o618823
```
returns nothing, so it seems the other processes (the other 1532 nodes, if these ranks are indeed all on the same 4 nodes) are not getting this error, or at least not reporting it through PETSc.
Can you explain "Some ranks may not have something to read at that scale"? I think we spread the node and element read ranges out evenly, even for boundary elements (which were a rendezvous headache to get back onto ranks that better correspond to their node range).
Thanks for the flags to get more verbose output even if it will be a huge haystack to sift for the needle.
Since it works fine at 96, 192, 384, and 768 nodes, I don't know how to make a small reproducer.
If you have an account on Aurora, give me your username and I will ask support to add you to our group as these files are in our group space and readable by anyone I add.
Can you point us to information on HDF5 subfiling? This might be more promising than debugging a case that is beyond the limits of Lustre.
You are correct. By default, there are no rank dependencies for a CGNS file unless an application introduces such dependencies, such as different zones for each rank.
I wanted to know whether the data was partitioned for the larger-scale case in a way that could leave some ranks with nothing to read.
Do you always need to double the nodes for the next rank jump? For example, can't you run with 576 nodes?
General subfiling info is here: https://github.com/HDFGroup/hdf5doc/blob/master/RFCs/HDF5_Library/VFD_Subfiling/user_guide/HDF5_Subfiling_VFD_User_s_Guide.pdf
I've not merged the CGNS "subfiling" branch into develop. I've tested it on Summit and Frontier and will have some Aurora results shortly. I still need to document its usage and best practices.
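For context, here is a minimal sketch of what enabling the Subfiling VFD looks like at the plain-HDF5 level, assuming HDF5 >= 1.14 built with subfiling support; the file name and default config are illustrative. CGNS builds its file-access property list internally, which is what the unmerged "subfiling" branch has to wire up, so this is not how you would invoke it through CGNS today.

```c
#include <mpi.h>
#include <hdf5.h>
#include <H5FDsubfiling.h>

int main(int argc, char **argv)
{
    int provided;
    /* The Subfiling VFD runs I/O-concentrator threads, so it requires
     * MPI_THREAD_MULTIPLE rather than plain MPI_Init(). */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* NULL selects the library defaults (one I/O concentrator per node,
     * default stripe size); pass an H5FD_subfiling_config_t to tune. */
    H5Pset_fapl_subfiling(fapl, NULL);

    hid_t file = H5Fcreate("grid.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... collective dataset reads/writes as usual ... */
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```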
If you list "ls -tr" home on Aurora, my username is obvious. Otherwise, I can send it to you offline.
Which version of CGNS and HDF5 are you using?
Thanks again for the response.
Our file is "flat" in the sense that it is a single zone and we are expecting all ranks to read a range of the data that is size/nranks.
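For concreteness, a sketch of the size/nranks chunking described above (the helper name is illustrative, not the actual reader code):

```c
/* Each rank reads the contiguous [start, end) slice of a single flat
 * zone; remainders go to the first (nentries % nranks) ranks. With this
 * scheme every rank has a nonempty range whenever nentries >= nranks,
 * which is why "some ranks have nothing to read" should not occur here. */
static void read_range(long long nentries, int nranks, int rank,
                       long long *start, long long *end)
{
    long long base = nentries / nranks;
    long long rem  = nentries % nranks;
    *start = (long long)rank * base + (rank < rem ? rank : rem);
    *end   = *start + base + (rank < rem ? 1 : 0);
}
```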
No requirement to double; it's just what I usually do. In any event, 1536 nodes is not as big as we want to go, but sure, we can try 1024 or any other count between what works at 768 and what fails at 1536.
Thanks for the link and the status update; yes, I am eager for documentation on its usage and best practices, as I am very much a CGNS newbie (who dove into exascale usage as a first experience).
I will find your username and request that you be added to our projects shortly.
The request for you to be added has been sent, but no response yet, so it might be a while. In the interim, @jedbrown suggested
```
lfs setstripe -c 16 .
```
to set the directory's Lustre striping, then copying the file so that it picks up those properties; we are testing to see whether that improves things. Do you have any advice on whether those are the best settings for Aurora?
A stripe count of 16 is a good starting point; I've seen good results on Frontier with a stripe count of 64 and a stripe size of 16 MiB.
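As an aside, a hedged sketch of setting the same striping programmatically: "striping_factor" and "striping_unit" are standard ROMIO hints that can ride on the HDF5 MPI-IO file-access list. They only take effect when the file is created on Lustre, and whether CGNS exposes a hook for the MPI_Info object depends on the build; the helper name here is hypothetical.

```c
#include <mpi.h>
#include <hdf5.h>

hid_t fapl_with_striping(MPI_Comm comm)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");     /* stripe count */
    MPI_Info_set(info, "striping_unit", "16777216"); /* 16 MiB stripes */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info); /* hints travel with the fapl */
    MPI_Info_free(&info);
    return fapl;
}
```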
Which version of HDF5 are you using?
In the spirit of push-it-until-it-breaks mode, @jedbrown suggested a stripe count of -1 (stripe across all available OSTs), and this produced a hang with 192 nodes (each with 12 processes) reading a file originally written with a stripe count of 16 but then "copied" after setting the directory to -1:
```
#0 0x00001502efe3cee1 in MPIR_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#1 0x00001502ef445c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#2 0x00001502f38a88f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#3 0x00001502e3855c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#4 0x00001502e35fce32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#5 0x00001502e35e3430 in H5F_shared_select_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#6 0x00001502e35902bf in H5D__contig_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#7 0x00001502e35a4c7b in H5D__read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#8 0x00001502e380f7ec in H5VL__native_dataset_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#9 0x00001502e37fade3 in H5VL_dataset_read_direct () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#10 0x00001502e357514e in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#11 0x00001502e3574c9d in H5Dread () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#12 0x00001502e39f4cde in readwrite_data_parallel () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libcgns.so.4.3
No symbol table info available.
#13 0x00001502e39f601a in cgp_elements_read_data () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libcgns.so.4.3
No symbol table info available.
#14 0x000015030dc421e0 in DMPlexCreateCGNS_Internal () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#15 0x000015030daaacc6 in DMPlexCreateCGNS () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#16 0x000015030dc418b5 in DMPlexCreateCGNSFromFile_Internal () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#17 0x000015030daaac76 in DMPlexCreateCGNSFromFile () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#18 0x000015030daccfd0 in DMPlexCreateFromFile () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#19 0x000015030dad4fc7 in DMSetFromOptions_Plex () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#20 0x000015030d9435f9 in DMSetFromOptions () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#21 0x000000000047d99d in CreateDM ()
No symbol table info available.
#22 0x000000000040c2b1 in main ()
No symbol table info available.
[Inferior 1 (process 4399) detached]
[New LWP 4476]
[New LWP 4488]
```
I have 12 of these in each of the 192 node files for us to digest.
Answering your question: PETSc "chooses" the version of HDF5, and it is hdf5-1.14.3-p1, or from the configure log:
```
install: Retrieving https://web.cels.anl.gov/projects/petsc/download/externalpackages/hdf5-1.14.3-p1.tar.bz2 as tarball to /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_B/externalpackages/_d_hdf5-1.14.3-p1.tar.bz2
```
```
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep MPIR_Allreduce |wc
1496 10472 251977
```
so that leaves 808 processes (12*192-1496) doing something else.
For reasons we are still sorting out, we also seem to get 12 control processes; these are filtered out below by grep -v wait4, since their backtrace frame #0 is wait4:
```
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4 |head
out192ReadHang..0:#0 0x00001471d09030a9 in poll () from /lib64/libc.so.6
out192ReadHang..0:#0 ofi_genlock_lock (lock=0x4d63490) at ./include/ofi_lock.h:359
out192ReadHang..0:#0 0x0000148ddc7403e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..0:#0 0x000015230a8dc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..1:#0 0x000014abd7887d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..1:#0 0x0000150ec7027af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x4981620, ctrl_evtq=0x4947618, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x4ec9900, ctrl_evtq=0x4ed3698, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 0x0000152af605d0ed in ofi_cq_readfrom (cq_fid=0x5ce1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..1:#0 cxip_ep_ctrl_progress (ep_obj=0x5583110) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..10:#0 0x0000146f67dd2b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4 |wc
809 6030 100729
```
Diving in on the variation of states for node 0 to see if that tells us anything (here "us" means somebody else, because if I understood it I would not be dumping all this stuff here):
```
kjansen@aurora-uan-0009:~> grep "#1 " out192ReadHang..0
kjansen@aurora-uan-0009:~> grep "#2 " out192ReadHang..0
kjansen@aurora-uan-0009:~> grep "#3 " out192ReadHang..0
kjansen@aurora-uan-0009:~> grep "#4 " out192ReadHang..0
Slicing the other way, here is where the first 200 or so of the 809 processes that are not at the MPIR_Allreduce are stuck, in case that tells anyone anything (I can provide more, obviously, but I am unsure how helpful this is).
grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4 > 809DoingWhat.log
pasted lines from 809DoingWhat.log
```
out192ReadHang..0:#0 0x00001471d09030a9 in poll () from /lib64/libc.so.6
out192ReadHang..0:#0 ofi_genlock_lock (lock=0x4d63490) at ./include/ofi_lock.h:359
out192ReadHang..0:#0 0x0000148ddc7403e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..0:#0 0x000015230a8dc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..1:#0 0x000014abd7887d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..1:#0 0x0000150ec7027af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x4981620, ctrl_evtq=0x4947618, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x4ec9900, ctrl_evtq=0x4ed3698, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0 0x0000152af605d0ed in ofi_cq_readfrom (cq_fid=0x5ce1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..1:#0 cxip_ep_ctrl_progress (ep_obj=0x5583110) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..10:#0 0x0000146f67dd2b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..10:#0 ofi_mutex_lock_noop (lock=0x5ec5668) at ./include/ofi_lock.h:295
out192ReadHang..10:#0 0x000014aaec80c0bf in ofi_cq_readfrom (cq_fid=0x50af420, buf=0x7ffc37947580, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..10:#0 0x0000145cac7e3566 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..10:#0 0x000014c0aebf10f1 in ofi_cq_readfrom (cq_fid=0x519d460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..10:#0 0x00001539b2d949b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..100:#0 cxip_cq_progress (cq=0x409d290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..100:#0 cxip_util_cq_progress (util_cq=0x57e7460) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..100:#0 0x0000145664c7a919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..100:#0 0x000014d4af3acc30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..100:#0 0x000014d9a26a40ed in ofi_cq_readfrom (cq_fid=0x56df250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..100:#0 0x0000153bdb63a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..101:#0 0x0000150e3e649d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..101:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x42d5660, ctrl_evtq=0x42a0d98, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..101:#0 0x0000146f7d5038b4 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..101:#0 0x0000147d3d6120f1 in ofi_cq_readfrom (cq_fid=0x47a1330, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..102:#0 0x00001525be2920bf in ofi_cq_readfrom (cq_fid=0x5088420, buf=0x7ffdd9447b20, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..102:#0 0x000014899009b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0 0x000015489fce19b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0 cxip_cq_progress (cq=0x5d019a0) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..102:#0 0x000014f4800f59b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0 0x000014ee3525c9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0 0x0000154218fdf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0 0x0000150b35b54346 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..103:#0 0x0000154218fdf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0 0x0000150b35b54346 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..103:#0 0x00001478d5cc9c30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..103:#0 0x000014603229e0ed in ofi_cq_readfrom (cq_fid=0x4d25460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..103:#0 0x000014b9f43b19b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..104:#0 ofi_mutex_lock_noop (lock=0x49286a8) at ./include/ofi_lock.h:295
out192ReadHang..104:#0 0x0000149a9598a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..104:#0 0x000014b804b85539 in cxip_cq_eq_progress (eq=0x5e90710, cq=0x5e905f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..104:#0 ofi_cq_read (cq_fid=0x58ed630, buf=0x7fff68146540, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..104:#0 0x00001464285c1539 in cxip_cq_eq_progress (eq=0x4b92750, cq=0x4b92630) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..104:#0 0x00001528fa0759b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..105:#0 cxip_util_cq_progress (util_cq=0x5816460) at prov/cxi/src/cxip_cq.c:560
out192ReadHang..105:#0 cxip_cq_progress (cq=0x5c85630) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..105:#0 0x00001509c72714e7 in cxip_cq_progress (cq=0x4804290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..106:#0 0x0000150f8fd1bd19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0 0x000014bf20773919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0 0x00001481f772a9a0 in __tls_get_addr_slow () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..106:#0 0x0000153ab4496915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0 0x00001553d92119b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..107:#0 0x0000152762f5e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..107:#0 0x00001483c395d274 in ofi_genlock_unlock (lock=0x5589490) at ./include/ofi_lock.h:364
out192ReadHang..107:#0 0x00001532304864e7 in cxip_cq_progress (cq=0x5410460) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..107:#0 cxip_ep_ctrl_progress (ep_obj=0x5ae37f0) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..107:#0 0x00001482ce11228a in ofi_cq_readfrom (cq_fid=0x5948290, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..108:#0 0x00001515dad2b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0 0x00001520cb3bd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0 0x0000150ff06289b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0 0x0000151eb4ef9539 in cxip_cq_eq_progress (eq=0x4aca3b0, cq=0x4aca290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..109:#0 0x0000146ca368f566 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..109:#0 0x0000145cd506d9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..109:#0 0x0000150f6a5bf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..11:#0 0x0000147b94c080ed in ofi_cq_readfrom (cq_fid=0x5a25b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..110:#0 0x0000149dc25eec30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..110:#0 0x000014c9c7dc1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..110:#0 0x00001518d72fa9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..110:#0 0x0000145be20c353d in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..110:#0 0x00001518d72fa9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..110:#0 0x0000145be20c353d in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..111:#0 ofi_genlock_lock (lock=0x45c9660) at ./include/ofi_lock.h:359
out192ReadHang..111:#0 0x0000147f9d6259b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..111:#0 0x000014faf1c6528a in ofi_cq_readfrom (cq_fid=0x462c250, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..111:#0 0x00001466859449a0 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..112:#0 cxip_cq_progress (cq=0x5712420) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..112:#0 0x0000149e042559b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..113:#0 0x0000154229929d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..113:#0 0x0000154391bde9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..113:#0 0x0000154cd50149b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..114:#0 0x00001456782a04e7 in cxip_cq_progress (cq=0x520f420) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..114:#0 0x000014e5f6b7c0ed in ofi_cq_readfrom (cq_fid=0x49c5b00, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0 ofi_mutex_lock_noop (lock=0x507bb58) at ./include/ofi_lock.h:295
out192ReadHang..114:#0 0x000014e873c7a0ed in ofi_cq_readfrom (cq_fid=0x4e12250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0 0x00001489c764e0ed in ofi_cq_readfrom (cq_fid=0x5c06290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0 0x000014894e1dd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..115:#0 0x0000147366ba46e8 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..115:#0 0x00001523097139b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..115:#0 0x000014d5ff3cc234 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..115:#0 0x0000151568a4a236 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..116:#0 0x000014e099b210bf in ofi_cq_readfrom (cq_fid=0x47cf290, buf=0x7ffd19693650, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..116:#0 0x000014721257b4e7 in cxip_cq_progress (cq=0x5d84a60) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..116:#0 0x000014e0591139b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..117:#0 0x0000150573ad150a in cxip_cq_eq_progress (eq=0x48c3c20, cq=0x48c3b00) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..117:#0 ofi_cq_read (cq_fid=0x43d3b00, buf=0x7fffafe2ee60, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..117:#0 ofi_genlock_unlock (lock=0x483d300) at ./include/ofi_lock.h:364
out192ReadHang..117:#0 0x000014c51ee5a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..117:#0 0x0000150e027f3d2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..117:#0 0x0000151f2d4ae0ed in ofi_cq_readfrom (cq_fid=0x4c98290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..118:#0 0x000014d1e28e39b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..118:#0 0x00001552246ce109 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0 ofi_mutex_lock_noop (lock=0x48e24d8) at ./include/ofi_lock.h:295
out192ReadHang..118:#0 0x000014a6c8c95d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0 cxip_cq_eq_progress (eq=0x4352920, cq=0x4352800) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..118:#0 0x000014a6c8c95d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0 cxip_cq_eq_progress (eq=0x4352920, cq=0x4352800) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..118:#0 0x00001468bc18dd19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..119:#0 0x000014564b3160ed in ofi_cq_readfrom (cq_fid=0x49e95f0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..119:#0 0x0000147c53dfd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..119:#0 0x000014cb837cb9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..119:#0 ofi_cq_read (cq_fid=0x519a630, buf=0x7ffe6be63ac0, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..119:#0 0x000014965f1889b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0 0x000014dd6c0f79b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0 0x000014c89cc6aca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..12:#0 0x00001457997e89b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0 ofi_cq_read (cq_fid=0x41e9610, buf=0x7ffcb0f42160, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..120:#0 0x0000149836cc99b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..120:#0 cxip_ep_ctrl_progress (ep_obj=0x402b660) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..120:#0 0x0000145f480fb549 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..120:#0 0x0000151a30fe4919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..121:#0 ofi_mutex_lock_noop (lock=0x40fc6a8) at ./include/ofi_lock.h:295
out192ReadHang..121:#0 ofi_genlock_lock (lock=0x5ae0660) at ./include/ofi_lock.h:359
out192ReadHang..121:#0 0x000014f9853c2539 in cxip_cq_eq_progress (eq=0x507a580, cq=0x507a460) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..121:#0 cxip_cq_eq_progress (eq=0x445a710, cq=0x445a5f0) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..122:#0 0x000014559bf929b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..122:#0 0x000014643c6bc0ed in ofi_cq_readfrom (cq_fid=0x4523460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..123:#0 0x000014e5bb4d2512 in cxip_cq_eq_progress (eq=0x4a7cbe0, cq=0x4a7cac0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..123:#0 0x000014fdf760a543 in cxi_eq_peek_event (eq=0x5430a58) at /usr/include/cxi_prov_hw.h:1537
out192ReadHang..123:#0 0x000014b1ee9a4539 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..123:#0 0x00001497fc97f284 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..124:#0 0x0000150054cd10f1 in ofi_cq_readfrom (cq_fid=0x49cc460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..124:#0 0x000014b9823040ed in ofi_cq_readfrom (cq_fid=0x5ba4630, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..124:#0 0x000014ae3df3b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0 0x000014a674ae3b12 in cxi_eq_peek_event (eq=0x5bd8c58) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..125:#0 cxip_ep_ctrl_progress (ep_obj=0x58d4e20) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..125:#0 ofi_mutex_lock_noop (lock=0x47b7b58) at ./include/ofi_lock.h:295
out192ReadHang..125:#0 0x000014c08b0609b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0 0x00001458c92e728a in ofi_cq_readfrom (cq_fid=0x3fdd460, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..125:#0 0x000014ec3e0fc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0 0x00001542132729b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0 0x000014ec3e0fc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0 0x00001542132729b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0 0x000014807f36e0bf in ofi_cq_readfrom (cq_fid=0x4ab8420, buf=0x7ffe9110a7e0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..126:#0 0x000014bc0410d0ed in ofi_cq_readfrom (cq_fid=0x4c4a250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..126:#0 0x0000152dcfe4d0ed in ofi_cq_readfrom (cq_fid=0x4f89290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..126:#0 0x0000151fa6123d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..127:#0 0x000014590b29f0ed in ofi_cq_readfrom (cq_fid=0x46f1250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..127:#0 0x00001461aaf4809b in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..127:#0 0x000014d96066f234 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..127:#0 0x000014c7812ce0ed in ofi_cq_readfrom (cq_fid=0x41f6ae0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..127:#0 0x0000149b6f0399b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..128:#0 0x000014788a0910ed in ofi_cq_readfrom (cq_fid=0x5af1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..128:#0 0x0000153a352b60ed in ofi_cq_readfrom (cq_fid=0x58d8420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..128:#0 0x000014bb79165b12 in cxi_eq_peek_event (eq=0x4197de8) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..128:#0 cxip_util_cq_progress (util_cq=0x3e58aa0) at prov/cxi/src/cxip_cq.c:560
out192ReadHang..128:#0 0x0000152ea5d70301 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..128:#0 cxi_eq_peek_event (eq=0x43eb948) at /usr/include/cxi_prov_hw.h:1532
out192ReadHang..129:#0 0x00001493da6640bf in ofi_cq_readfrom (cq_fid=0x3e1aac0, buf=0x7fff5e0fb9c0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..129:#0 0x0000149f09d6b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..129:#0 0x00001552d6b419b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..129:#0 0x0000151613b123e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..13:#0 0x000014c05360f9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..13:#0 0x0000147d213680fd in ofi_genlock_unlock (lock=0x4210660) at ./include/ofi_lock.h:364
out192ReadHang..13:#0 0x0000152a1dd95506 in cxip_cq_eq_progress (eq=0x54f23b0, cq=0x54f2290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..13:#0 0x00001467aed7b0ed in ofi_cq_readfrom (cq_fid=0x4cbb250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..130:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x5669660, ctrl_evtq=0x5634d78, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..130:#0 0x000014e57a5deb8e in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=false, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..130:#0 0x000015160ec5b0fd in ofi_genlock_unlock (lock=0x48882c0) at ./include/ofi_lock.h:364
out192ReadHang..130:#0 0x0000146e742e5c30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..131:#0 cxip_ep_ctrl_progress (ep_obj=0x5aa5490) at prov/cxi/src/cxip_ctrl.c:372
out192ReadHang..131:#0 cxip_util_cq_progress (util_cq=0x581c250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..131:#0 0x000014af9e537915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..131:#0 cxip_util_cq_progress (util_cq=0x581c250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..131:#0 0x000014af9e537915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..131:#0 0x000014725d5070ed in ofi_cq_readfrom (cq_fid=0x5a793a0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..132:#0 0x00001522fe65ab8e in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=false, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..132:#0 0x0000148ea6706506 in cxip_cq_eq_progress (eq=0x43ba540, cq=0x43ba420) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..132:#0 0x0000151bbd27e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..132:#0 0x00001461fa023b12 in cxi_eq_peek_event (eq=0x460d9a8) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..133:#0 0x00001499d046309b in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..133:#0 0x00001463b556e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..133:#0 cxip_ep_ctrl_progress (ep_obj=0x5956660) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..133:#0 0x0000148aa41da6e9 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..134:#0 0x000014b4f89f4d2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0 0x00001553cb443d1f in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0 0x00001495091089b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..134:#0 0x000014f919324af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x468c620, ctrl_evtq=0x4657bf8, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..134:#0 0x000014a449eadd2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0 0x000014b30d5a7274 in ofi_genlock_unlock (lock=0x559e490) at ./include/ofi_lock.h:364
out192ReadHang..135:#0 0x0000146b816d4919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..135:#0 0x0000149bc645d0bf in ofi_cq_readfrom (cq_fid=0x5bfe460, buf=0x7ffd30b561a0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..135:#0 cxi_eq_peek_event (eq=0x54f7aa8) at /usr/include/cxi_prov_hw.h:1532
out192ReadHang..135:#0 0x000014f845c67fec in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..135:#0 0x000014d66fc9e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..136:#0 0x0000150caa4159b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..136:#0 cxip_cq_progress (cq=0x5162460) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..136:#0 ofi_mutex_lock_noop (lock=0x448f308) at ./include/ofi_lock.h:295
out192ReadHang..136:#0 cxip_cq_progress (cq=0x4f18250) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..137:#0 0x00001525f13e6270 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..137:#0 0x00001485304f29b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..137:#0 0x000014f8069c3506 in cxip_cq_eq_progress (eq=0x4d6b710, cq=0x4d6b5f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..137:#0 0x000014c55fbe09b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..138:#0 0x0000145f0b07df7c in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..138:#0 0x0000148be6cd7284 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..138:#0 0x000014820c7669b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_mutex_lock_noop (lock=0x4f54308) at ./include/ofi_lock.h:295
out192ReadHang..139:#0 0x00001507266179b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_mutex_lock_noop (lock=0x4f54308) at ./include/ofi_lock.h:295
out192ReadHang..139:#0 0x00001507266179b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_mutex_lock_noop (lock=0x51954d8) at ./include/ofi_lock.h:295
out192ReadHang..139:#0 0x000014ebddbea9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 0x000014584c42b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0 ofi_cq_read (cq_fid=0x51f7420, buf=0x7ffc6d748f40, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..14:#0 0x00001504aef730ed in ofi_cq_readfrom (cq_fid=0x521d5b0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..14:#0 0x000014abcff8b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..14:#0 0x000014a9f79359b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..140:#0 0x000014601de458a5 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..140:#0 0x000014d2a0981506 in cxip_cq_eq_progress (eq=0x402f370, cq=0x402f250) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..140:#0 0x0000148847010d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..140:#0 0x0000151cca47baf2 in cxip_ep_ctrl_eq_progress (ep_obj=0x4a36e20, ctrl_evtq=0x4a1e3b8, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..140:#0 0x00001517b777e0ed in ofi_cq_readfrom (cq_fid=0x4883290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..141:#0 cxip_util_cq_progress (util_cq=0x49f5250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..141:#0 0x000014c7e60772c1 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..141:#0 0x00001553157600ed in ofi_cq_readfrom (cq_fid=0x52315f0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..141:#0 0x000014854741ad19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..142:#0 0x000014c9e471d563 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..142:#0 cxip_ep_ctrl_progress (ep_obj=0x5155490) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..143:#0 0x000014b92e597b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..143:#0 0x000014d6cbadd0ed in ofi_cq_readfrom (cq_fid=0x59f9420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..143:#0 0x00001495a5b2c28a in ofi_cq_readfrom (cq_fid=0x5ab1290, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..144:#0 0x000014a0ed85477b in update_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..144:#0 ofi_genlock_lock (lock=0x555c300) at ./include/ofi_lock.h:359
out192ReadHang..144:#0 0x00001492108854e7 in cxip_cq_progress (cq=0x47f2290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..145:#0 0x0000148d9aadb274 in ofi_genlock_unlock (lock=0x42bb2c0) at ./include/ofi_lock.h:364
out192ReadHang..145:#0 0x00001490925b9da8 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..146:#0 cxip_cq_eq_progress (eq=0x52c6710, cq=0x52c65f0) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..146:#0 0x00001537bf0cac30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..146:#0 0x0000151f8dfbe9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0 0x0000154842f493e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..147:#0 0x000014b3381cf230 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..147:#0 0x00001537706640ed in ofi_cq_readfrom (cq_fid=0x5386b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..147:#0 0x000014b3381cf230 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..147:#0 0x00001537706640ed in ofi_cq_readfrom (cq_fid=0x5386b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..147:#0 cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..147:#0 0x0000145fb2dbd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0 0x0000151c8ca649b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0 cxip_util_cq_progress (util_cq=0x4700290) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..147:#0 0x000014dcbd38f919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..148:#0 0x00001525e3c75919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..148:#0 0x0000154e907c2506 in cxip_cq_eq_progress (eq=0x5535710, cq=0x55355f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..148:#0 ofi_cq_read (cq_fid=0x560cb40, buf=0x7fff4a0ce600, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..148:#0 0x0000145bc14469b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..148:#0 cxip_ep_ctrl_eq_progress (ep_obj=0x47a9490, ctrl_evtq=0x4774338, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..148:#0 0x00001540e99bf0ed in ofi_cq_readfrom (cq_fid=0x45a7b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..149:#0 0x000015076feb7506 in cxip_cq_eq_progress (eq=0x5dc8be0, cq=0x5dc8ac0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..149:#0 0x000014fc837bf0ed in ofi_cq_readfrom (cq_fid=0x52f4460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..149:#0 0x00001521f95079b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..15:#0 0x000014bdf437b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..15:#0 0x00001522c0d26539 in cxip_cq_eq_progress (eq=0x5dd95f0, cq=0x5dd94d0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..150:#0 0x000014ffc7fe8175 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..150:#0 0x00001514ff30d9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..150:#0 0x000014f9650f2919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..151:#0 0x000014f2c85f99b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..151:#0 0x0000150652a20da8 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..151:#0 0x0000149a60fbbaf4 in cxip_ep_ctrl_eq_progress (ep_obj=0x5709de0, ctrl_evtq=0x56f1378, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..151:#0 0x000014cdafa660ed in ofi_cq_readfrom (cq_fid=0x50a3ae0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..151:#0 0x000015092ec909b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..152:#0 0x000014c7f60d8517 in cxip_cq_eq_progress (eq=0x53903b0, cq=0x5390290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..152:#0 0x00001526ae2886e2 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..152:#0 0x0000153b51553301 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..152:#0 ofi_cq_read (cq_fid=0x46bc3a0, buf=0x7ffe05a57400, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..152:#0 0x000014906d3800ed in ofi_cq_readfrom (cq_fid=0x4ff0420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
```
Getting it down to a page of what I perceive to be the most likely suspects (you can see what I have taken my eye off in the grep -v list):
```
kjansen@aurora-uan-0009:~> grep -v pthread 809DoingWhat.log |grep -v ofi_cq|grep -v MPIDI_progress_test |grep -v cxip_ep_ctrl_progress |grep -v cxi_eq_peek_event |grep -v cxip |grep -v noop |grep -v lib64 |grep -v ofi_ |grep -v MPIDU_genq_shmem_queue_dequeue |grep -v MPIDI_OFI_gpu_progress_task
out192ReadHang..110:#0 0x000014c9c7dc1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..12:#0 0x000014c89cc6aca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..153:#0 0x0000148f14a3cc90 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..175:#0 0x000014fc86d2eca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..176:#0 0x000014d58561abb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..180:#0 0x000014685f2c9cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..24:#0 0x000014dae7854b34 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..38:#0 0x000014ca202b6b49 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..44:#0 0x000014d74b4efcce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..46:#0 0x0000151c2f05dc90 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..47:#0 0x000015276af2ab34 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..50:#0 0x000014cbc3d84ad0 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..51:#0 0x000014d545d91b53 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..55:#0 0x00001501093b2bb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..58:#0 0x000014f9492dabb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..73:#0 0x000014fd5f5e1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..85:#0 0x0000153b53336b30 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
__tls_get_addr has #2 0x0000150109daceda in MPIR_Allreduce, as do MPIDI_POSIX_eager_recv_begin and MPIR_Progress_hook_exec_all, so perhaps that means one of the grep -v terms is the villain. I don't know these functions, so I am giving up and letting more knowledgeable people sift the haystack for the needle.
The shmem and pthreads parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading scheme, not rank-to-rank MPI. I wonder if it's possible to turn off multithreading or the shared-memory parallelism, although I'm not sure what performance hit that would cause.
96 nodes were able to read that same file, run 1000 steps, and write a new one (297 GB in 12.4 seconds according to PETSc VecView timers).
Currently running a second attempt at 192 nodes to try to read the file 96 nodes just wrote. It looks like it succeeded this time (past all CGNS reads, I think). Not sure if this is because CGNS really wrote the file this time (last time it was a cp of a CGNS-written file within a directory whose stripe count had been changed from 16, which it was written with, to -1), or if we are just at the limits of reliability of CGNS + Lustre and encountering the "locking and contention" issues that motivated subfiling in papers.
But we won't get write performance numbers from that run because....
ping failed on x4217c6s6b0n0: No reply from x4309c7s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov after 97s
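Since HDF5 subfiling keeps coming up as the suggested mitigation for those Lustre "locking and contention" issues, here is a minimal sketch of opting into it, assuming an HDF5 >= 1.14 built with the subfiling VFD enabled (the function and header names are HDF5's; whether the PETSc-built HDF5 enables this feature is an open question):

```c
#include <hdf5.h>
#include <H5FDsubfiling.h>

/* Build a file-access property list that routes I/O through the
 * subfiling VFD. Passing NULL asks for the default subfiling
 * configuration; MPI must be initialized before the fapl is used. */
hid_t make_subfiling_fapl(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_subfiling(fapl, NULL);
    return fapl;
}
```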
The shmem and pthreads parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading scheme, not rank-to-rank MPI. I wonder if it's possible to turn off multithreading or the shared-memory parallelism, although I'm not sure what performance hit that would cause.
This raises an interesting question: my Aurora runs have been with 12 ranks per node because the flow solve is entirely on the GPU and the CPUs are "assisting" that phase, but they are of course primary for IO and problem setup.
When one of those 12 CPU processes on a given node calls CGNS (and on down to HDF5), is it trying to use threads to gain more parallelism in the read/write, or is it sticking to one thread per MPI process?
A related question: is it relevant to debug this with the solver in CPU mode (not taking many steps, because it will be slower) and use the full 104 Sapphire Rapids cores? This would get us into interesting (failing???) MPI process counts with far fewer nodes, as there are 8.67 processes for every "tile" (104 vs 12), so we can get to 10k processes with 96 nodes or 20k with 192. That said, earlier when we were debugging, I was getting hangs when I tried to use 48 processes per node on this problem. The nodes have a TB of memory, so I don't think I was exhausting that.
Back to the original error in this ticket, just documenting a brief code dive:
mismatch in number of children and child IDs read
comes from src/cgns_internals.c:cgi_get_nodes, which is called from cg_open to search for the CGNSLibraryVersion_t node. The mismatch it's talking about is the difference between src/cgns_io.c:cgio_number_children (which counts the number of child nodes of the root node) and src/cgns_io.c:cgio_children_ids (which gets the actual id numbers for those children).
Both cgio_number_children and cgio_children_ids call H5Literate2 (key word there is "iterate", not "literate"), which simply loops over the child nodes of the HDF5 group given to it and runs a callback on each one. Frankly, I'm not sure how these could disagree with each other. But it'd be interesting to augment the error message to see what it thinks those children numbers are and compare them with the actual file (which, iirc, should have only 2 or 3).
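To make that concrete, here is a minimal sketch (not CGNS's actual code, just the same H5Literate2-style counting pattern) augmented to print each child's name so the tally can be compared against the actual file; fid is assumed to be an already-open HDF5 file id:

```c
#include <stdio.h>
#include <hdf5.h>

/* Callback invoked once per child link: bump the counter and print the
 * child's name so the count can be checked against the file contents. */
static herr_t count_cb(hid_t group, const char *name,
                       const H5L_info2_t *info, void *op_data)
{
    (void)group; (void)info;            /* unused here */
    printf("child %d: %s\n", ++*(int *)op_data, name);
    return 0;                           /* 0 = keep iterating */
}

/* Count the immediate children of the root node via the same kind of
 * iteration that cgio_number_children and cgio_children_ids perform. */
int count_root_children(hid_t fid)
{
    int     count = 0;
    hsize_t idx   = 0;
    if (H5Literate_by_name2(fid, "/", H5_INDEX_NAME, H5_ITER_INC,
                            &idx, count_cb, &count, H5P_DEFAULT) < 0)
        return -1;
    return count;
}
```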
A second attempt at 192 nodes with the 96-node-written input revives our original error for this thread:
[939]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "child IDs" JZ192Nodes1215_240108.o622110 |wc
24 336 1944
so 2 nodes' worth of processes (24 failing processes at 12 per node) likely found a bad path to Lustre?
When one of those 12 CPU processes on a given node calls CGNS (and on down to HDF5), is it trying to use threads to gain more parallelism in the read/write, or is it sticking to one thread per MPI process?
I believe it's running multiple threads per MPI process. I can't think of another reason why pthread_spin_lock would be called instead of MPI_Wait if each rank were single-threaded.
I am hopefully not jinxing it, but so far larger process counts are more successful with the minus-one striping choice. The 1536-node case has not run yet, but the 768-node case read and wrote correctly all four times:
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> du -sh *-4[6-9]000.cgns |head
197G Q2fromQ1_21k-46000.cgns
197G Q2fromQ1_21k-47000.cgns
197G Q2fromQ1_21k-48000.cgns
197G Q2fromQ1_21k-49000.cgns
425M stats-46000.cgns
425M stats-47000.cgns
425M stats-48000.cgns
425M stats-49000.cgns
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ768Nodes1215_240108.o*
JZ768Nodes1215_240108.o620924:VecView 2 1.0 1.7643e+01 1.0 1.43e+06 2.0 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 720 7269759248 0 0.00e+00 0 0.00e+00 97
JZ768Nodes1215_240108.o620925:VecView 2 1.0 1.8186e+01 1.0 1.43e+06 2.1 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 699 7277891282 0 0.00e+00 0 0.00e+00 97
JZ768Nodes1215_240108.o620926:VecView 2 1.0 1.3742e+01 1.0 1.43e+06 1.7 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 925 7634409062 0 0.00e+00 0 0.00e+00 97
JZ768Nodes1215_240108.o620927:VecView 2 1.0 1.3611e+01 1.0 1.43e+06 2.1 3.5e+05 2.1e+04 3.0e+01 1 0 0 0 0 1 0 0 0 0 933 7258857984 0 0.00e+00 0 0.00e+00 97
We don't have timers on the reader yet, but you can see the writer is called twice, once for a big file and once for a small file. There is some variation in the performance, but 13 to 18 seconds is more than acceptable, I think, for a large plus a small file. I am not sure when the 1536-node case will be picked up, as there are only 2048 nodes as far as I know.
My second battery of jobs is running and so far still so good, with no read or write failures. Since we didn't really change code and only changed the Lustre striping, I think we have to attribute this behavior to the Lustre striping, or perhaps to luck in not getting bad nodes. Will keep you posted as more data is obtained.
That said, we still don't have any data from 1536 nodes. I am wondering if there are even 1536 nodes up.
With help from Tim Williams, the mystery of why my 1536-node jobs are not running is resolved:
kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ> /home/zippy/bin/pu_nodeStat EarlyAppAccess
PARTITION: LustreApps (EarlyAppAccess)
------------------------
Nodes Status
----- ------
1 down
17 free
1417 job-exclusive
417 offline
0 state-unknown
18 state-unknown,down
2 state-unknown,down,offline
0 state-unknown,offline
0 broken
----- --------------
1872 Total Nodes
I have queued four 1152-node jobs, but of course they will take a while to get enough priority to run (and for the machine to empty out, as I am not sure they have a drain for large jobs with priority at this point anyway).
WOOHOOO. We are running on 1124 nodes (13,488 tiles) and thus have finally broken the 10k-GPU barrier (previously CGNS+HDF5+Lustre were erroring out on the read of our inputs).
No code change. It is either the Lustre striping set to -1 (32 stripes is the max, I think) OR they finally pulled the bad nodes that could not talk properly to the Lustre file system out of service.
I had to qalter my job's node request down to what Tim's script said was available this morning to get it to go. They have a large-job queue problem: I suspect the machine drained all night trying to recover the mysteriously missing 200+ nodes in the job-exclusive category of that script. I have seen this many times on new machines, so I just got around it with qalter.
Job appears to have run the requested 1k steps and written to CGNS correctly as well.
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ1536Nodes1215_240108.o621462
VecView 2 1.0 2.2446e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01 2 0 0 0 0 2 0 0 0 0 574 7316723234 0 0.00e+00 0 0.00e+00 97
VecView 2 1.0 2.2786e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01 2 0 0 0 0 2 0 0 0 0 565 7383870184 0 0.00e+00 0 0.00e+00 97
VecView 2 1.0 2.4723e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01 2 0 0 0 0 2 0 0 0 0 521 7377604385 0 0.00e+00 0 0.00e+00 97
So the job ran three times with the same inputs and ran out of time on the 4th. That is 4/4 successful reads and 3/3 successful writes. Note that the log file is misnamed, since I qaltered the node count to what was available (1124).
22-24 seconds is about half the rate we were getting at lower node counts (O(12) seconds), but still not bad.
What stripe size are you using? You might try setting the alignment in HDF5 to the Lustre stripe size: http://cgns.github.io/CGNS_docs_current/midlevel/fileops.html, CG_CONFIG_HDF5_ALIGNMENT.
Are you doing independent or collective IO? What are those numbers in terms of bandwidth? Do you have the Darshan logs?
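For concreteness, here is a hedged sketch of the HDF5-level call that the CG_CONFIG_HDF5_ALIGNMENT option controls; the 1 MiB alignment is an assumption matching the lmm_stripe_size shown by lfs getstripe below, not a tuned recommendation:

```c
#include <hdf5.h>

/* Build a file-access property list whose allocations are aligned to
 * the Lustre stripe size. H5Pset_alignment(fapl, threshold, alignment)
 * aligns any object of at least `threshold` bytes on an `alignment`
 * boundary in the file. */
hid_t make_aligned_fapl(void)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_alignment(fapl, 64 * 1024,   /* only align objects >= 64 KiB */
                     1024 * 1024);      /* 1 MiB = reported stripe size */
    return fapl;
}
```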
@jedbrown and @jrwrigh might have better answers but I will share what I know:
Stripe size: we have only set the stripe count. We initially did
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs setstripe -c 16 .
but then went to -1. There was no direct setting of the stripe size, but I guess we can see what I got:
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs getstripe Q2fromQ1_21k-42000.cgns
Q2fromQ1_21k-42000.cgns
lmm_stripe_count: 16
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
0 1168972 0x11d64c 0xa80000405
3 1167709 0x11d15d 0x980000405
1 1169071 0x11d6af 0xc80000403
11 1172051 0x11e253 0x900000405
103 4342256 0x4241f0 0x440000bd1
14 1172388 0x11e3a4 0xbc0000404
7 1168387 0x11d403 0xb40000405
10 1167154 0x11cf32 0xa40000405
15 1169510 0x11d866 0xb00000405
5 1168023 0x11d297 0x940000405
2 1170305 0x11db81 0xac0000405
4 1168529 0x11d491 0xc00000404
12 1167890 0x11d212 0x9c0000405
6 1169015 0x11d677 0xcc0000405
8 1166910 0x11ce3e 0xa00000405
109 3762900 0x396ad4 0x600000bd2
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs setstripe -c -1 .
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> ls -alt |head -5
total 9900951924
-rw-r--r-- 1 kjansen PHASTA_aesp_CNDA 656391 Feb 24 19:01 JZ192Nodes1215_240108.o622004
-rw-r--r-- 1 kjansen PHASTA_aesp_CNDA 445378519 Feb 24 19:01 stats-44000.cgns
drwxr-sr-x 79 kjansen PHASTA_aesp_CNDA 73728 Feb 24 19:01 .
-rw-r--r-- 1 kjansen PHASTA_aesp_CNDA 211362088776 Feb 24 19:01 Q2fromQ1_21k-44000.cgns
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> mv Q2fromQ1_21k-44000.cgns Q2fromQ1_21k-44000.cgns_asWritten
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> cp Q2fromQ1_21k-44000.cgns_asWritten Q2fromQ1_21k-44000.cgns
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs getstripe Q2fromQ1_21k-44000.cgns
Q2fromQ1_21k-44000.cgns
lmm_stripe_count: 32
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 1169095 0x11d6c7 0xc80000403
110 3835167 0x3a851f 0x680000bd4
3 1167735 0x11d177 0x980000405
13 1170984 0x11de28 0xc40000404
15 1169535 0x11d87f 0xb00000405
0 1168998 0x11d666 0xa80000405
7 1168413 0x11d41d 0xb40000405
5 1168047 0x11d2af 0x940000405
11 1172072 0x11e268 0x900000405
10 1167181 0x11cf4d 0xa40000405
102 4182240 0x3fd0e0 0x4c0000bd1
9 1167850 0x11d1ea 0xb80000405
14 1172410 0x11e3ba 0xbc0000404
4 1168554 0x11d4aa 0xc00000404
2 1170329 0x11db99 0xac0000405
6 1169039 0x11d68f 0xcc0000405
8 1166937 0x11ce59 0xa00000405
104 3737545 0x3907c9 0x500000bd1
12 1167916 0x11d22c 0x9c0000405
105 3952709 0x3c5045 0x640000bd1
106 3894464 0x3b6cc0 0x6c0000bd1
112 3842721 0x3aa2a1 0x400000bd4
107 3878829 0x3b2fad 0x5c0000bd1
108 3862786 0x3af102 0x540000bd2
113 3777100 0x39a24c 0x480000bd4
100 3753931 0x3947cb 0x780000bd1
115 3744821 0x392435 0x740000bd4
114 3779125 0x39aa35 0x7c0000405
111 3672239 0x3808af 0x580000bd1
103 4342896 0x424470 0x440000bd1
109 3763538 0x396d52 0x600000bd2
101 3689664 0x384cc0 0x700000bd1
Jed wrote the writer. I think it is independent, if by that you mean each rank is writing its segment through the parallel routines. I modified the existing reader to read in parallel (assuming that is what you mean by independent?).
Two files are written. The first is 197 GB. The second is much smaller, but I suspect it adds latency that distorts the bandwidth numbers a tad. If I do the math right, 197 GB / 22 sec is about 9 GB/s.
I don't know where to find or how to access Darshan logs, but I know Jed said that Rob Latham would help us take a look at them. I think he is trying to schedule a meeting for that.
I calculated 15-20 GB/s on the smaller node counts (96 or 192), so we're seeing about half that here on 1124 nodes. The stripe size is default, which looks like 1 MiB. We use collective IO. I'm working with Rob Latham to get Darshan logs (it's not "supported" on Aurora yet).
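For readers following along, the independent-vs-collective distinction lives on the HDF5 dataset transfer property list; a minimal sketch of that mechanism (not necessarily the exact path PETSc's CGNS code takes):

```c
#include <hdf5.h>

/* Build a dataset transfer property list requesting collective MPI-IO.
 * Requires an HDF5 built with --enable-parallel. Swapping in
 * H5FD_MPIO_INDEPENDENT gives each rank its own I/O path instead of
 * letting MPI-IO aggregate the ranks' requests. */
hid_t make_collective_dxpl(void)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    return dxpl;
}
```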
According to ALCF, @brtnfld now has access to these project directories. Let me know if you need any orientation beyond the directories and file names I pasted above.
While we have not completed a parallel boundary condition reader/writer, we are able to get everything else (coordinates, volume connectivity, surface connectivity, and solution at least in linear and quadratic meshes) both read and written (as needed to checkpoint and restart) in parallel. The performance is acceptable.
We have had no problems on 96, 192, or 384 nodes, each running 12 processes per node. On the read side we have also had no problems on 768 nodes. The write of the volume solution has never failed. We write a separate file for the spanwise average, and that write is failing about half of the time, but so far we suspect the Lustre file system is the source of that.
To get to the real point of the issue: when we went to 1536 nodes, or 18432 processes, we were suddenly unable to read the data file written by lower process counts, even though that same file can still be read by lower process counts. The error we get is
CGNS error 1 mismatch in number of children and child IDs read
To be clear, this is on Aurora. The mesh has about 2B nodes (a quadratic mesh with about 250M hexahedra). Given that the mesh can be read by 96, 192, 384, and 768 nodes, we are pretty confident there is no problem with the file, but we suspect we may be getting into the range where we need to do something more specialized to handle this process count. Our goal is full-machine Aurora runs, which would have about 6.5x larger process counts.
DAOS is still a work in progress and we are not yet really able to use it, so, for now, we (and I think almost everyone else) are using the Lustre file system, which I suspect is meant to be a placeholder, backup, or stopgap solution. While we welcome help regarding DAOS, for now, help getting this file read with Lustre would be most welcome.