CGNS / CGNS

The CFD General Notation System (CGNS) provides a standard for recording and recovering computer data associated with the numerical solution of fluid dynamics equations. All development work and bug fixes should be based off the 'develop' branch; CGNS uses the Gitflow branching model.
http://cgns.org/

Read failure on 18,432 processes reports: CGNS error 1 mismatch in number of children and child IDs read #753

Open KennethEJansen opened 8 months ago

KennethEJansen commented 8 months ago

While we have not completed a parallel boundary condition reader/writer, we are able to get everything else (coordinates, volume connectivity, surface connectivity, and solution at least in linear and quadratic meshes) both read and written (as needed to checkpoint and restart) in parallel. The performance is acceptable.

We have had no problems on 96, 192, or 384 nodes, each running 12 processes. On the read side we have had no problems on 768 nodes either. The write of the volume solution has also never failed. We write a separate file for the spanwise average, and that write is failing about half of the time, but so far we suspect the Lustre file system is the source of that.

To get to the real point of the issue: when we went to 1536 nodes, or 18432 processes, we are suddenly unable to read the data file written at lower process counts (a file that lower process counts can still read). The error we get is

CGNS error 1 mismatch in number of children and child IDs read

To be clear, this is on Aurora. The mesh has about 2B nodes (a quadratic mesh with about 250M hexahedra). Given that the mesh can be read on 96, 192, 384, and 768 nodes, we are pretty confident there is no problem with the file, but we suspect we may be getting into the range where we need to do something more specialized to handle this process count. Our goal is full-machine Aurora runs, which would have about 6.5x larger process counts.

DAOS is still a work in progress and we are not yet able to really use it, so for now we (and I think almost everyone else) are using the Lustre file system, which I suspect is meant to be a placeholder, backup, or stopgap solution. While we welcome help regarding DAOS, for now help getting this file read with Lustre would be welcome.

KennethEJansen commented 8 months ago

@jedbrown @jrwrigh feel free to describe anything that @brtnfld and others developing CGNS might need to know that I missed in the above description

brtnfld commented 8 months ago

Do only a few ranks print that message? Some ranks may not have anything to read at that scale, and we may not have the correct bail-out for that situation.

I've run CGNS with 43k ranks with no issue, but that was with 43k ranks reading a file created with 43k ranks. Compiling CGNS with -DADFH_DEBUG_ON, or uncommenting #define ADFH_DEBUG_ON in ADFH.c, might help with the diagnostics. However, that will produce a ton of output at that number of ranks. It might be helpful to determine the smallest rank count at which the problem occurs.

If you can provide me access to the file on Aurora, I can look into it. If you have a simple reproducer, that would also help.

Let me know when you get to the DAOS phase, as CGNS will need the fixes mentioned in #613. I will try to get the fixes in branch CGNS218 into develop. If you continue with Lustre, you will likely want to consider using HDF5 subfiling.

KennethEJansen commented 8 months ago

Thanks for the advice. I did not realize that CGNS did anything differently when reading a file with m processes that was written by n processes when n is not equal to m. I thought there was no concept of prior partition.

Answering your first question

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "CGNS error 1 mismatch" JZ1536Nodes0515_240108.o618823 |wc 
    48     672    3984

suggests that only 48 of the 18432 processes that I expected to participate in reading the file (that is, at least, how I chunked it out on each read line) are reporting this error, but this also relies on PETSc error reporting, e.g.

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "14163" JZ1536Nodes0515_240108.o618823 |grep -v ": -" |grep -v  ":-"
[14163]PETSC ERROR: Error in external library
[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14163]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[14163]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[14163]PETSC ERROR: Petsc Development GIT revision: v3.19.5-1858-g581ad989054  GIT Date: 2024-02-12 14:59:06 -0700
[14163]PETSC ERROR: Configure options --with-debugging=0 --with-mpiexec-tail=gpu_tile_compact.sh --with-64-bit-indices --with-cc=mpicc --with-cxx=mpicxx --with-fc=0 --COPTFLAGS=-O2 --CXXOPTFLAGS=-O2 --FOPTFLAGS=-O2 --SYCLPPFLAGS=-Wno-tautological-constant-compare --SYCLOPTFLAGS=-O2 --download-kokkos --download-kokkos-kernels --download-kokkos-commit=origin/develop --download-kokkos-kernels-commit=origin/develop --download-hdf5 --download-cgns --download-metis --download-parmetis --download-ptscotch=../scotch_7.0.4beta3.tar.gz --with-sycl --with-syclc=icpx --with-sycl-arch=pvc --PETSC_ARCH=05-15_RB240108_B_JZ
[14163]PETSC ERROR: #1 DMPlexCreateCGNSFromFile_Internal() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/cgns/plexcgns2.c:187
[14163]PETSC ERROR: #2 DMPlexCreateCGNSFromFile() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcgns.c:29
[14163]PETSC ERROR: #3 DMPlexCreateFromFile() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:5921
[14163]PETSC ERROR: #4 DMPlexCreateFromOptions_Internal() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:3943
[14163]PETSC ERROR: #5 DMSetFromOptions_Plex() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/impls/plex/plexcreate.c:4465
[14163]PETSC ERROR: #6 DMSetFromOptions() at /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-fork0515006/src/dm/interface/dm.c:905
[14163]PETSC ERROR: #7 CreateDM() at /lus/gecko/projects/PHASTA_aesp_CNDA/libCEED_0515006_240108_JZ/examples/fluids/src/setupdm.c:36
[14163]PETSC ERROR: #8 main() at /lus/gecko/projects/PHASTA_aesp_CNDA/libCEED_0515006_240108_JZ/examples/fluids/navierstokes.c:159
[14163]PETSC ERROR: PETSc Option Table entries:
Abort(76) on node 14163 (rank 0 in comm 16): application called MPI_Abort(MPI_COMM_SELF, 76) - process 0

so there are 48 ranks that report the CGNS error

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "CGNS error 1 mismatch" JZ1536Nodes0515_240108.o618823
[14163]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14174]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14164]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14175]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14167]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14179]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14168]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14180]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14182]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14171]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14172]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14160]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14173]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14161]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14162]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14176]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14177]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14178]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14165]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14181]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14166]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14183]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14169]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14170]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14184]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14185]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14186]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14196]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14187]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14197]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14188]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14198]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14189]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14199]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14190]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14200]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14191]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14201]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14192]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14202]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14193]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14203]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14194]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14204]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14195]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14205]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14206]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read
[14207]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read

I have not sorted these, but the tight rank range suggests it might be just 4 nodes, each with 12 processes, that are not happy? @jedbrown or @jrwrigh will know better, but, for example

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "13163" JZ1536Nodes0515_240108.o618823 

Returns nothing, so it seems the other processes (the other 1532 nodes, if these 48 ranks are indeed all on the same 4 nodes) are not getting this error, or at least are not reporting it through PETSc.

Can you explain "Some ranks may not have something to read at that scale"? I think we spread out the node and element read range evenly, even for boundary elements (which was a rendezvous headache to get them back on ranks that have better correspondence to their node range).

Thanks for the flags to get more verbose output even if it will be a huge haystack to sift for the needle.

Since it works fine at 96, 192, 384, and 768 nodes, I don't know how to make a small reproducer.

If you have an account on Aurora, give me your username and I will ask support to add you to our group as these files are in our group space and readable by anyone I add.

Can you point us to information on HDF5 subfiling? This might be more promising than debugging a case that is beyond the limits of Lustre.

brtnfld commented 8 months ago

You are correct. By default, there are no rank dependencies for a CGNS file unless an application introduces such dependencies, such as different zones for each rank.

I wanted to know whether the data was partitioned for the larger-scale case in such a way that some ranks might not have any data to read.

Do you always need to double the nodes for the next rank jump? For example, can't you run with 576 nodes?

General subfiling info is here: https://github.com/HDFGroup/hdf5doc/blob/master/RFCs/HDF5_Library/VFD_Subfiling/user_guide/HDF5_Subfiling_VFD_User_s_Guide.pdf

I've not merged the CGNS "subfiling" branch into develop. I've tested it on Summit and Frontier and will have some Aurora results shortly. I still need to document its usage and best practices.
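
Roughly, at the HDF5 level (outside of CGNS, which does not expose this until that branch lands), selecting the subfiling VFD looks like the sketch below. This is a minimal sketch, assuming HDF5 1.14+ built with subfiling enabled; the file name is a placeholder and error checking is omitted.

```c
/* Minimal sketch (not the CGNS path): open a file through the HDF5 subfiling VFD.
 * Assumes HDF5 >= 1.14 configured with subfiling support; no error checking. */
#include <mpi.h>
#include <hdf5.h>
#include <H5FDsubfiling.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_mpi_params(fapl, MPI_COMM_WORLD, MPI_INFO_NULL); /* MPI comm/info used by the VFD */
    H5Pset_fapl_subfiling(fapl, NULL);                      /* NULL -> default subfiling configuration */

    hid_t file = H5Fopen("restart.h5", H5F_ACC_RDONLY, fapl); /* placeholder file name */
    /* ... dataset reads proceed as usual ... */
    H5Fclose(file);

    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```

If I recall the user guide correctly, the driver can also be selected at run time through the HDF5_DRIVER environment variable, which may be the less invasive route when HDF5 sits underneath PETSc and CGNS.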

If you list "ls -tr" home on Aurora, my username is obvious. Otherwise, I can send it to you offline.

Which version of CGNS and HDF5 are you using?

KennethEJansen commented 8 months ago

Thanks again for the response.

Our file is "flat" in the sense that it is a single zone and we are expecting all ranks to read a range of the data that is size/nranks.

No requirement to double; it's just what I usually do. In any event, 1536 is not as big as we want to go, but sure, we can try 1024 or any other number between the 768 that works and the 1536 that fails.

Thanks for the link and the status, and yes, I am eager for documentation on its usage and best practices, as I am very much a CGNS newbie (who dove into exascale usage as the first experience).

I will find your username and request that you be added to our projects shortly.

KennethEJansen commented 8 months ago

The request for you to be added has been sent, but no response yet, so it might be a while. In the interim, @jedbrown suggested

lfs setstripe -c 16 .

to set the directory's Lustre striping, then copying the file so that it picks up those properties, and we are testing to see if that improves things. Do you have any advice on whether those are the best settings for Aurora?

brtnfld commented 8 months ago

A stripe count of 16 is a good starting point; I've seen good results on Frontier with a stripe count of 64 and a stripe size of 16 MiB.

Which version of HDF5 are you using?

KennethEJansen commented 8 months ago

In the spirit of push-it-until-it-breaks mode, @jedbrown suggested a stripe count of -1 (stripe across all OSTs), and this produced a hang with 192 nodes (each with 12 processes) reading a file written originally with a stripe count of 16 but then "copied" after setting the directory to -1:

#0  0x00001502efe3cee1 in MPIR_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#1  0x00001502ef445c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#2  0x00001502f38a88f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
No symbol table info available.
#3  0x00001502e3855c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#4  0x00001502e35fce32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#5  0x00001502e35e3430 in H5F_shared_select_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#6  0x00001502e35902bf in H5D__contig_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#7  0x00001502e35a4c7b in H5D__read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#8  0x00001502e380f7ec in H5VL__native_dataset_read () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#9  0x00001502e37fade3 in H5VL_dataset_read_direct () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#10 0x00001502e357514e in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#11 0x00001502e3574c9d in H5Dread () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
No symbol table info available.
#12 0x00001502e39f4cde in readwrite_data_parallel () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libcgns.so.4.3
No symbol table info available.
#13 0x00001502e39f601a in cgp_elements_read_data () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libcgns.so.4.3
No symbol table info available.
#14 0x000015030dc421e0 in DMPlexCreateCGNS_Internal () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#15 0x000015030daaacc6 in DMPlexCreateCGNS () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#16 0x000015030dc418b5 in DMPlexCreateCGNSFromFile_Internal () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#17 0x000015030daaac76 in DMPlexCreateCGNSFromFile () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#18 0x000015030daccfd0 in DMPlexCreateFromFile () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#19 0x000015030dad4fc7 in DMSetFromOptions_Plex () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#20 0x000015030d9435f9 in DMSetFromOptions () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libpetsc.so.3.020
No symbol table info available.
#21 0x000000000047d99d in CreateDM ()
No symbol table info available.
#22 0x000000000040c2b1 in main ()
No symbol table info available.
[Inferior 1 (process 4399) detached]
[New LWP 4476]
[New LWP 4488]

I have 12 of these in each of the 192 node files for us to digest.

Answering your question, PETSc "chooses" the version of HDF5 and it is: hdf5-1.14.3-p1

or from the configure log

============================================================================================= Trying to download https://web.cels.anl.gov/projects/petsc/download/externalpackages/hdf5-1.14.3-p1.tar.bz2 for HDF5

                    install: Retrieving https://web.cels.anl.gov/projects/petsc/download/externalpackages/hdf5-1.14.3-p1.tar.bz2 as tarball to /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_B/externalpackages/_d_hdf5-1.14.3-p1.tar.bz2

/hdf

KennethEJansen commented 8 months ago
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep MPIR_Allreduce |wc
   1496   10472  251977

so that leaves 808 processes (12*192-1496) doing something else.

KennethEJansen commented 8 months ago

For reasons we are still sorting out, we also seem to get 12 control processes; filtering those out by what they all seem to be doing at backtrace frame #0, which is wait4:

kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4  |head
out192ReadHang..0:#0  0x00001471d09030a9 in poll () from /lib64/libc.so.6
out192ReadHang..0:#0  ofi_genlock_lock (lock=0x4d63490) at ./include/ofi_lock.h:359
out192ReadHang..0:#0  0x0000148ddc7403e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..0:#0  0x000015230a8dc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..1:#0  0x000014abd7887d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..1:#0  0x0000150ec7027af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x4981620, ctrl_evtq=0x4947618, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0  cxip_ep_ctrl_eq_progress (ep_obj=0x4ec9900, ctrl_evtq=0x4ed3698, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0  0x0000152af605d0ed in ofi_cq_readfrom (cq_fid=0x5ce1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..1:#0  cxip_ep_ctrl_progress (ep_obj=0x5583110) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..10:#0  0x0000146f67dd2b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
kjansen@aurora-uan-0009:~> grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4  |wc
    809    6030  100729
Diving in on the variation of states for node 0 to see if that tells us anything (here "us" means somebody else, because if I understood it I would not be dumping all this stuff here):
```
kjansen@aurora-uan-0009:~> grep "#1 " out192ReadHang..0
#1  0x00001471d0a5d0dd in zmq_poll () from /usr/lib64/libzmq.so.5
#1  0x0000560229cbcce6 in ?? ()
#1  0x00005593736bcce6 in ?? ()
#1  0x000055bf552bcce6 in ?? ()
#1  0x000055807fabcce6 in ?? ()
#1  0x0000564af9abcce6 in ?? ()
#1  0x000055e712ebcce6 in ?? ()
#1  0x000055f4986bcce6 in ?? ()
#1  0x0000561f4f2bcce6 in ?? ()
#1  0x0000557751cbcce6 in ?? ()
#1  0x00005637044bcce6 in ?? ()
#1  0x00005582ecebcce6 in ?? ()
#1  0x000055a2240bcce6 in ?? ()
#1  0x00001502ef445c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x000015532c9b7c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x00001502c757bc7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x00001554066e9c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x0000152e8cd46c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x000014d1302c9c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x0000148e12f7ec7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  ofi_cq_readfrom (cq_fid=0x4d63420, buf=0x7ffee0d1f2e0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:229
#1  0x0000148ddc75627b in ofi_genlock_unlock (lock=0x4764660) at ./include/ofi_lock.h:364
#1  0x00001520e3342c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x00001524c1fa1c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#1  0x00001523066c84d0 in cxip_cq_progress (cq=0x45b4610) at prov/cxi/src/cxip_cq.c:545

kjansen@aurora-uan-0009:~> grep "#2 " out192ReadHang..0
#2  0x0000562828c04fc4 in event_loop () at src/mpiexec/mpiexec.c:949
#2  0x0000560229cbf526 in wait_for ()
#2  0x00005593736bf526 in wait_for ()
#2  0x000055bf552bf526 in wait_for ()
#2  0x000055807fabf526 in wait_for ()
#2  0x0000564af9abf526 in wait_for ()
#2  0x000055e712ebf526 in wait_for ()
#2  0x000055f4986bf526 in wait_for ()
#2  0x0000561f4f2bf526 in wait_for ()
#2  0x0000557751cbf526 in wait_for ()
#2  0x00005637044bf526 in wait_for ()
#2  0x00005582ecebf526 in wait_for ()
#2  0x000055a2240bf526 in wait_for ()
#2  0x00001502f38a88f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x0000155330e1a8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x00001502cb9de8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x000015540ab4c8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x0000152e911a98f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x000014d13472c8f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x0000148e173e18f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x000014da51320929 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  ofi_cq_readfrom (cq_fid=0x47645f0, buf=<optimized out>, count=<optimized out>, src_addr=0x0) at prov/util/src/util_cq.c:280
#2  0x00001520e77a58f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x00001524c64048f3 in PMPI_File_set_view () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#2  0x00001523066c8c79 in cxip_util_cq_progress (util_cq=0x45b4610) at prov/cxi/src/cxip_cq.c:563

kjansen@aurora-uan-0009:~> grep "#3 " out192ReadHang..0
#3  launch_apps () at src/mpiexec/mpiexec.c:1026
#3  0x0000560229c83e23 in execute_command_internal ()
#3  0x0000559373683e23 in execute_command_internal ()
#3  0x000055bf55283e23 in execute_command_internal ()
#3  0x000055807fa83e23 in execute_command_internal ()
#3  0x0000564af9a83e23 in execute_command_internal ()
#3  0x000055e712e83e23 in execute_command_internal ()
#3  0x000055f498683e23 in execute_command_internal ()
#3  0x0000561f4f283e23 in execute_command_internal ()
#3  0x0000557751c83e23 in execute_command_internal ()
#3  0x0000563704483e23 in execute_command_internal ()
#3  0x00005582ece83e23 in execute_command_internal ()
#3  0x000055a224083e23 in execute_command_internal ()
#3  0x00001502e3855c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x0000155320dc7c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x00001502bb98bc61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x00001553faaf9c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x0000152e81156c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x000014d1246d9c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x0000148e0738ec61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x000014da512fdeda in MPIR_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#3  0x0000148dece92559 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#3  0x00001520d7752c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x00001524b63b1c61 in ?? () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#3  0x00001523066a4111 in ofi_cq_readfrom (cq_fid=0x45b4610, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:232

kjansen@aurora-uan-0009:~> grep "#4 " out192ReadHang..0
#4  launch_loop () at src/mpiexec/mpiexec.c:1050
#4  0x0000560229c846d1 in execute_command ()
#4  0x00005593736846d1 in execute_command ()
#4  0x000055bf552846d1 in execute_command ()
#4  0x000055807fa846d1 in execute_command ()
#4  0x0000564af9a846d1 in execute_command ()
#4  0x000055e712e846d1 in execute_command ()
#4  0x000055f4986846d1 in execute_command ()
#4  0x0000561f4f2846d1 in execute_command ()
#4  0x0000557751c846d1 in execute_command ()
#4  0x00005637044846d1 in execute_command ()
#4  0x00005582ece846d1 in execute_command ()
#4  0x000055a2240846d1 in execute_command ()
#4  0x00001502e35fce32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x0000155320b6ee32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x00001502bb732e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x00001553fa8a0e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x0000152e80efde32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x000014d124480e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x0000148e07135e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x000014da50906c7e in PMPI_Allreduce () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#4  0x0000148dece823fd in MPIR_Wait_state () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
#4  0x00001520d74f9e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x00001524b6158e32 in H5FD_read_selection () from /lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ/12-15-001_RB240108_precise/lib/libhdf5.so.310
#4  0x0000152316c1d929 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
```

KennethEJansen commented 8 months ago

Slicing the other way, here is where the first 200-ish of the 809 processes that are not at the MPIR_Allreduce are sitting, in case that tells anyone anything (I can provide more, obviously, but I am unsure how helpful this is).

grep "#0 " out192ReadHang..* |grep -v MPIR_Allreduce | grep -v wait4  >  809DoingWhat.log

pasted lines from 809DoingWhat.log

out192ReadHang..0:#0  0x00001471d09030a9 in poll () from /lib64/libc.so.6
out192ReadHang..0:#0  ofi_genlock_lock (lock=0x4d63490) at ./include/ofi_lock.h:359
out192ReadHang..0:#0  0x0000148ddc7403e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..0:#0  0x000015230a8dc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..1:#0  0x000014abd7887d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..1:#0  0x0000150ec7027af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x4981620, ctrl_evtq=0x4947618, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0  cxip_ep_ctrl_eq_progress (ep_obj=0x4ec9900, ctrl_evtq=0x4ed3698, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..1:#0  0x0000152af605d0ed in ofi_cq_readfrom (cq_fid=0x5ce1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..1:#0  cxip_ep_ctrl_progress (ep_obj=0x5583110) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..10:#0  0x0000146f67dd2b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..10:#0  ofi_mutex_lock_noop (lock=0x5ec5668) at ./include/ofi_lock.h:295
out192ReadHang..10:#0  0x000014aaec80c0bf in ofi_cq_readfrom (cq_fid=0x50af420, buf=0x7ffc37947580, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..10:#0  0x0000145cac7e3566 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..10:#0  0x000014c0aebf10f1 in ofi_cq_readfrom (cq_fid=0x519d460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..10:#0  0x00001539b2d949b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..100:#0  cxip_cq_progress (cq=0x409d290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..100:#0  cxip_util_cq_progress (util_cq=0x57e7460) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..100:#0  0x0000145664c7a919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..100:#0  0x000014d4af3acc30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..100:#0  0x000014d9a26a40ed in ofi_cq_readfrom (cq_fid=0x56df250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..100:#0  0x0000153bdb63a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..101:#0  0x0000150e3e649d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..101:#0  cxip_ep_ctrl_eq_progress (ep_obj=0x42d5660, ctrl_evtq=0x42a0d98, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..101:#0  0x0000146f7d5038b4 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..101:#0  0x0000147d3d6120f1 in ofi_cq_readfrom (cq_fid=0x47a1330, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..102:#0  0x00001525be2920bf in ofi_cq_readfrom (cq_fid=0x5088420, buf=0x7ffdd9447b20, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..102:#0  0x000014899009b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0  0x000015489fce19b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0  cxip_cq_progress (cq=0x5d019a0) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..102:#0  0x000014f4800f59b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..102:#0  0x000014ee3525c9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0  0x0000154218fdf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0  0x0000150b35b54346 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..103:#0  0x0000154218fdf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..103:#0  0x0000150b35b54346 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..103:#0  0x00001478d5cc9c30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..103:#0  0x000014603229e0ed in ofi_cq_readfrom (cq_fid=0x4d25460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..103:#0  0x000014b9f43b19b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..104:#0  ofi_mutex_lock_noop (lock=0x49286a8) at ./include/ofi_lock.h:295
out192ReadHang..104:#0  0x0000149a9598a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..104:#0  0x000014b804b85539 in cxip_cq_eq_progress (eq=0x5e90710, cq=0x5e905f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..104:#0  ofi_cq_read (cq_fid=0x58ed630, buf=0x7fff68146540, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..104:#0  0x00001464285c1539 in cxip_cq_eq_progress (eq=0x4b92750, cq=0x4b92630) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..104:#0  0x00001528fa0759b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..105:#0  cxip_util_cq_progress (util_cq=0x5816460) at prov/cxi/src/cxip_cq.c:560
out192ReadHang..105:#0  cxip_cq_progress (cq=0x5c85630) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..105:#0  0x00001509c72714e7 in cxip_cq_progress (cq=0x4804290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..106:#0  0x0000150f8fd1bd19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0  0x000014bf20773919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0  0x00001481f772a9a0 in __tls_get_addr_slow () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..106:#0  0x0000153ab4496915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..106:#0  0x00001553d92119b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..107:#0  0x0000152762f5e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..107:#0  0x00001483c395d274 in ofi_genlock_unlock (lock=0x5589490) at ./include/ofi_lock.h:364
out192ReadHang..107:#0  0x00001532304864e7 in cxip_cq_progress (cq=0x5410460) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..107:#0  cxip_ep_ctrl_progress (ep_obj=0x5ae37f0) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..107:#0  0x00001482ce11228a in ofi_cq_readfrom (cq_fid=0x5948290, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..108:#0  0x00001515dad2b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0  0x00001520cb3bd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0  0x0000150ff06289b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..108:#0  0x0000151eb4ef9539 in cxip_cq_eq_progress (eq=0x4aca3b0, cq=0x4aca290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..109:#0  0x0000146ca368f566 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..109:#0  0x0000145cd506d9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..109:#0  0x0000150f6a5bf9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..11:#0  0x0000147b94c080ed in ofi_cq_readfrom (cq_fid=0x5a25b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..110:#0  0x0000149dc25eec30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..110:#0  0x000014c9c7dc1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..110:#0  0x00001518d72fa9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..110:#0  0x0000145be20c353d in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..110:#0  0x00001518d72fa9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..110:#0  0x0000145be20c353d in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..111:#0  ofi_genlock_lock (lock=0x45c9660) at ./include/ofi_lock.h:359
out192ReadHang..111:#0  0x0000147f9d6259b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..111:#0  0x000014faf1c6528a in ofi_cq_readfrom (cq_fid=0x462c250, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..111:#0  0x00001466859449a0 in MPIDI_OFI_gpu_progress_task () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..112:#0  cxip_cq_progress (cq=0x5712420) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..112:#0  0x0000149e042559b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..113:#0  0x0000154229929d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..113:#0  0x0000154391bde9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..113:#0  0x0000154cd50149b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..114:#0  0x00001456782a04e7 in cxip_cq_progress (cq=0x520f420) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..114:#0  0x000014e5f6b7c0ed in ofi_cq_readfrom (cq_fid=0x49c5b00, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0  ofi_mutex_lock_noop (lock=0x507bb58) at ./include/ofi_lock.h:295
out192ReadHang..114:#0  0x000014e873c7a0ed in ofi_cq_readfrom (cq_fid=0x4e12250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0  0x00001489c764e0ed in ofi_cq_readfrom (cq_fid=0x5c06290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..114:#0  0x000014894e1dd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..115:#0  0x0000147366ba46e8 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..115:#0  0x00001523097139b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..115:#0  0x000014d5ff3cc234 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..115:#0  0x0000151568a4a236 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..116:#0  0x000014e099b210bf in ofi_cq_readfrom (cq_fid=0x47cf290, buf=0x7ffd19693650, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..116:#0  0x000014721257b4e7 in cxip_cq_progress (cq=0x5d84a60) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..116:#0  0x000014e0591139b5 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..117:#0  0x0000150573ad150a in cxip_cq_eq_progress (eq=0x48c3c20, cq=0x48c3b00) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..117:#0  ofi_cq_read (cq_fid=0x43d3b00, buf=0x7fffafe2ee60, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..117:#0  ofi_genlock_unlock (lock=0x483d300) at ./include/ofi_lock.h:364
out192ReadHang..117:#0  0x000014c51ee5a9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..117:#0  0x0000150e027f3d2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..117:#0  0x0000151f2d4ae0ed in ofi_cq_readfrom (cq_fid=0x4c98290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..118:#0  0x000014d1e28e39b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..118:#0  0x00001552246ce109 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0  ofi_mutex_lock_noop (lock=0x48e24d8) at ./include/ofi_lock.h:295
out192ReadHang..118:#0  0x000014a6c8c95d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0  cxip_cq_eq_progress (eq=0x4352920, cq=0x4352800) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..118:#0  0x000014a6c8c95d10 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..118:#0  cxip_cq_eq_progress (eq=0x4352920, cq=0x4352800) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..118:#0  0x00001468bc18dd19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..119:#0  0x000014564b3160ed in ofi_cq_readfrom (cq_fid=0x49e95f0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..119:#0  0x0000147c53dfd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..119:#0  0x000014cb837cb9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..119:#0  ofi_cq_read (cq_fid=0x519a630, buf=0x7ffe6be63ac0, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..119:#0  0x000014965f1889b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0  0x000014dd6c0f79b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0  0x000014c89cc6aca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..12:#0  0x00001457997e89b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..12:#0  ofi_cq_read (cq_fid=0x41e9610, buf=0x7ffcb0f42160, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..120:#0  0x0000149836cc99b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..120:#0  cxip_ep_ctrl_progress (ep_obj=0x402b660) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..120:#0  0x0000145f480fb549 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..120:#0  0x0000151a30fe4919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..121:#0  ofi_mutex_lock_noop (lock=0x40fc6a8) at ./include/ofi_lock.h:295
out192ReadHang..121:#0  ofi_genlock_lock (lock=0x5ae0660) at ./include/ofi_lock.h:359
out192ReadHang..121:#0  0x000014f9853c2539 in cxip_cq_eq_progress (eq=0x507a580, cq=0x507a460) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..121:#0  cxip_cq_eq_progress (eq=0x445a710, cq=0x445a5f0) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..122:#0  0x000014559bf929b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..122:#0  0x000014643c6bc0ed in ofi_cq_readfrom (cq_fid=0x4523460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..123:#0  0x000014e5bb4d2512 in cxip_cq_eq_progress (eq=0x4a7cbe0, cq=0x4a7cac0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..123:#0  0x000014fdf760a543 in cxi_eq_peek_event (eq=0x5430a58) at /usr/include/cxi_prov_hw.h:1537
out192ReadHang..123:#0  0x000014b1ee9a4539 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..123:#0  0x00001497fc97f284 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..124:#0  0x0000150054cd10f1 in ofi_cq_readfrom (cq_fid=0x49cc460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..124:#0  0x000014b9823040ed in ofi_cq_readfrom (cq_fid=0x5ba4630, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..124:#0  0x000014ae3df3b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0  0x000014a674ae3b12 in cxi_eq_peek_event (eq=0x5bd8c58) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..125:#0  cxip_ep_ctrl_progress (ep_obj=0x58d4e20) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..125:#0  ofi_mutex_lock_noop (lock=0x47b7b58) at ./include/ofi_lock.h:295
out192ReadHang..125:#0  0x000014c08b0609b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0  0x00001458c92e728a in ofi_cq_readfrom (cq_fid=0x3fdd460, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..125:#0  0x000014ec3e0fc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0  0x00001542132729b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..125:#0  0x000014ec3e0fc9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0  0x00001542132729b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..126:#0  0x000014807f36e0bf in ofi_cq_readfrom (cq_fid=0x4ab8420, buf=0x7ffe9110a7e0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..126:#0  0x000014bc0410d0ed in ofi_cq_readfrom (cq_fid=0x4c4a250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..126:#0  0x0000152dcfe4d0ed in ofi_cq_readfrom (cq_fid=0x4f89290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..126:#0  0x0000151fa6123d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..127:#0  0x000014590b29f0ed in ofi_cq_readfrom (cq_fid=0x46f1250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..127:#0  0x00001461aaf4809b in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..127:#0  0x000014d96066f234 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..127:#0  0x000014c7812ce0ed in ofi_cq_readfrom (cq_fid=0x41f6ae0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..127:#0  0x0000149b6f0399b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..128:#0  0x000014788a0910ed in ofi_cq_readfrom (cq_fid=0x5af1290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..128:#0  0x0000153a352b60ed in ofi_cq_readfrom (cq_fid=0x58d8420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..128:#0  0x000014bb79165b12 in cxi_eq_peek_event (eq=0x4197de8) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..128:#0  cxip_util_cq_progress (util_cq=0x3e58aa0) at prov/cxi/src/cxip_cq.c:560
out192ReadHang..128:#0  0x0000152ea5d70301 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..128:#0  cxi_eq_peek_event (eq=0x43eb948) at /usr/include/cxi_prov_hw.h:1532
out192ReadHang..129:#0  0x00001493da6640bf in ofi_cq_readfrom (cq_fid=0x3e1aac0, buf=0x7fff5e0fb9c0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..129:#0  0x0000149f09d6b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..129:#0  0x00001552d6b419b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..129:#0  0x0000151613b123e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..13:#0  0x000014c05360f9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..13:#0  0x0000147d213680fd in ofi_genlock_unlock (lock=0x4210660) at ./include/ofi_lock.h:364
out192ReadHang..13:#0  0x0000152a1dd95506 in cxip_cq_eq_progress (eq=0x54f23b0, cq=0x54f2290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..13:#0  0x00001467aed7b0ed in ofi_cq_readfrom (cq_fid=0x4cbb250, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..130:#0  cxip_ep_ctrl_eq_progress (ep_obj=0x5669660, ctrl_evtq=0x5634d78, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..130:#0  0x000014e57a5deb8e in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=false, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..130:#0  0x000015160ec5b0fd in ofi_genlock_unlock (lock=0x48882c0) at ./include/ofi_lock.h:364
out192ReadHang..130:#0  0x0000146e742e5c30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..131:#0  cxip_ep_ctrl_progress (ep_obj=0x5aa5490) at prov/cxi/src/cxip_ctrl.c:372
out192ReadHang..131:#0  cxip_util_cq_progress (util_cq=0x581c250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..131:#0  0x000014af9e537915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..131:#0  cxip_util_cq_progress (util_cq=0x581c250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..131:#0  0x000014af9e537915 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..131:#0  0x000014725d5070ed in ofi_cq_readfrom (cq_fid=0x5a793a0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..132:#0  0x00001522fe65ab8e in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=false, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..132:#0  0x0000148ea6706506 in cxip_cq_eq_progress (eq=0x43ba540, cq=0x43ba420) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..132:#0  0x0000151bbd27e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..132:#0  0x00001461fa023b12 in cxi_eq_peek_event (eq=0x460d9a8) at /usr/include/cxi_prov_hw.h:1540
out192ReadHang..133:#0  0x00001499d046309b in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..133:#0  0x00001463b556e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..133:#0  cxip_ep_ctrl_progress (ep_obj=0x5956660) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..133:#0  0x0000148aa41da6e9 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..134:#0  0x000014b4f89f4d2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0  0x00001553cb443d1f in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0  0x00001495091089b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..134:#0  0x000014f919324af4 in cxip_ep_ctrl_eq_progress (ep_obj=0x468c620, ctrl_evtq=0x4657bf8, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..134:#0  0x000014a449eadd2a in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..134:#0  0x000014b30d5a7274 in ofi_genlock_unlock (lock=0x559e490) at ./include/ofi_lock.h:364
out192ReadHang..135:#0  0x0000146b816d4919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..135:#0  0x0000149bc645d0bf in ofi_cq_readfrom (cq_fid=0x5bfe460, buf=0x7ffd30b561a0, count=8, src_addr=0x0) at prov/util/src/util_cq.c:221
out192ReadHang..135:#0  cxi_eq_peek_event (eq=0x54f7aa8) at /usr/include/cxi_prov_hw.h:1532
out192ReadHang..135:#0  0x000014f845c67fec in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..135:#0  0x000014d66fc9e9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..136:#0  0x0000150caa4159b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..136:#0  cxip_cq_progress (cq=0x5162460) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..136:#0  ofi_mutex_lock_noop (lock=0x448f308) at ./include/ofi_lock.h:295
out192ReadHang..136:#0  cxip_cq_progress (cq=0x4f18250) at prov/cxi/src/cxip_cq.c:545
out192ReadHang..137:#0  0x00001525f13e6270 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..137:#0  0x00001485304f29b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..137:#0  0x000014f8069c3506 in cxip_cq_eq_progress (eq=0x4d6b710, cq=0x4d6b5f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..137:#0  0x000014c55fbe09b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..138:#0  0x0000145f0b07df7c in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..138:#0  0x0000148be6cd7284 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..138:#0  0x000014820c7669b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0  ofi_mutex_lock_noop (lock=0x4f54308) at ./include/ofi_lock.h:295
out192ReadHang..139:#0  0x00001507266179b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0  ofi_mutex_lock_noop (lock=0x4f54308) at ./include/ofi_lock.h:295
out192ReadHang..139:#0  0x00001507266179b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0  ofi_mutex_lock_noop (lock=0x51954d8) at ./include/ofi_lock.h:295
out192ReadHang..139:#0  0x000014ebddbea9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0  0x000014584c42b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..139:#0  ofi_cq_read (cq_fid=0x51f7420, buf=0x7ffc6d748f40, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..14:#0  0x00001504aef730ed in ofi_cq_readfrom (cq_fid=0x521d5b0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..14:#0  0x000014abcff8b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..14:#0  0x000014a9f79359b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..140:#0  0x000014601de458a5 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..140:#0  0x000014d2a0981506 in cxip_cq_eq_progress (eq=0x402f370, cq=0x402f250) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..140:#0  0x0000148847010d19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..140:#0  0x0000151cca47baf2 in cxip_ep_ctrl_eq_progress (ep_obj=0x4a36e20, ctrl_evtq=0x4a1e3b8, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..140:#0  0x00001517b777e0ed in ofi_cq_readfrom (cq_fid=0x4883290, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..141:#0  cxip_util_cq_progress (util_cq=0x49f5250) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..141:#0  0x000014c7e60772c1 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..141:#0  0x00001553157600ed in ofi_cq_readfrom (cq_fid=0x52315f0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..141:#0  0x000014854741ad19 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..142:#0  0x000014c9e471d563 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..142:#0  cxip_ep_ctrl_progress (ep_obj=0x5155490) at prov/cxi/src/cxip_ctrl.c:374
out192ReadHang..143:#0  0x000014b92e597b90 in cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..143:#0  0x000014d6cbadd0ed in ofi_cq_readfrom (cq_fid=0x59f9420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..143:#0  0x00001495a5b2c28a in ofi_cq_readfrom (cq_fid=0x5ab1290, buf=<optimized out>, count=<optimized out>, src_addr=<optimized out>) at prov/util/src/util_cq.c:282
out192ReadHang..144:#0  0x000014a0ed85477b in update_get_addr () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..144:#0  ofi_genlock_lock (lock=0x555c300) at ./include/ofi_lock.h:359
out192ReadHang..144:#0  0x00001492108854e7 in cxip_cq_progress (cq=0x47f2290) at prov/cxi/src/cxip_cq.c:554
out192ReadHang..145:#0  0x0000148d9aadb274 in ofi_genlock_unlock (lock=0x42bb2c0) at ./include/ofi_lock.h:364
out192ReadHang..145:#0  0x00001490925b9da8 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..146:#0  cxip_cq_eq_progress (eq=0x52c6710, cq=0x52c65f0) at prov/cxi/src/cxip_cq.c:535
out192ReadHang..146:#0  0x00001537bf0cac30 in pthread_spin_lock@plt () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
out192ReadHang..146:#0  0x0000151f8dfbe9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0  0x0000154842f493e0 in ofi_mutex_unlock_noop () at src/common.c:996
out192ReadHang..147:#0  0x000014b3381cf230 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..147:#0  0x00001537706640ed in ofi_cq_readfrom (cq_fid=0x5386b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..147:#0  0x000014b3381cf230 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..147:#0  0x00001537706640ed in ofi_cq_readfrom (cq_fid=0x5386b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..147:#0  cxip_ep_ctrl_eq_progress (ep_obj=<optimized out>, ctrl_evtq=<optimized out>, tx_evtq=true, ep_obj_locked=<optimized out>) at prov/cxi/src/cxip_ctrl.c:355
out192ReadHang..147:#0  0x0000145fb2dbd9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0  0x0000151c8ca649b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..147:#0  cxip_util_cq_progress (util_cq=0x4700290) at prov/cxi/src/cxip_cq.c:566
out192ReadHang..147:#0  0x000014dcbd38f919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..148:#0  0x00001525e3c75919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..148:#0  0x0000154e907c2506 in cxip_cq_eq_progress (eq=0x5535710, cq=0x55355f0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..148:#0  ofi_cq_read (cq_fid=0x560cb40, buf=0x7fff4a0ce600, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..148:#0  0x0000145bc14469b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..148:#0  cxip_ep_ctrl_eq_progress (ep_obj=0x47a9490, ctrl_evtq=0x4774338, tx_evtq=false, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..148:#0  0x00001540e99bf0ed in ofi_cq_readfrom (cq_fid=0x45a7b20, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..149:#0  0x000015076feb7506 in cxip_cq_eq_progress (eq=0x5dc8be0, cq=0x5dc8ac0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..149:#0  0x000014fc837bf0ed in ofi_cq_readfrom (cq_fid=0x52f4460, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..149:#0  0x00001521f95079b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..15:#0  0x000014bdf437b9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..15:#0  0x00001522c0d26539 in cxip_cq_eq_progress (eq=0x5dd95f0, cq=0x5dd94d0) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..150:#0  0x000014ffc7fe8175 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..150:#0  0x00001514ff30d9b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..150:#0  0x000014f9650f2919 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..151:#0  0x000014f2c85f99b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..151:#0  0x0000150652a20da8 in MPIDU_genq_shmem_queue_dequeue () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..151:#0  0x0000149a60fbbaf4 in cxip_ep_ctrl_eq_progress (ep_obj=0x5709de0, ctrl_evtq=0x56f1378, tx_evtq=true, ep_obj_locked=false) at prov/cxi/src/cxip_ctrl.c:320
out192ReadHang..151:#0  0x000014cdafa660ed in ofi_cq_readfrom (cq_fid=0x50a3ae0, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
out192ReadHang..151:#0  0x000015092ec909b3 in pthread_spin_lock () from /lib64/libpthread.so.0
out192ReadHang..152:#0  0x000014c7f60d8517 in cxip_cq_eq_progress (eq=0x53903b0, cq=0x5390290) at prov/cxi/src/cxip_cq.c:508
out192ReadHang..152:#0  0x00001526ae2886e2 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
out192ReadHang..152:#0  0x0000153b51553301 in MPIDI_progress_test () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..152:#0  ofi_cq_read (cq_fid=0x46bc3a0, buf=0x7ffe05a57400, count=8) at prov/util/src/util_cq.c:286
out192ReadHang..152:#0  0x000014906d3800ed in ofi_cq_readfrom (cq_fid=0x4ff0420, buf=<optimized out>, count=8, src_addr=0x0) at prov/util/src/util_cq.c:230
KennethEJansen commented 8 months ago

Getting it down to a page of what I perceive to be the most likely suspects (you can see which functions I have taken my eye off in the grep -v list):

kjansen@aurora-uan-0009:~> grep -v pthread 809DoingWhat.log |grep -v ofi_cq|grep -v MPIDI_progress_test |grep -v cxip_ep_ctrl_progress |grep -v cxi_eq_peek_event |grep -v cxip |grep -v noop |grep -v lib64 |grep -v ofi_ |grep -v MPIDU_genq_shmem_queue_dequeue |grep -v MPIDI_OFI_gpu_progress_task
out192ReadHang..110:#0  0x000014c9c7dc1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..12:#0  0x000014c89cc6aca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..153:#0  0x0000148f14a3cc90 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..175:#0  0x000014fc86d2eca1 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..176:#0  0x000014d58561abb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..180:#0  0x000014685f2c9cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..24:#0  0x000014dae7854b34 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..38:#0  0x000014ca202b6b49 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..44:#0  0x000014d74b4efcce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..46:#0  0x0000151c2f05dc90 in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..47:#0  0x000015276af2ab34 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..50:#0  0x000014cbc3d84ad0 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..51:#0  0x000014d545d91b53 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..55:#0  0x00001501093b2bb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..58:#0  0x000014f9492dabb0 in __tls_get_addr@plt () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..73:#0  0x000014fd5f5e1cce in MPIR_Progress_hook_exec_all () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
out192ReadHang..85:#0  0x0000153b53336b30 in MPIDI_POSIX_eager_recv_begin () from /soft/restricted/CNDA/updates/mpich/52.2/mpich-ofi-all-icc-default-pmix-gpu-drop52/lib/libmpi.so.12
KennethEJansen commented 8 months ago

__tls_get_addr has #2 0x0000150109daceda in MPIR_Allreduce in its backtrace.

As do MPIDI_POSIX_eager_recv_begin and MPIR_Progress_hook_exec_all.

So perhaps that means one of the grep -v'd terms is the villain.

I don't know these functions, so I am giving up and letting more knowledgeable people sift the haystack for the needle.

jrwrigh commented 8 months ago

The shmem and pthreads parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading scheme, not rank-to-rank MPI. I wonder if it's possible to turn off multithreading or the shared memory parallelism. Although I'm not sure what performance hit that will cause.

KennethEJansen commented 8 months ago

96 nodes were able to read that same file, run 1000 steps, and write a new one (297 GB in 12.4 seconds according to PETSc VecView timers).

Currently running a second attempt at 192 nodes to read the file that 96 nodes just wrote. It looks like it succeeded this time (past all the CGNS reads, I think). I am not sure whether this is because CGNS really wrote the file this time (last time it was a cp of a CGNS-written file within a directory whose stripe count had been changed from 16, which the file was originally written with, to -1), or whether we are just at the limits of reliability of CGNS + Lustre and hitting the "locking and contention" issues that motivated subfiling in the papers.

KennethEJansen commented 8 months ago

But we won't get write performance numbers from that run because....

ping failed on x4217c6s6b0n0: No reply from x4309c7s1b0n0.hostmgmt2309.cm.aurora.alcf.anl.gov after 97s

KennethEJansen commented 8 months ago

The shmem and pthreads parts in the backtraces stick out to me right now. That seems like some weird race condition to me, but within some kind of multithreading scheme, not rank-to-rank MPI. I wonder if it's possible to turn off multithreading or the shared memory parallelism. Although I'm not sure what performance hit that will cause.

This raises an interesting question. My Aurora runs have been with 12 ranks per node because the flow solve is entirely on the GPU; the CPUs are "assisting" that phase but are of course primary for IO and problem setup.

When one of those 12 CPU processes on a given node calls CGNS (and, through it, HDF5), is it trying to use threads to gain more parallelism in the read/write, or does it stick to one thread per MPI process?

A related question: is it relevant to debug this with the solver in CPU mode (not taking many steps, because it will be slower) and use the full 104 Sapphire Rapids cores? This would get us into interesting (failing???) MPI process counts with far fewer nodes, since there are 8.67 cores for every "tile" (104 vs 12), so we could get to 10k processes with 96 nodes or 20k with 192. That said, earlier when we were debugging, I was getting hangs when I tried to use 48 processes per node on this problem. The nodes have a TB of memory, so I don't think I was exhausting that.

jrwrigh commented 8 months ago

Back to the original error in this ticket, just documenting a brief code dive:

mismatch in number of children and child IDs read

comes from src/cgns_internals.c:cgi_get_nodes, which is called from cg_open to search for the CGNSLibraryVersion_t node. The mismatch it's talking about is the difference between src/cgns_io.c:cgio_number_children (which counts the number of child nodes of the root node) and src/cgns_io.c:cgio_children_ids (which gets the actual IDs for those children).

Both cgio_number_children and cgio_children_ids call H5Literate2 (key word there is "iterate", not "literate"), which simply loops over the child nodes of the HDF5 group given to it and runs a callback on each. Frankly, I'm not sure how these could disagree with each other. But it'd be interesting to augment the error message to see what it thinks those child counts are and compare them with the actual file (which, iirc, should have only 2 or 3 children under the root node).
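(Editorial aside: for what it's worth, below is a minimal serial sketch of that comparison, assuming the low-level cgio_* interface declared in cgns_io.h (cgio_open_file, cgio_get_root_id, cgio_number_children, cgio_children_ids). It opens a file, counts the root node's children, asks for their IDs, and prints both numbers, which is essentially the pair of calls that cgi_get_nodes compares. A serial run almost certainly won't reproduce a failure that only shows up at 18k ranks, but it confirms what the counts should be on a healthy file.)

/* check_children.c -- hedged sketch, assuming the cgio_* API in cgns_io.h.
 * Prints the two numbers that the "mismatch in number of children and
 * child IDs read" error compares, for the root node of a CGNS/HDF5 file. */
#include <stdio.h>
#include <stdlib.h>
#include "cgns_io.h"

int main(int argc, char **argv)
{
    int cgio, nchildren = 0, nret = 0;
    double root_id, *ids;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file.cgns\n", argv[0]);
        return 1;
    }
    if (cgio_open_file(argv[1], CGIO_MODE_READ, CGIO_FILE_HDF5, &cgio))
        cgio_error_exit("cgio_open_file");
    if (cgio_get_root_id(cgio, &root_id))
        cgio_error_exit("cgio_get_root_id");

    if (cgio_number_children(cgio, root_id, &nchildren))
        cgio_error_exit("cgio_number_children");

    if (nchildren > 0) {
        ids = (double *)malloc((size_t)nchildren * sizeof(double));
        /* start index is 1-based; ask for all nchildren IDs */
        if (cgio_children_ids(cgio, root_id, 1, nchildren, &nret, ids))
            cgio_error_exit("cgio_children_ids");
        free(ids);
    }

    printf("root node: number_children = %d, children_ids returned = %d\n",
           nchildren, nret);

    cgio_close_file(cgio);
    return 0;
}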

KennethEJansen commented 8 months ago

A second attempt at 192 nodes with the 96-node-written input revives our original error for this thread:

[939]PETSC ERROR: CGNS error 1 mismatch in number of children and child IDs read

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep "child IDs" JZ192Nodes1215_240108.o622110 |wc
     24     336    1944

so, at 12 ranks per node, that is 2 nodes that likely hit a bad Lustre connection?

jrwrigh commented 8 months ago

When one of those 12 CPU processes on a given node calls CGNS (and, through it, HDF5), is it trying to use threads to gain more parallelism in the read/write, or does it stick to one thread per MPI process?

I believe it's running multiple threads per MPI process. I can't think of another reason why pthread_spin_lock would be called instead of MPI_Wait if it were single-threaded.
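(Editorial aside: one cheap way to probe the MPI side of that guess is to report the thread level the MPI library actually granted. The sketch below is a hypothetical helper, not part of the solver; it only shows the standard MPI query. If the granted level is below MPI_THREAD_MULTIPLE, the pthread_spin_lock frames in the backtraces are more likely internal to MPICH/libfabric progress than user-level threaded MPI calls.)

/* Hedged sketch: after MPI/PETSc initialization, report the thread level
 * the MPI library granted (SINGLE/FUNNELED/SERIALIZED/MULTIPLE). */
#include <mpi.h>
#include <stdio.h>

void report_thread_level(void)
{
    int provided = MPI_THREAD_SINGLE, rank = 0;
    MPI_Query_thread(&provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("MPI thread level granted: %d (SINGLE=%d, FUNNELED=%d, "
               "SERIALIZED=%d, MULTIPLE=%d)\n",
               provided, MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED,
               MPI_THREAD_SERIALIZED, MPI_THREAD_MULTIPLE);
}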

KennethEJansen commented 8 months ago

I am hopefully not jinxing it, but so far larger process counts are more successful with the stripe count of -1. The 1536-node case has not run yet, but the 768-node case read and wrote correctly all four times:

kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> du -sh *-4[6-9]000.cgns |head
197G    Q2fromQ1_21k-46000.cgns
197G    Q2fromQ1_21k-47000.cgns
197G    Q2fromQ1_21k-48000.cgns
197G    Q2fromQ1_21k-49000.cgns
425M    stats-46000.cgns
425M    stats-47000.cgns
425M    stats-48000.cgns
425M    stats-49000.cgns

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView JZ768Nodes1215_240108.o*
JZ768Nodes1215_240108.o620924:VecView                2 1.0 1.7643e+01 1.0 1.43e+06 2.0 3.5e+05 2.1e+04 3.0e+01  1  0  0  0  0   1  0  0  0  0   720   7269759248      0 0.00e+00    0 0.00e+00  97
JZ768Nodes1215_240108.o620925:VecView                2 1.0 1.8186e+01 1.0 1.43e+06 2.1 3.5e+05 2.1e+04 3.0e+01  1  0  0  0  0   1  0  0  0  0   699   7277891282      0 0.00e+00    0 0.00e+00  97
JZ768Nodes1215_240108.o620926:VecView                2 1.0 1.3742e+01 1.0 1.43e+06 1.7 3.5e+05 2.1e+04 3.0e+01  1  0  0  0  0   1  0  0  0  0   925   7634409062      0 0.00e+00    0 0.00e+00  97
JZ768Nodes1215_240108.o620927:VecView                2 1.0 1.3611e+01 1.0 1.43e+06 2.1 3.5e+05 2.1e+04 3.0e+01  1  0  0  0  0   1  0  0  0  0   933   7258857984      0 0.00e+00    0 0.00e+00  97

We don't have timers yet on the reader, but you can see the writer is called twice, once for a big file and once for a small file. There is some variation in the performance, but 13 to 18 seconds is, I think, more than acceptable for a large and a small file. I am not sure when the 1536-node case will be picked up, as there are only 2048 nodes as far as I know.

KennethEJansen commented 8 months ago

My second battery of jobs is running and, so far, so good, with no read or write failures. Since we didn't really change code and only changed the Lustre striping, I think we have to attribute this behavior to the Lustre striping, or perhaps to luck in not getting bad nodes. Will keep you posted as more data is obtained.

That said we still don't have any data from 1536 nodes. I am wondering if there are even 1536 nodes up.

KennethEJansen commented 8 months ago

With help from Tim Williams, the mystery of why my 1536-node jobs are not running is resolved:

kjansen@aurora-uan-0010:/lus/gecko/projects/PHASTA_aesp_CNDA/petsc-kjansen-forkJZ> /home/zippy/bin/pu_nodeStat EarlyAppAccess

PARTITION: LustreApps (EarlyAppAccess)
------------------------
Nodes  Status
-----  ------
    1  down               
   17  free               
 1417  job-exclusive      
  417  offline            
    0  state-unknown      
   18  state-unknown,down 
    2  state-unknown,down,offline 
    0  state-unknown,offline 
    0  broken             
-----  --------------
 1872  Total Nodes  

I have queued four 1152-node jobs, but of course they will take a while to gain enough priority to run (and to wait for the machine to empty out, since I am not sure they have a drain for large, high-priority jobs at this point anyway).

KennethEJansen commented 8 months ago

WOOHOOO. We are running on 1124 nodes (13488 tiles) and thus have finally broken the 10k GPU barrier (previously CGNS+HDF5+Lustre were erroring out on the read of our inputs).

No code change. It is either the Lustre stripe count of -1 (32 stripes is the max, I think) OR they finally pulled out of service the bad nodes that could not talk properly to the Lustre file system.

I had to qalter my job's node request down to what Tim's script said was available this morning to get it to go. They have a large-job queue problem: I suspect the machine drained all night trying to free up the 200+ nodes that were mysteriously stuck in the job-exclusive category of that script. I have seen this many times on new machines, so I just got around it with qalter.

Job appears to have run the requested 1k steps and written to CGNS correctly as well.

KennethEJansen commented 8 months ago
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> grep VecView  JZ1536Nodes1215_240108.o621462 
VecView                2 1.0 2.2446e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01  2  0  0  0  0   2  0  0  0  0   574   7316723234      0 0.00e+00    0 0.00e+00  97
VecView                2 1.0 2.2786e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01  2  0  0  0  0   2  0  0  0  0   565   7383870184      0 0.00e+00    0 0.00e+00  97
VecView                2 1.0 2.4723e+01 1.0 9.87e+05 2.0 5.2e+05 1.6e+04 3.0e+01  2  0  0  0  0   2  0  0  0  0   521   7377604385      0 0.00e+00    0 0.00e+00  97

So the job ran three times with the same inputs and ran out of time on the 4th. That is 4/4 successful reads and 3/3 successful writes. Note that the log file is misnamed, since I qaltered the node count to what was available (1124).

22-24 seconds is about half the rate we were getting at lower node counts (O(12) seconds), but still not bad.

brtnfld commented 8 months ago

What stripe size are you using? You might try setting the HDF5 alignment to the Lustre stripe size; see CG_CONFIG_HDF5_ALIGNMENT at http://cgns.github.io/CGNS_docs_current/midlevel/fileops.html.

Are you doing independent or collective IO? What are those numbers in terms of bandwidth? Do you have the darshan logs?
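(Editorial aside: to make the alignment suggestion concrete, here is a minimal sketch of the underlying HDF5 knob. The CGNS-level entry point is cg_configure with CG_CONFIG_HDF5_ALIGNMENT per the linked docs, and presumably it controls this same H5Pset_alignment setting; the helper name and the idea of using the 1 MiB stripe size shown later in this thread are assumptions, not the author's code.)

/* Hedged sketch: build an MPI-IO file-access property list in which any
 * allocation of at least `stripe_size` bytes is aligned to `stripe_size`
 * bytes (e.g. the 1 MiB Lustre stripe size). */
#include <mpi.h>
#include <hdf5.h>

hid_t make_aligned_fapl(MPI_Comm comm, MPI_Info info, hsize_t stripe_size)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);                /* parallel HDF5 access */
    H5Pset_alignment(fapl, stripe_size, stripe_size);  /* threshold, alignment */
    return fapl;
}

With CGNS itself you would not build the property list by hand; the point is only what the knob does, and that aligning allocations to the stripe size avoids small, unaligned accesses straddling OSTs.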

KennethEJansen commented 8 months ago

@jedbrown and @jrwrigh might have better answers, but I will share what I know.

Stripe size: we have only set the stripe count. We initially did

kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs setstripe -c 16 .

but then went to -1. There was no direct setting of stripe size, but I guess we can see what I got:

kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs getstripe Q2fromQ1_21k-42000.cgns
Q2fromQ1_21k-42000.cgns
lmm_stripe_count:  16
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 0
    obdidx       objid       objid       group
         0         1168972       0x11d64c      0xa80000405
         3         1167709       0x11d15d      0x980000405
         1         1169071       0x11d6af      0xc80000403
        11         1172051       0x11e253      0x900000405
       103         4342256       0x4241f0      0x440000bd1
        14         1172388       0x11e3a4      0xbc0000404
         7         1168387       0x11d403      0xb40000405
        10         1167154       0x11cf32      0xa40000405
        15         1169510       0x11d866      0xb00000405
         5         1168023       0x11d297      0x940000405
         2         1170305       0x11db81      0xac0000405
         4         1168529       0x11d491      0xc00000404
        12         1167890       0x11d212      0x9c0000405
         6         1169015       0x11d677      0xcc0000405
         8         1166910       0x11ce3e      0xa00000405
       109         3762900       0x396ad4      0x600000bd2

kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs setstripe -c -1 .
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> ls -alt |head -5
total 9900951924
-rw-r--r--  1 kjansen PHASTA_aesp_CNDA       656391 Feb 24 19:01 JZ192Nodes1215_240108.o622004
-rw-r--r--  1 kjansen PHASTA_aesp_CNDA    445378519 Feb 24 19:01 stats-44000.cgns
drwxr-sr-x 79 kjansen PHASTA_aesp_CNDA        73728 Feb 24 19:01 .
-rw-r--r--  1 kjansen PHASTA_aesp_CNDA 211362088776 Feb 24 19:01 Q2fromQ1_21k-44000.cgns
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> mv Q2fromQ1_21k-44000.cgns Q2fromQ1_21k-44000.cgns_asWritten
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> cp Q2fromQ1_21k-44000.cgns_asWritten Q2fromQ1_21k-44000.cgns
kjansen@aurora-uan-0009:/lus/gecko/projects/PHASTA_aesp_CNDA/BumpQ2> lfs getstripe Q2fromQ1_21k-44000.cgns
Q2fromQ1_21k-44000.cgns
lmm_stripe_count:  32
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 1
    obdidx       objid       objid       group
         1         1169095       0x11d6c7      0xc80000403
       110         3835167       0x3a851f      0x680000bd4
         3         1167735       0x11d177      0x980000405
        13         1170984       0x11de28      0xc40000404
        15         1169535       0x11d87f      0xb00000405
         0         1168998       0x11d666      0xa80000405
         7         1168413       0x11d41d      0xb40000405
         5         1168047       0x11d2af      0x940000405
        11         1172072       0x11e268      0x900000405
        10         1167181       0x11cf4d      0xa40000405
       102         4182240       0x3fd0e0      0x4c0000bd1
         9         1167850       0x11d1ea      0xb80000405
        14         1172410       0x11e3ba      0xbc0000404
         4         1168554       0x11d4aa      0xc00000404
         2         1170329       0x11db99      0xac0000405
         6         1169039       0x11d68f      0xcc0000405
         8         1166937       0x11ce59      0xa00000405
       104         3737545       0x3907c9      0x500000bd1
        12         1167916       0x11d22c      0x9c0000405
       105         3952709       0x3c5045      0x640000bd1
       106         3894464       0x3b6cc0      0x6c0000bd1
       112         3842721       0x3aa2a1      0x400000bd4
       107         3878829       0x3b2fad      0x5c0000bd1
       108         3862786       0x3af102      0x540000bd2
       113         3777100       0x39a24c      0x480000bd4
       100         3753931       0x3947cb      0x780000bd1
       115         3744821       0x392435      0x740000bd4
       114         3779125       0x39aa35      0x7c0000405
       111         3672239       0x3808af      0x580000bd1
       103         4342896       0x424470      0x440000bd1
       109         3763538       0x396d52      0x600000bd2
       101         3689664       0x384cc0      0x700000bd1

Jed wrote the writer. I think it is independent, if by that you mean each rank writes its own segment through the parallel routines. I modified the existing reader to read in parallel (assuming that is what you mean by independent??).

Two files are written. The first is 197 GB; the second is much smaller, but I suspect it adds latency that distorts the bandwidth numbers a tad. If I do the math right, 197 GB / 22 s is about 9 GB/s.

I don't know where to find or how to access the Darshan logs, but Jed said that Rob Latham would help us take a look at them. I think he is trying to schedule a meeting for that.

jedbrown commented 8 months ago

I calculated 15-20 GB/s on the smaller node counts (96 or 192), so we're seeing about half that here on 1124 nodes. The stripe size is default, which looks like 1MiB. We use collective IO. I'm working with Rob Latham to get Darshan logs (it's not "supported" on Aurora yet).
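(Editorial aside: for reference, a minimal sketch of where the independent-vs-collective choice lives in the parallel CGNS API, assuming the pcgnslib.h interface; the helper name, path, and communicator are placeholders, and PETSc's CGNS viewer handles this internally, so this is only to show where the knob is.)

/* Hedged sketch: select collective MPI-IO transfers for parallel CGNS and
 * open a file on a communicator. CGP_INDEPENDENT is the other option. */
#include <mpi.h>
#include "pcgnslib.h"

int open_cgns_collective(const char *path, MPI_Comm comm, int *fn)
{
    if (cgp_mpi_comm(comm)) cgp_error_exit();      /* communicator used by cgp_open */
    if (cgp_pio_mode(CGP_COLLECTIVE)) cgp_error_exit();
    return cgp_open(path, CG_MODE_READ, fn);
}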

KennethEJansen commented 8 months ago

According to ALCF, @brtnfld now has access to these project directories. Let me know if you need any orientation beyond the directories and file names I pasted above.