VI4IO / pfind

Parallel find using MPI
GNU Lesser General Public License v3.0

Running a mpich4.1 compiled pfind fails with communicator unmatched message(s) error #19

Closed Pidad closed 1 year ago

Pidad commented 1 year ago

Running pfind compiled against mpich 4.1 fails at MPI_Finalize with a "communicator ... unmatched message(s)" error, as shown below:

[root@host1 io500-main]# mpiexec -np 90 -f /home/user1/io500/client_nodes /home/user1/io500/io500-main/bin/pfind /gpfs/fs1/2023.02.20-00.09.33 -newer /home/user1/io500/results/2023.02.20-00.09.33/timestampfile -size 3901c -name 01 -C -q 10000
[DONE] found: 11481 (scanned 3359340 files, scanned dirents: 3359526, unknown dirents: 0)
MATCHED 11481/3359340 <<< successful completion of workload

Abort(808024079) on node 0 (rank 0 in comm 0): Fatal error in internal_Finalize: Other MPI error, error stack:
internal_Finalize(50)...........: MPI_Finalize failed
MPII_Finalize(390)..............:
MPIR_finalize_builtin_comms(154):
MPIR_Comm_release_always(1250)..:
MPIR_Comm_delete_internal(1225).: Communicator (handle=44000000) being freed has 1 unmatched message(s)

[root@host1 io500-main]# ldd /home/user1/io500/io500-main/bin/pfind
    linux-vdso.so.1 (0x00007ffde84aa000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f0326601000)
    libgpfs.so => /lib64/libgpfs.so (0x00007f03263eb000)
    libmpi.so.12 => /home/user1/mpich-4.1/install/lib/libmpi.so.12 (0x00007f0323c3a000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f0323878000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f0326983000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f03234e3000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f03232cb000)
    libucp.so.0 => /lib64/libucp.so.0 (0x00007f0322ffd000)
    libucs.so.0 => /lib64/libucs.so.0 (0x00007f0322c5d000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0322a3d000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f0322834000)
    libuct.so.0 => /usr/lib64/libuct.so.0 (0x00007f03225f9000)
    libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x00007f03223ed000)
    libucm.so.0 => /usr/lib64/libucm.so.0 (0x00007f03221d3000)
    libz.so.1 => /usr/lib64/libz.so.1 (0x00007f0321fbc000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f0321db8000)

The failure message indicates that there is an unmatched message on a communicator: some rank sent a message, but no matching recv call was ever posted for it. This would require additional investigation in the pfind code.

The unmatched message is exposed by the changes in mpich PR https://github.com/pmodels/mpich/pull/6186 ("Check for unmatched messages before releasing context_id").
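To make the failure mode concrete, here is a minimal toy model in Python of a communicator that refuses to be freed while a sent message has no matching receive. It is only a sketch under stated assumptions: ToyCommunicator and its methods are hypothetical names invented for illustration, and the bookkeeping does not reflect how mpich or pfind is actually implemented.

```python
from collections import deque


class ToyCommunicator:
    """Hypothetical stand-in for an MPI communicator (not mpich's real structure)."""

    def __init__(self, size):
        # One queue of pending (source, tag, payload) messages per destination rank.
        self.pending = [deque() for _ in range(size)]

    def send(self, source, dest, tag, payload):
        # Model an eager send: the message is queued at the destination right away.
        self.pending[dest].append((source, tag, payload))

    def recv(self, dest, source, tag):
        # Match and remove the oldest pending message with this source and tag.
        queue = self.pending[dest]
        for i, (src, msg_tag, payload) in enumerate(queue):
            if src == source and msg_tag == tag:
                del queue[i]
                return payload
        raise LookupError("no matching message")

    def unmatched(self):
        # Count messages that were sent but never received.
        return sum(len(queue) for queue in self.pending)

    def free(self):
        # Assumption: mimic the spirit of the check by refusing to free with leftovers.
        count = self.unmatched()
        if count:
            raise RuntimeError(
                f"communicator being freed has {count} unmatched message(s)"
            )


comm = ToyCommunicator(size=2)
comm.send(source=0, dest=1, tag=7, payload="work")
comm.send(source=0, dest=1, tag=9, payload="done")  # nobody posts a recv for this
comm.recv(dest=1, source=0, tag=7)

try:
    comm.free()
except RuntimeError as err:
    print(err)
```

In this model the error goes away once rank 1 also receives the tag 9 message before the communicator is freed, which is the general shape of a fix: every send needs a matching receive before finalization.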

Configuring mpich with --enable-error-checking=no (which may not be a good idea) avoided the above failure. However, the run then prints the warning below:

[root@host1 io500-main]# mpiexec -np 10 -f /home/user1/io500/client_nodes /home/user1/io500/io500-main/bin/pfind /gpfs/fs1/2023.02.20-00.09.33 -newer /home/user1/io500/results/2023.02.20-00.09.33/timestampfile -size 3901c -name 01 -C -q 10000
[DONE] found: 11481 (scanned 3359340 files, scanned dirents: 3359526, unknown dirents: 0)
MATCHED 11481/3359340
[1676891053.797137] [host1:1062653:0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x1d86040 was not matched

JulianKunkel commented 1 year ago

Thanks for reporting this. I assume it must be due to the asynchronous passing around of the completion token. Can you check the patch?

Pidad commented 1 year ago

Thanks, @JulianKunkel, for the quick fix. It fixed the reported issue.

More details below. With the fix, pfind terminates cleanly:

[root@host1 io500-main]# /home/user1/mpich-4.1/install/bin/mpiexec -np 100 -f /home/user1/io500/nodesib /home/user1/io500-main/bin/pfind /gpfs/fs1/2023.04.17-00.09.13/ -newer /home/user1/io500/results/1BB/2023.04.17-00.09.13/timestampfile -size 3901c -name 01 -C -q 10000
[DONE] found: 6223 (scanned 1470701 files, scanned dirents: 1470908, unknown dirents: 0)
MATCHED 6223/1470701

Also, with openmpi-4.1-5a1 the following warning was observed before the fix; it is no longer seen with the pfind fix:
[1681716292.051815] [arches1:2665343:0] tag_match.c:62 UCX WARN unexpected tag-receive descriptor 0x157bf80 was not matched

Note that the above failure and warning messages are only seen when pfind is run standalone; they are not observed in io500.sh-based runs.

Pidad commented 1 year ago

Closing as the fix has been merged.