cea-hpc / wi4mpi

Wrapper interface for MPI
BSD 3-Clause "New" or "Revised" License

Resource deadlock avoided with Horovod within a container #74

Closed marcjoos-cea closed 6 months ago

marcjoos-cea commented 7 months ago

Context

Tests

$ <container launcher> -n 40 -c 4 python3 horovod-synth.py
Model: ResNet50
Batch size: 32
Number of GPUs: 40
Iter #0: 5.0 img/sec per GPU
Iter #1: 7.8 img/sec per GPU
Iter #2: 7.8 img/sec per GPU
Iter #3: 7.8 img/sec per GPU
Iter #4: 7.8 img/sec per GPU
Iter #5: 7.8 img/sec per GPU
Iter #6: 7.8 img/sec per GPU
Iter #7: 7.8 img/sec per GPU
Iter #8: 7.9 img/sec per GPU
Iter #9: 7.9 img/sec per GPU
Img/sec per GPU: 7.8 +-0.1
Total img/sec on 40 GPU(s): 312.7 +-3.4
$ <container launcher> -n 40 -c 4 python3 horovod-synth.py
2024-01-24 13:30:45.480400: I tensorflow/core/platform/cpu_feature_guard.cc:181] Beginning TensorFlow 2.15, this package will be updated to install stock TensorFlow 2.15 alongside Intel's TensorFlow CPU extension plugin, which provides all the optimizations available in the package and more. If a compatible version of stock TensorFlow is present, only the extension will get installed. No changes to code or installation setup is needed as a result of this change.
More information on Intel's optimizations for TensorFlow, delivered as TensorFlow extension plugin can be viewed at https://github.com/intel/intel-extension-for-tensorflow.
2024-01-24 13:30:45.480438: I tensorflow/core/platform/cpu_feature_guard.cc:192] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

python3:7 terminated with signal 11 at PC=154bdcb1e90d SP=154b51dfb1d0.  Backtrace:
<wi4mpi install path>/libexec/wi4mpi/libwi4mpi_MPICH_OMPI.so(wi4mpi_set_timeout+0x2d)[0x154bdcb1e90d]
<wi4mpi install path>/libexec/wi4mpi/libwi4mpi_MPICH_OMPI.so(A_MPI_Initialized+0x15)[0x154bdca72255]
/usr/local/lib/python3.10/dist-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(_ZN7horovod6common10MPIContext10InitializeERNS0_17MPIContextManagerE+0x98)[0x154b555bbf48]
/usr/local/lib/python3.10/dist-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(+0x64ad2)[0x154b55567ad2]
/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1af59f0)[0x154b98c0f9f0]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x154bdb86cac3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x154bdb8fdbf4]
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided

python3:7 terminated with signal 6 at PC=154bdb86e9fc SP=154b51df9ba0.  Backtrace:
/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x154bdb86e9fc]
/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x154bdb81a476]
/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x154bdb8007f3]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x154b96d5eb9e]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x154b96d6a20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x154b96d691e9]
/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x154b96d69959]
/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x154bdc9a9884]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x154bdc9a9f41]
/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x154b96d6a4cb]
/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_system_errori+0x96)[0x154b96d6183c]
/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread6detachEv+0x0)[0x154b96d982e0]
/usr/local/lib/python3.10/dist-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xd3d)[0x154b55572e4d]
/lib/x86_64-linux-gnu/libc.so.6(+0x45495)[0x154bdb81d495]
/lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x154bdb81d610]
/lib/x86_64-linux-gnu/libinfinipath.so.4(+0x42a7)[0x154b52ca72a7]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x154bdb81a520]
<wi4mpi install path>/libexec/wi4mpi/libwi4mpi_MPICH_OMPI.so(wi4mpi_set_timeout+0x2d)[0x154bdcb1e90d]
<wi4mpi install path>/libexec/wi4mpi/libwi4mpi_MPICH_OMPI.so(A_MPI_Initialized+0x15)[0x154bdca72255]
/usr/local/lib/python3.10/dist-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(_ZN7horovod6common10MPIContext10InitializeERNS0_17MPIContextManagerE+0x98)[0x154b555bbf48]
/usr/local/lib/python3.10/dist-packages/horovod/tensorflow/mpi_lib.cpython-310-x86_64-linux-gnu.so(+0x64ad2)[0x154b55567ad2]
/usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/../../libtensorflow_framework.so.2(+0x1af59f0)[0x154b98c0f9f0]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x154bdb86cac3]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x154bdb8fdbf4]
srun: error: <node1234>: task 0: Exited with exit code 1

Further comments

OSU MicroBenchmark tests

```dockerfile
RUN pip install mpi4py
```

 - This allows me to validate that Wi4MPI works correctly with Python + MPI using a simple test:
```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

dest = (rank+1)%size
source = (rank-1)%size
data = {'rank': rank, 'rank x answer': 42*rank, 'rank x pi': 3.141592*rank}

comm.send(data, dest=dest)
data = comm.recv(source=source)

print('On process {}, data is {}'.format(rank, data))
```

kevin-juilly commented 7 months ago

From my tests, it seems the timeout feature does not work properly. Threads other than the main one aren't registered, so wi4mpi_set_timeout dereferences a null pointer when called from them. I'm not sure why the error mentions "Resource deadlock avoided"; maybe because the pointer lives in TLS? I will try to write a small reproducer to confirm these observations.