I have been trying to run a simple TensorFlow + Horovod training task using the rocm-tensorflow-rocm5.5-tf2.11-dev TensorFlow container. Unfortunately, the container lacks several Python modules. For instance, importing Horovod fails because psutil is not installed:
$ python3 01_horovod_mnist.py
2023-08-10 13:12:18.690627: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/scratch/pawsey0001/cdipietrantonio/cdipietrantonio-machinelearning/models/01_horovod_mnist.py", line 34, in <module>
    import horovod.tensorflow as hvd
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/tensorflow/__init__.py", line 27, in <module>
    from horovod.tensorflow import elastic
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/tensorflow/elastic.py", line 22, in <module>
    from horovod.common.elastic import run_fn, ObjectState
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/common/elastic.py", line 20, in <module>
    from horovod.runner.elastic.worker import HostUpdateResult, WorkerNotificationManager
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/runner/elastic/worker.py", line 21, in <module>
    from horovod.runner.common.util import network, secret
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/runner/common/util/network.py", line 16, in <module>
    import psutil
ModuleNotFoundError: No module named 'psutil'
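To find every missing module in one pass rather than one traceback at a time, a check along these lines could be run inside the container. This is only a minimal sketch: find_missing is a hypothetical helper name, and the candidate list is an assumption. Only psutil is confirmed missing by the traceback above; the other names are my guess at Horovod's runtime dependencies and may need adjusting.

```python
import importlib.util


def find_missing(modules):
    """Return the top-level module names that cannot be located for import."""
    missing = []
    for name in modules:
        # find_spec returns None when no importer can locate a top-level module.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed candidate list: only psutil is confirmed missing by the traceback.
    # Note that the PyYAML package is imported under the name "yaml".
    candidates = ["psutil", "cloudpickle", "yaml", "packaging"]
    missing = find_missing(candidates)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All candidate modules were found.")
```

Any names it reports could then be installed with pip inside the container, assuming the image permits that.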