amd / InfinityHub-CI

MIT License

Tensorflow container lacks some Python modules #8

Closed dipietrantonio closed 3 months ago

dipietrantonio commented 1 year ago

I have been trying to run a simple TensorFlow + Horovod training task using the rocm-tensorflow-rocm5.5-tf2.11-dev TensorFlow container. Unfortunately, it lacks several Python modules. For instance,

$ python3 01_horovod_mnist.py 
2023-08-10 13:12:18.690627: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/scratch/pawsey0001/cdipietrantonio/cdipietrantonio-machinelearning/models/01_horovod_mnist.py", line 34, in <module>
    import horovod.tensorflow as hvd
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/tensorflow/__init__.py", line 27, in <module>
    from horovod.tensorflow import elastic
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/tensorflow/elastic.py", line 22, in <module>
    from horovod.common.elastic import run_fn, ObjectState
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/common/elastic.py", line 20, in <module>
    from horovod.runner.elastic.worker import HostUpdateResult, WorkerNotificationManager
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/runner/elastic/worker.py", line 21, in <module>
    from horovod.runner.common.util import network, secret
  File "/usr/local/lib/python3.9/dist-packages/horovod-0.27.0-py3.9-linux-x86_64.egg/horovod/runner/common/util/network.py", line 16, in <module>
    import psutil
ModuleNotFoundError: No module named 'psutil'
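As a workaround until the image is rebuilt, the missing package can be installed inside the container with `python3 -m pip install psutil`. A small sketch for checking which of Horovod's runtime dependencies are actually importable before launching a job (the list of names here is an assumption based on the traceback above, not an exhaustive requirements list):

```python
import importlib.util

# Modules Horovod imports lazily at runtime; psutil is the one
# missing from the container per the traceback. Extend this tuple
# with any other suspects for your workload.
candidates = ("psutil", "cloudpickle")

# find_spec() returns None when a module cannot be located,
# without actually importing it (so no side effects).
missing = [name for name in candidates
           if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All checked modules are available.")
```

Running this inside the container before `import horovod.tensorflow` gives a clearer diagnostic than the deep traceback above.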
cmcknigh commented 3 months ago

This was relayed to the team that handles publishing the ROCm TensorFlow containers. The issue should already be resolved.