NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Dependency issue with nvcr.io/nvidia/tensorflow:22.05-tf1-py3 #82

Closed: karanveersingh5623 closed 2 years ago

karanveersingh5623 commented 2 years ago

Hi team

Referring to the Git repo below:

(https://github.com/mlcommons/hpc_results_v1.0/tree/master/NVIDIA/benchmarks/cosmoflow/implementations/mxnet)

I tried adding mount options to the srun command; the trace is below. The NVIDIA TensorFlow image fails with a dependency error (the mpi4py module is missing). Should I create my own image, or is there a public image that handles this scenario?

srun --ntasks=1 --container-image=nvcr.io#nvidia/tensorflow:22.05-tf1-py3 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/root/hpc_results_v1.0/NVIDIA/benchmarks/cosmoflow/implementations/mxnet:/mnt bash tools/init_datasets.sh /mnt/cosmoUniverse_2019_05_4parE_tf_small /mnt/processed

pyxis: importing docker image ...
2022-06-29 01:18:38.954846: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/mnt/tools/convert_tfrecord_to_numpy.py", line 8, in <module>
    from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'
2022-06-29 01:18:40.133354: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 185, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.8/runpy.py", line 111, in _get_module_details
    __import__(pkg_name)
  File "/mnt/tools/convert_tfrecord_to_numpy.py", line 8, in <module>
    from mpi4py import MPI
ModuleNotFoundError: No module named 'mpi4py'
tools/init_datasets.sh: line 6: /root/deepops/processed/train/files_data.lst: No such file or directory
ls: cannot access '/root/deepops/processed/train': No such file or directory
tools/init_datasets.sh: line 7: /root/deepops/processed/validation/files_data.lst: No such file or directory
ls: cannot access '/root/deepops/processed/validation': No such file or directory
tools/init_datasets.sh: line 8: /root/deepops/processed/train/files_label.lst: No such file or directory
ls: cannot access '/root/deepops/processed/train': No such file or directory
tools/init_datasets.sh: line 9: /root/deepops/processed/validation/files_label.lst: No such file or directory
ls: cannot access '/root/deepops/processed/validation': No such file or directory
srun: error: mlperf1: task 0: Exited with exit code 1

flx42 commented 2 years ago

My understanding is that you need to build your own Docker image using this Dockerfile: https://github.com/mlcommons/hpc_results_v1.0/blob/master/NVIDIA/benchmarks/cosmoflow/implementations/mxnet/Dockerfile
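
For reference, a rough sketch of what that workflow could look like. It has not been run against this benchmark, and several things are assumptions: the image tag, the squashfs file name, and the DRY_RUN switch are made-up names; it assumes Docker and enroot are available on the build host; and it assumes the benchmark's Dockerfile installs mpi4py and provides a python executable, which the third command is there to check. The srun arguments are copied from the original command, minus --container-name. By default the script only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only. IMAGE_TAG, SQSH and the DRY_RUN switch are hypothetical names,
# not something the benchmark repo or pyxis defines.
set -eu

# Path taken from the --container-mounts option in the original srun command.
IMPL_DIR=/root/hpc_results_v1.0/NVIDIA/benchmarks/cosmoflow/implementations/mxnet
IMAGE_TAG=cosmoflow-mxnet:local          # hypothetical local tag
SQSH="$PWD/cosmoflow-mxnet.sqsh"         # hypothetical squashfs output file
DRY_RUN="${DRY_RUN:-1}"                  # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Build the image from the Dockerfile in the benchmark directory.
run docker build -t "$IMAGE_TAG" "$IMPL_DIR"

# 2. Export the image from the local Docker daemon to a squashfs file.
run enroot import -o "$SQSH" "dockerd://$IMAGE_TAG"

# 3. Check that mpi4py is importable inside the new image (assumption to verify).
run srun --ntasks=1 --container-image="$SQSH" python -c "import mpi4py"

# 4. Re-run the preprocessing step with the new image.
run srun --ntasks=1 --container-image="$SQSH" --container-workdir=/mnt/ \
    --container-mounts="$IMPL_DIR:/mnt" \
    bash tools/init_datasets.sh /mnt/cosmoUniverse_2019_05_4parE_tf_small /mnt/processed
```

Setting DRY_RUN=0 executes the commands instead of printing them. Pushing the image to a registry that the compute nodes can reach, and pointing --container-image at it, would be an alternative to the squashfs file.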