Hello,
I am training a BERT model on Summit using Horovod, DDL, and ibm-wml-ce/1.6.2-2, and it works without an issue. However, when I try to enable LMS it always crashes as follows:
This is how I modified the code:
Any idea how to solve this issue?
The TensorFlow LMS version you are using (TFLMSv2 in WML CE 1.6.2) is a closed-source implementation and its code is not in this repository. See here.
I would highly recommend updating to the new Large Model Support for TensorFlow 2 implementation, which is open source in the master branch of this repository. This implementation is also available from the WML CE early access channel, where it is built into the TensorFlow 2.1.0 GPU package. This link has more information about the WML CE early access channel. You should be able to install this version on Summit. The documentation on how to enable and use this version of LMS is in this GitHub repository.
The issue you are hitting with TFLMSv2 is most likely one we recently found with TFLMSv2 and modern versions of Horovod. Earlier versions of Horovod from last year did not hit this. Horovod changed how it limits the GPU that each of the distributed processes sees: it now uses TensorFlow's GPU limiting functionality, so which part of the process touches the GPU first (LMS, Horovod, or other TensorFlow code) now matters.
In your code you are using LMS auto-tuning, which touches the GPU to find its memory capacity. To work around this issue you should either specify values for the swapout_threshold, swapin_groupby, and swapin_ahead parameters, which avoids the auto-tuning, or explicitly set the memory auto-tuning level using the autotune_gpu_mem LMS property.
So either:
lms_hook = LMS(swapout_threshold=1, swapin_groupby=0, swapin_ahead=1) # These are the max-swapping, slowest-data-throughput parameters. Adding sync_mode=3 would also allow for a higher amount of data.
or
lms_hook = LMS()
lms_hook.autotune_gpu_mem = 15 # Use 15 GB of GPU memory, leaving a bit for base overhead usage.
On the topic of BERT, @FarrandTom has trained BERT large on 16GB GPUs using the TFLMSv2 version. He wrote this post about it: https://medium.com/systems-ai/the-consequence-of-modern-nlp-approaches-647b2cabc5ec
I would suggest reading that article as he mentions which LMS tuning parameter values he used.
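For reference, here is how either option wires into Estimator-based training. This is a minimal sketch, assuming the TFLMSv2 import path implied by the tracebacks later in this thread and the estimator, train_input_fn, and FLAGS names from your run_pretraining_hvd.py:
from tensorflow_large_model_support import LMS

# Option 1: fixed parameters; LMS never probes the GPU, so the
# Horovod first-touch ordering problem is avoided.
lms_hook = LMS(swapout_threshold=1, swapin_groupby=0, swapin_ahead=1)

# Option 2: keep auto-tuning, but tell LMS how much GPU memory to assume
# instead of letting it query the device.
# lms_hook = LMS()
# lms_hook.autotune_gpu_mem = 15  # GB to use, leaving headroom on a 16 GB GPU

# The hook rewrites the graph for swapping when the training session begins.
estimator.train(input_fn=train_input_fn,
                max_steps=FLAGS.num_train_steps,
                hooks=[lms_hook])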
@agemagician
TensorFlow 2.1 in the early access channel is built using CUDA 10.2, and Summit has the 10.1 driver installed. You'll need to use the CUDA compatibility libraries. You can do this by installing the cudatoolkit-dev conda package and adding $CONDA_PREFIX/compat to your LD_LIBRARY_PATH.
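Concretely, something like the following (a sketch; the exact package spec may differ on your system):
conda install cudatoolkit-dev
export LD_LIBRARY_PATH=$CONDA_PREFIX/compat:$LD_LIBRARY_PATH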
@smatzek Thanks a lot for the detailed explanation and for the blog post.
The first method did work out pretty well:
lms_hook = LMS(swapout_threshold=1, swapin_groupby=0, swapin_ahead=1) # These are the max-swapping, slowest-data-throughput parameters. Adding sync_mode=3 would also allow for a higher amount of data.
However, the second method didn't work:
lms_hook = LMS()
lms_hook.autotune_gpu_mem = 15 # For using 15 GB of GPU memory to leave a bit for base overhead usage.
@bethune-bryant Thanks a lot. I am trying now to use the early access channel, but I had to use a clone of the "ibm-wml-ce/1.6.2-2" module environment because of the Summit license, and that makes conda take a long time to resolve environment conflicts.
@agemagician I will look into the autotune_gpu_mem method not working for you. Can you share the error message you got using that method? Was it the same tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0]. error, or was it an out-of-memory error?
With that set, and with LMS constructed like this: LMS(), TFLMSv2 will attempt to measure your model size and auto-tune / predict the optimal runtime values for the tunable parameters. This has varying levels of success depending on your model.
@agemagician
You don't necessarily have to clone all of ibm-wml-ce/1.6.2-2. If you module load ibm-wml-ce and then conda create -n my_env python=3.6 ddl etc..., you can install just the packages you need, which will install from the Summit-specific channel (as long as your .condarc file doesn't have anything higher priority in it). Once those are installed, you can add the early access channel to my_env as shown in the blog post and install the early access packages you need.
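Roughly this sequence (a sketch; the channel URL is taken from the package list later in this thread, and the package names are illustrative):
module load ibm-wml-ce
conda create -n my_env python=3.6 ddl
conda activate my_env
# Prepend the early access channel for this environment only, then pull in
# the early access builds you need:
conda config --env --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
conda install tensorflow-gpu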
@bethune-bryant unfortunately, after I did as you recommended, I got an error every time I tried to run ddlrun:
/sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh: line 0: source: filename argument required
source: usage: source filename [arguments]
/sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh: line 0: source: filename argument required
source: usage: source filename [arguments]
/sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh: line 0: source: filename argument required
source: usage: source filename [arguments]
[ERROR DDL-2-0] Unexpected Error: Sequence index out of range.
Traceback (most recent call last):
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce_beta/bin/ddlrun", line 50, in main
cores = hardware_checks.verify_host_configs(args)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce_beta/lib/python3.6/site-packages/ddlrun/hardware_checks.py", line 46, in verify_host_configs
host_cores.append(split_config[1])
IndexError: list index out of range
Please see /tmp/DDLRUN/DDLRUN.ew1BrFeootoh/ddlrun.log for detailed log.
This error doesn't occur in any other conda environment, only the new one:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 0_gnu conda-forge
_tflow_select 2.1.0 gpu_913.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
absl-py 0.8.1 py36_0
asn1crypto 1.3.0 py36_0
astor 0.8.0 py36_0
atomicwrites 1.3.0 py36_1
attrs 19.3.0 py_0
c-ares 1.15.0 h7b6447c_1001
ca-certificates 2020.1.1 0
certifi 2019.11.28 py36_0
cffi 1.12.3 py36h2e261b9_0
chardet 3.0.4 py36_1003
cloudpickle 1.2.2 py_0
cryptography 2.8 py36h1ba5d50_0
cudatoolkit 10.2.89 654.g0f7a43a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudnn 7.6.5_10.2 624.g338a052 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
ddl 1.5.1 py36_1355.ga7f65f4 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
freetype 2.9.1 h8a8886c_0
gast 0.2.2 py36_0
google-pasta 0.1.8 py36_620.gd00f35a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
graphsurgeon 0.4.1 py36_683.g120274a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
grpcio 1.16.1 py36hf8bcb03_1
h5py 2.10.0 nompi_py36h25dc415_102 conda-forge
hdf5 1.10.5 nompi_h9bc996f_1104 conda-forge
horovod 0.19.0 py36_1096.g1e4bf23 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
idna 2.8 py36_0
importlib_metadata 1.5.0 py36_0
jpeg 9b hcb7ba68_2
keras-applications 1.0.8 py_0
keras-preprocessing 1.1.0 py_1
libblas 3.8.0 14_openblas conda-forge
libcblas 3.8.0 14_openblas conda-forge
libffi 3.2.1 hb209c28_1006 conda-forge
libgcc-ng 8.2.0 hdd5993f_5 conda-forge
libgfortran-ng 8.2.0 h822a55f_5 conda-forge
libgomp 8.2.0 hdd5993f_5 conda-forge
liblapack 3.8.0 14_openblas conda-forge
libopenblas 0.3.7 ha38281c_6 conda-forge
libpng 1.6.37 hbc83047_0
libprotobuf 3.8.0 632.g08dc819 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
libstdcxx-ng 8.2.0 h822a55f_5 conda-forge
libtiff 4.1.0 h2733197_0
markdown 3.1.1 py36_0
more-itertools 8.2.0 py_0
nccl 2.5.6 619.g51c2e94 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
ncurses 6.1 hf484d3e_1002 conda-forge
numactl 2.0.12 626.gb5e1afd https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
numpy 1.17.5 py36hcee8f07_0 conda-forge
olefile 0.46 py36_0
openssl 1.1.1d h7b6447c_4
opt_einsum 3.1.0 py_0
pciutils 3.6.2 625.g804ec60 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
pillow 7.0.0 py36haac5956_0
pip 20.0.2 py_2 conda-forge
pluggy 0.13.1 py36_0
powerai-license 1.7.0.a0 772.ge074133 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
powerai-release 1.7.0.a0 625.g1c389a2 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
powerai-tools 1.7.0.a0 621.g843ad38 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
protobuf 3.8.0 py36_640.gdc7b773 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
psutil 5.6.7 py36h7b6447c_0
py 1.8.1 py_0
pycparser 2.19 py36_0
pyopenssl 19.1.0 py36_0
pysocks 1.7.1 py36_0
pytest 4.4.2 py36_0
python 3.6.7 h88bc6d3_1006 conda-forge
pyyaml 5.1.2 py36h6eb9509_1 conda-forge
readline 8.0 hf8c457e_0 conda-forge
requests 2.22.0 py36_1
scipy 1.4.1 py36h807e534_0 conda-forge
setuptools 45.2.0 py36_0 conda-forge
six 1.13.0 py36_0
spectrum-mpi 10.03 676.ga72dafb https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
sqlite 3.30.1 hd61ad8c_0 conda-forge
tensorboard 2.1.0 py36_3dc74fe_3939.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow 2.1.0 gpu_py36_914.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow-base 2.1.0 gpu_py36_e5bf8de_72632.gbc9303f https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow-estimator 2.1.0 py36_7ec4e5d_1461.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow-gpu 2.1.0 914.g4f6e601 https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorrt 7.0.0.11 py36_683.g120274a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
termcolor 1.1.0 py36_1
tk 8.6.10 h151fe60_0 conda-forge
uff 0.6.5 py36_683.g120274a https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
urllib3 1.25.8 py36_0
werkzeug 0.16.0 py_0
wheel 0.34.2 py_1 conda-forge
wrapt 1.11.2 py36h7b6447c_0
xz 5.2.4 h14c3975_1001 conda-forge
yaml 0.2.2 h6eb9509_1 conda-forge
zipp 2.2.0 py_0
zlib 1.2.11 h6eb9509_1006 conda-forge
zstd 1.3.7 h0b5b093_0
@smatzek I got the following error:
I0214 16:05:54.669146 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:55.062001 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:55.781136 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:55.931550 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:56.013345 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:56.534431 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:56.731864 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:56.732050 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:56.732136 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:57.133826 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:57.134008 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:57.134092 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:57.868971 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:57.869180 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:57.869258 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:58.000467 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:58.000693 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:58.000778 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:58.088652 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:58.088860 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:58.088947 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:58.589145 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:58.589344 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:58.589427 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:11:37.379312 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:37.906052 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode.
I0214 16:11:37.976045 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:37.976279 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0214 16:11:38.402456 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:38.920334 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode.
I0214 16:11:38.990986 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:38.991198 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0214 16:11:40.429673 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:40.940869 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode.
E0214 16:11:40.999454 35184372402448 error_handling.py:75] Error recorded from training_loop: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:40.999622 35184372402448 error_handling.py:101] training_loop marked as finished
W0214 16:11:40.999709 35184372402448 error_handling.py:135] Reraising captured error
I0214 16:11:41.009051 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:41.009251 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
Traceback (most recent call last):
File "run_pretraining_hvd.py", line 542, in <module>
tf.app.run()
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "run_pretraining_hvd.py", line 515, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
log_step_count_steps=log_step_count_steps) as mon_sess:
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
h.begin()
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1538, in begin
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 461, in run
File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 972, in _validate_parameters
ValueError: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:41.501451 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:42.030883 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode.
E0214 16:11:42.097813 35184372402448 error_handling.py:75] Error recorded from training_loop: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:42.098005 35184372402448 error_handling.py:101] training_loop marked as finished
W0214 16:11:42.098096 35184372402448 error_handling.py:135] Reraising captured error
I0214 16:11:42.101529 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:42.101739 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0214 16:11:43.265153 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:43.779389 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:43.794763 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode.
I0214 16:11:43.864676 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:43.864907 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
E0214 16:11:43.912534 35184372402448 error_handling.py:75] Error recorded from training_loop: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:43.912682 35184372402448 error_handling.py:101] training_loop marked as finished
W0214 16:11:43.912794 35184372402448 error_handling.py:135] Reraising captured error
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0214 16:11:44.300842 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode.
I0214 16:11:44.370010 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:44.370211 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[53578,1],4]
Exit code: 1
Thanks @agemagician. The error you are now getting is not the Horovod-LMS first-touch issue. It is this one: "Auto-tuning was unable to find a value for swapout_threshold", which means that LMS was unable to find any workable values for the tunable parameters using its auto-tuning simulator, so you need to specify them manually, which you have already done to successfully avoid the Horovod-LMS issue in previous attempts.
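For example, starting from the parameter values that already worked for you (adding sync_mode=3 is an assumption here, based on the earlier note that it allows for a higher amount of data):
lms_hook = LMS(swapout_threshold=1, swapin_groupby=0, swapin_ahead=1, sync_mode=3)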
@agemagician I'm sorry, I didn't think about it pulling down ddlrun from the EA channel too.
@bethune-bryant unfortunately, after I did as you recommended, I got an error every time I tried to run ddlrun:
The version of ddlrun you were pulling from the early access channel has jsrun integration that is not yet supported by the jsrun version on Summit. To get around that error with the ddlrun from EA, you can add the argument:
ddlrun --launcher "mpirun" ...
@smatzek @bethune-bryant Thanks a lot for all your help. You are the best :)