IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

LMS + Horovod on SUMMIT #24

Closed agemagician closed 4 years ago

agemagician commented 4 years ago

Hello,

I am training a BERT model on Summit using Horovod, DDL, and ibm-wml-ce/1.6.2-2, and it works without issue. However, when I try to enable LMS it always crashes as follows:

I0213 11:10:23.987161 35184372402816 estimator.py:1150] Done calling model_fn.
I0213 11:10:23.988795 35184372402816 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0213 11:10:24.120311 35184372402816 estimator.py:1150] Done calling model_fn.
I0213 11:10:24.121983 35184372402816 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0213 11:10:24.146433 35184372402816 estimator.py:1150] Done calling model_fn.
I0213 11:10:24.148114 35184372402816 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0213 11:10:24.258707 35184372402816 estimator.py:1150] Done calling model_fn.
I0213 11:10:24.260442 35184372402816 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0213 11:10:24.284729 35184372402816 estimator.py:1150] Done calling model_fn.
I0213 11:10:24.286435 35184372402816 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0213 11:10:24.464406 35184372402816 estimator.py:1150] Done calling model_fn.
I0213 11:10:24.466135 35184372402816 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
I0213 11:11:35.217139 35184372402816 lms.py:1275] [LMS][0] Editing model for LMS
I0213 11:11:35.917179 35184372402816 lms.py:1275] [LMS][0] Editing model for LMS
I0213 11:11:36.026071 35184372402816 lms.py:1275] [LMS][0] Editing model for LMS
I0213 11:11:36.100521 35184372402816 lms.py:1275] [LMS][0] Editing model for LMS
I0213 11:11:36.647722 35184372402816 lms.py:1275] [LMS][0] The graph has 42659 vertices and 49693 edges.
I0213 11:11:36.647912 35184372402816 lms.py:1275] [LMS][0] The graph has 1163.38 MiB of learning parameters.
I0213 11:11:36.647990 35184372402816 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.09 GiB
I0213 11:11:36.775090 35184372402816 lms.py:1275] [LMS][0] Editing model for LMS
I0213 11:11:37.355236 35184372402816 lms.py:1275] [LMS][0] The graph has 42659 vertices and 49693 edges.
I0213 11:11:37.355421 35184372402816 lms.py:1275] [LMS][0] The graph has 1163.38 MiB of learning parameters.
I0213 11:11:37.355504 35184372402816 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.09 GiB
I0213 11:11:37.482123 35184372402816 lms.py:1275] [LMS][0] The graph has 42659 vertices and 49693 edges.
I0213 11:11:37.482304 35184372402816 lms.py:1275] [LMS][0] The graph has 1163.38 MiB of learning parameters.
I0213 11:11:37.482389 35184372402816 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.09 GiB
I0213 11:11:37.560845 35184372402816 lms.py:1275] [LMS][0] The graph has 42659 vertices and 49693 edges.
I0213 11:11:37.561020 35184372402816 lms.py:1275] [LMS][0] The graph has 1163.38 MiB of learning parameters.
I0213 11:11:37.561104 35184372402816 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.09 GiB
I0213 11:11:37.572148 35184372402816 lms.py:1275] [LMS][0] Editing model for LMS
I0213 11:11:38.248766 35184372402816 lms.py:1275] [LMS][0] The graph has 42659 vertices and 49693 edges.
I0213 11:11:38.248956 35184372402816 lms.py:1275] [LMS][0] The graph has 1163.38 MiB of learning parameters.
I0213 11:11:38.249040 35184372402816 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.09 GiB
I0213 11:11:39.025905 35184372402816 lms.py:1275] [LMS][0] The graph has 42659 vertices and 49693 edges.
I0213 11:11:39.026104 35184372402816 lms.py:1275] [LMS][0] The graph has 1163.38 MiB of learning parameters.
I0213 11:11:39.026182 35184372402816 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.09 GiB
I0213 11:13:20.344918 35184372402816 lms.py:1275] [LMS][0] Original categorized topological sort has 5486 levels.
I0213 11:13:20.714986 35184372402816 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
2020-02-13 11:13:20.772608: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-13 11:13:20.774204: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x169d436d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:20.774231: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-13 11:13:20.776350: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-13 11:13:20.818419: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:20.818427: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:20.818475: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:20.818589: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:20.818624: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:20.821438: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16a18e810 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:20.821457: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
E0213 11:13:20.821823 35184372402816 error_handling.py:75] Error recorded from training_loop: device CUDA:0 not supported by XLA service
    while setting up XLA_GPU_JIT device number 0
I0213 11:13:20.821963 35184372402816 error_handling.py:101] training_loop marked as finished
W0213 11:13:20.822050 35184372402816 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "run_pretraining_hvd.py", line 514, in <module>
    tf.app.run()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_hvd.py", line 487, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1538, in begin
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 457, in run
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1041, in _search_params
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 92, in __init__
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 102, in _initialize
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2249, in list_devices
    return ListDevices()
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
    while setting up XLA_GPU_JIT device number 0
I0213 11:13:21.037585 35184372402816 lms.py:1275] [LMS][0] Original categorized topological sort has 5486 levels.
I0213 11:13:21.281503 35184372402816 lms.py:1275] [LMS][0] Original categorized topological sort has 5486 levels.
I0213 11:13:21.410260 35184372402816 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
2020-02-13 11:13:21.468180: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-13 11:13:21.469855: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x135766970 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:21.469877: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-13 11:13:21.471944: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-13 11:13:21.473161: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.473167: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.473174: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.473295: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.473335: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.476319: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x135bb1ab0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:21.476331: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
E0213 11:13:21.476681 35184372402816 error_handling.py:75] Error recorded from training_loop: device CUDA:0 not supported by XLA service
    while setting up XLA_GPU_JIT device number 0
I0213 11:13:21.476844 35184372402816 error_handling.py:101] training_loop marked as finished
W0213 11:13:21.476929 35184372402816 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "run_pretraining_hvd.py", line 514, in <module>
    tf.app.run()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_hvd.py", line 487, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1538, in begin
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 457, in run
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1041, in _search_params
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 92, in __init__
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 102, in _initialize
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2249, in list_devices
    return ListDevices()
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
    while setting up XLA_GPU_JIT device number 0
I0213 11:13:21.651515 35184372402816 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
2020-02-13 11:13:21.708091: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-13 11:13:21.709625: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x135b54a90 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:21.709655: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-13 11:13:21.711710: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-13 11:13:21.713175: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.713175: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.713175: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.713275: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.713325: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:21.716522: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x135e91b10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:21.716534: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
E0213 11:13:21.716881 35184372402816 error_handling.py:75] Error recorded from training_loop: device CUDA:0 not supported by XLA service
    while setting up XLA_GPU_JIT device number 0
I0213 11:13:21.717048 35184372402816 error_handling.py:101] training_loop marked as finished
W0213 11:13:21.717134 35184372402816 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "run_pretraining_hvd.py", line 514, in <module>
    tf.app.run()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_hvd.py", line 487, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1538, in begin
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 457, in run
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1041, in _search_params
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 92, in __init__
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 102, in _initialize
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2249, in list_devices
    return ListDevices()
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
    while setting up XLA_GPU_JIT device number 0
I0213 11:13:22.142907 35184372402816 lms.py:1275] [LMS][0] Original categorized topological sort has 5486 levels.
I0213 11:13:22.519449 35184372402816 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
2020-02-13 11:13:22.577900: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-13 11:13:22.579598: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x172cc5fb0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:22.579621: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-13 11:13:22.581727: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-13 11:13:22.583003: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:22.583042: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:22.583127: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:22.583169: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:22.583260: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-13 11:13:22.586370: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1731110f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-13 11:13:22.586386: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
E0213 11:13:22.586922 35184372402816 error_handling.py:75] Error recorded from training_loop: Invalid device ordinal value (1). Valid range is [0, 0].
    while setting up XLA_GPU_JIT device number 1
I0213 11:13:22.587069 35184372402816 error_handling.py:101] training_loop marked as finished
W0213 11:13:22.587175 35184372402816 error_handling.py:135] Reraising captured error
Traceback (most recent call last):
  File "run_pretraining_hvd.py", line 514, in <module>
    tf.app.run()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_hvd.py", line 487, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1538, in begin
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 457, in run
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1041, in _search_params
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 92, in __init__
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/simulator.py", line 102, in _initialize
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/sw/summit/ibm-wml-ce/anaconda-base/envs/ibm-wml-ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2249, in list_devices
    return ListDevices()
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
    while setting up XLA_GPU_JIT device number 1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0213 11:13:22.980506 35184372402816 lms.py:1275] [LMS][0] Original categorized topological sort has 5486 levels.
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[25780,1],2]
  Exit code:    1
--------------------------------------------------------------------------

This is how I modified the code:

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=FLAGS.train_batch_size,
      eval_batch_size=FLAGS.eval_batch_size)

  if FLAGS.do_train:
    tf.logging.info("***** Running training *****")
    tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
    train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=True)

    lms_hook = LMS()

    hooks = [lms_hook, hvd.BroadcastGlobalVariablesHook(0)]
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)

Any idea how to solve this issue?

smatzek commented 4 years ago

The TensorFlow LMS version you are using (TFLMSv2 in WML CE 1.6.2) is a closed-source implementation and its code is not in this repository. See here.

I would highly recommend updating to the new Large Model Support for TensorFlow 2 implementation, which is open source in the master branch of this repository. It is also available from the WML CE early access channel, where it is built into the TensorFlow 2.1.0 GPU package. This link has more information about the WML CE early access channel. You should be able to install this version on Summit. The documentation on how to enable and use this version of LMS is in this GitHub repository.
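
As a rough sketch of what switching to that version looks like (treat this as an illustration and check the master-branch README for the exact, current API), the TF2 LMS is enabled globally for the process rather than through an Estimator hook:

    # Illustrative only: enabling the TensorFlow 2 LMS in the IBM TF 2.1.0 GPU build.
    # Verify the exact API name against the master-branch README before relying on it.
    import tensorflow as tf

    tf.config.experimental.set_lms_enabled(True)  # turn on large model support for this process
    # ...then build and train the model as usual; no LMS() hook object should be needed.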

The issue you are hitting with TFLMSv2 is most likely one we recently found with TFLMSv2 and recent versions of Horovod; earlier Horovod versions from last year did not hit it. Horovod changed how it limits the GPU that each distributed process sees: it now uses TensorFlow's GPU-limiting functionality, so which part of the process touches the GPU first (LMS vs. Horovod vs. other TensorFlow code) now matters.
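
For context, this is roughly the per-process GPU pinning that recent Horovod relies on. A minimal sketch only, assuming a TF 1.x-style session config; names like config and run_config are illustrative, not taken from your script:

    # Sketch of the usual Horovod GPU-pinning pattern (TF 1.x-style API).
    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    config = tf.compat.v1.ConfigProto()
    # Restrict this process to the single GPU matching its local rank.
    # Because this limit is applied through TensorFlow's session config, any
    # code that enumerates or touches GPUs before a session is created with
    # this config (such as LMS auto-tuning) can see a different device set.
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    config.gpu_options.allow_growth = True

    run_config = tf.estimator.RunConfig(session_config=config)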

In your code you are using LMS auto-tuning, which touches the GPU to find its memory capacity. To work around this issue, either specify values for the swapout_threshold, swapin_groupby, and swapin_ahead parameters, which avoids the auto-tuning, or explicitly set the amount of GPU memory the auto-tuning should assume using the autotune_gpu_mem LMS property.

So either:

    lms_hook = LMS(swapout_threshold=1, swapin_groupby=0, swapin_ahead=1)
    # These are the max swapping, slowest data throughput parameters.
    # Adding sync_mode=3 would also allow for higher amount of data.

or

    lms_hook = LMS()
    lms_hook.autotune_gpu_mem = 15 # For using 15 GB of GPU memory to leave a bit for base overhead usage.

On the topic of BERT, @FarrandTom has trained BERT-large on 16 GB GPUs using the TFLMSv2 version. He wrote this post about it: https://medium.com/systems-ai/the-consequence-of-modern-nlp-approaches-647b2cabc5ec

I would suggest reading that article as he mentions which LMS tuning parameter values he used.

bethune-bryant commented 4 years ago

@agemagician TensorFlow 2.1 in the early access channel is built with CUDA 10.2, and Summit has the 10.1 driver installed, so you'll need the CUDA compatibility libraries. You can get them by installing the cudatoolkit-dev conda package and adding $CONDA_PREFIX/compat to your LD_LIBRARY_PATH.

agemagician commented 4 years ago

@smatzek Thanks a lot for the detailed explanation and for the blog post. The first method, lms_hook = LMS(swapout_threshold=1, swapin_groupby=0, swapin_ahead=1), worked out pretty well. However, the second method didn't work:

    lms_hook = LMS()
    lms_hook.autotune_gpu_mem = 15 # For using 15 GB of GPU memory to leave a bit for base overhead usage.

@bethune-bryant Thanks a lot. I am now trying to use the early access channel, but I had to clone the "ibm-wml-ce/1.6.2-2" module environment because of the Summit license, and that makes resolving conda environment conflicts take a lot of time.

smatzek commented 4 years ago

@agemagician I will look into the autotune_gpu_mem method not working for you. Can you share the error message you got when using that method? Was it the same tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0]. or was it an out-of-memory error?

With that property set and with LMS constructed like this: LMS(), TFLMSv2 will attempt to measure your model size and auto-tune/predict the optimal runtime values for the tunable parameters. This has varying levels of success depending on your model.

bethune-bryant commented 4 years ago

@agemagician You don't necessarily have to clone all of ibm-wml-ce/1.6.2-2. If you module load ibm-wml-ce and then conda create -n my_env python=3.6 ddl etc., you can install just the packages you need, and they will come from the Summit-specific channel (as long as your .condarc file doesn't have anything higher priority in it). Once those are installed, you can add the early access channel to my_env as shown in the blog post and install the early access packages you need.

agemagician commented 4 years ago

@bethune-bryant Unfortunately, after I did as you recommended, I get an error every time I try to run ddlrun:

/sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh: line 0: source: filename argument required
source: usage: source filename [arguments]

/sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh: line 0: source: filename argument required
source: usage: source filename [arguments]

/sw/summit/ibm-wml-ce/anaconda-base/etc/profile.d/conda.sh: line 0: source: filename argument required
source: usage: source filename [arguments]

[ERROR DDL-2-0] Unexpected Error: Sequence index out of range.
Traceback (most recent call last):
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce_beta/bin/ddlrun", line 50, in main
    cores = hardware_checks.verify_host_configs(args)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce_beta/lib/python3.6/site-packages/ddlrun/hardware_checks.py", line 46, in verify_host_configs
    host_cores.append(split_config[1])
IndexError: list index out of range
Please see /tmp/DDLRUN/DDLRUN.ew1BrFeootoh/ddlrun.log for detailed log.

This error doesn't occur in any other conda environment, only the new one:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       0_gnu    conda-forge
_tflow_select             2.1.0           gpu_913.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
absl-py                   0.8.1                    py36_0  
asn1crypto                1.3.0                    py36_0  
astor                     0.8.0                    py36_0  
atomicwrites              1.3.0                    py36_1  
attrs                     19.3.0                     py_0  
c-ares                    1.15.0            h7b6447c_1001  
ca-certificates           2020.1.1                      0  
certifi                   2019.11.28               py36_0  
cffi                      1.12.3           py36h2e261b9_0  
chardet                   3.0.4                 py36_1003  
cloudpickle               1.2.2                      py_0  
cryptography              2.8              py36h1ba5d50_0  
cudatoolkit               10.2.89            654.g0f7a43a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
cudnn                     7.6.5_10.2         624.g338a052    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
ddl                       1.5.1           py36_1355.ga7f65f4    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
freetype                  2.9.1                h8a8886c_0  
gast                      0.2.2                    py36_0  
google-pasta              0.1.8           py36_620.gd00f35a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
graphsurgeon              0.4.1           py36_683.g120274a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
grpcio                    1.16.1           py36hf8bcb03_1  
h5py                      2.10.0          nompi_py36h25dc415_102    conda-forge
hdf5                      1.10.5          nompi_h9bc996f_1104    conda-forge
horovod                   0.19.0          py36_1096.g1e4bf23    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
idna                      2.8                      py36_0  
importlib_metadata        1.5.0                    py36_0  
jpeg                      9b                   hcb7ba68_2  
keras-applications        1.0.8                      py_0  
keras-preprocessing       1.1.0                      py_1  
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libffi                    3.2.1             hb209c28_1006    conda-forge
libgcc-ng                 8.2.0                hdd5993f_5    conda-forge
libgfortran-ng            8.2.0                h822a55f_5    conda-forge
libgomp                   8.2.0                hdd5993f_5    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                ha38281c_6    conda-forge
libpng                    1.6.37               hbc83047_0  
libprotobuf               3.8.0              632.g08dc819    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
libstdcxx-ng              8.2.0                h822a55f_5    conda-forge
libtiff                   4.1.0                h2733197_0  
markdown                  3.1.1                    py36_0  
more-itertools            8.2.0                      py_0  
nccl                      2.5.6              619.g51c2e94    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
ncurses                   6.1               hf484d3e_1002    conda-forge
numactl                   2.0.12             626.gb5e1afd    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
numpy                     1.17.5           py36hcee8f07_0    conda-forge
olefile                   0.46                     py36_0  
openssl                   1.1.1d               h7b6447c_4  
opt_einsum                3.1.0                      py_0  
pciutils                  3.6.2              625.g804ec60    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
pillow                    7.0.0            py36haac5956_0  
pip                       20.0.2                     py_2    conda-forge
pluggy                    0.13.1                   py36_0  
powerai-license           1.7.0.a0           772.ge074133    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
powerai-release           1.7.0.a0           625.g1c389a2    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
powerai-tools             1.7.0.a0           621.g843ad38    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
protobuf                  3.8.0           py36_640.gdc7b773    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
psutil                    5.6.7            py36h7b6447c_0  
py                        1.8.1                      py_0  
pycparser                 2.19                     py36_0  
pyopenssl                 19.1.0                   py36_0  
pysocks                   1.7.1                    py36_0  
pytest                    4.4.2                    py36_0  
python                    3.6.7             h88bc6d3_1006    conda-forge
pyyaml                    5.1.2            py36h6eb9509_1    conda-forge
readline                  8.0                  hf8c457e_0    conda-forge
requests                  2.22.0                   py36_1  
scipy                     1.4.1            py36h807e534_0    conda-forge
setuptools                45.2.0                   py36_0    conda-forge
six                       1.13.0                   py36_0  
spectrum-mpi              10.03              676.ga72dafb    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
sqlite                    3.30.1               hd61ad8c_0    conda-forge
tensorboard               2.1.0           py36_3dc74fe_3939.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow                2.1.0           gpu_py36_914.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow-base           2.1.0           gpu_py36_e5bf8de_72632.gbc9303f    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow-estimator      2.1.0           py36_7ec4e5d_1461.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorflow-gpu            2.1.0              914.g4f6e601    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
tensorrt                  7.0.0.11        py36_683.g120274a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
termcolor                 1.1.0                    py36_1  
tk                        8.6.10               h151fe60_0    conda-forge
uff                       0.6.5           py36_683.g120274a    https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda-early-access
urllib3                   1.25.8                   py36_0  
werkzeug                  0.16.0                     py_0  
wheel                     0.34.2                     py_1    conda-forge
wrapt                     1.11.2           py36h7b6447c_0  
xz                        5.2.4             h14c3975_1001    conda-forge
yaml                      0.2.2                h6eb9509_1    conda-forge
zipp                      2.2.0                      py_0  
zlib                      1.2.11            h6eb9509_1006    conda-forge
zstd                      1.3.7                h0b5b093_0  
agemagician commented 4 years ago

@smatzek I got the following error:

I0214 16:05:54.669146 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:55.062001 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:55.781136 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:55.931550 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:56.013345 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:56.534431 35184372402448 lms.py:1275] [LMS][0] Editing model for LMS
I0214 16:05:56.731864 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:56.732050 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:56.732136 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:57.133826 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:57.134008 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:57.134092 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:57.868971 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:57.869180 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:57.869258 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:58.000467 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:58.000693 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:58.000778 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:58.088652 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:58.088860 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:58.088947 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:05:58.589145 35184372402448 lms.py:1275] [LMS][0] The graph has 58391 vertices and 72072 edges.
I0214 16:05:58.589344 35184372402448 lms.py:1275] [LMS][0] The graph has 1451.68 MiB of learning parameters.
I0214 16:05:58.589427 35184372402448 lms.py:1275] [LMS][0] The largest GPU operation is bert/encoder/layer_0/attention/self/dropout/mul_1 consuming 0.42 GiB
I0214 16:11:37.379312 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:37.906052 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
I0214 16:11:37.976045 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:37.976279 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0214 16:11:38.402456 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:38.920334 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
I0214 16:11:38.990986 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:38.991198 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0214 16:11:40.429673 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:40.940869 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
E0214 16:11:40.999454 35184372402448 error_handling.py:75] Error recorded from training_loop: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:40.999622 35184372402448 error_handling.py:101] training_loop marked as finished
W0214 16:11:40.999709 35184372402448 error_handling.py:135] Reraising captured error
I0214 16:11:41.009051 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:41.009251 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
Traceback (most recent call last):
  File "run_pretraining_hvd.py", line 542, in <module>
    tf.app.run()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "run_pretraining_hvd.py", line 515, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps, hooks=hooks)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
    rendezvous.raise_errors()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
    six.reraise(typ, value, traceback)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
    saving_listeners=saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1490, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 584, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1014, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 713, in __init__
    h.begin()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 1538, in begin
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 461, in run
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tflms-2.0.2-py3.6.egg/tensorflow_large_model_support/lms.py", line 972, in _validate_parameters
ValueError: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:41.501451 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:42.030883 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
E0214 16:11:42.097813 35184372402448 error_handling.py:75] Error recorded from training_loop: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:42.098005 35184372402448 error_handling.py:101] training_loop marked as finished
W0214 16:11:42.098096 35184372402448 error_handling.py:135] Reraising captured error
I0214 16:11:42.101529 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:42.101739 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
I0214 16:11:43.265153 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:43.779389 35184372402448 lms.py:1275] [LMS][0] Original categorized topological sort has 6815 levels.
I0214 16:11:43.794763 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
I0214 16:11:43.864676 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:43.864907 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
E0214 16:11:43.912534 35184372402448 error_handling.py:75] Error recorded from training_loop: Auto-tuning was unable to find a value for swapout_threshold. Please specify it manually.
I0214 16:11:43.912682 35184372402448 error_handling.py:101] training_loop marked as finished
W0214 16:11:43.912794 35184372402448 error_handling.py:135] Reraising captured error
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
I0214 16:11:44.300842 35184372402448 lms.py:1275] [LMS][0] Searching values for parameters: swapout_threshold, swapin_ahead, swapin_groupby and sync_mode. 
I0214 16:11:44.370010 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available memory for simulation: 10.87 GiB (memory ratio: 0.8)
I0214 16:11:44.370211 35184372402448 lms.py:1275] [LMS][0] [Simulator] Available CPU memory for simulation: 64.0 GiB
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53578,1],4]
  Exit code:    1
smatzek commented 4 years ago

Thanks @agemagician. The error you are getting now is not the Horovod-LMS first-touch issue. It is "Auto-tuning was unable to find a value for swapout_threshold", which means LMS's auto-tuning simulator could not find workable values for the tunable parameters, so you need to specify them manually, as you already did in your earlier attempts that successfully avoided the Horovod-LMS issue.
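
For reference, here is a minimal sketch of passing those tunable parameters manually to the LMS estimator hook. It assumes the tflms-2.0.2 session-run-hook API visible in the traceback above; the numeric values are placeholders only, not recommendations, and need to be tuned for your model and GPU memory:

```python
# Minimal sketch, assuming the tflms-2.0.2 hook API seen in the traceback above.
# The numeric values are placeholders; tune them for the model and GPU memory.
from tensorflow_large_model_support import LMS

lms_hook = LMS(swapout_threshold=1,  # placeholder value
               swapin_ahead=1,       # placeholder value
               swapin_groupby=0,     # placeholder value
               sync_mode=0)          # placeholder value

# Pass the hook alongside the existing Horovod hooks, for example:
# hooks = [hvd.BroadcastGlobalVariablesHook(0), lms_hook]
# estimator.train(input_fn=train_input_fn,
#                 max_steps=FLAGS.num_train_steps,
#                 hooks=hooks)
```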

bethune-bryant commented 4 years ago

@agemagician I'm sorry, I didn't realize it would pull down ddlrun from the EA channel as well.

> @bethune-bryant unfortunately, after I did as you recommended, I got an error every time I tried to run ddlrun:

The version of ddlrun you were pulling from the early access channel has jsrun integration that is not yet supported by the jsrun version on Summit. To get around that error with the ddlrun from the EA channel, you can add the --launcher argument: ddlrun --launcher "mpirun" ...

agemagician commented 4 years ago

@smatzek @bethune-bryant Thanks a lot for all your help. You are the best :)