[BUG] HPS tensorflow plugin, multi-gpu example crashes

Describe the bug HPS example hps_pretrained_model_training_demo.ipynb crashes.

To Reproduce Steps to reproduce the behavior:

docker pull nvcr.io/nvidia/merlin/merlin-tensorflow:22.09
docker run --runtime=nvidia -it nvcr.io/nvidia/merlin/merlin-tensorflow:22.09 bash
cd /hugectr/hierarchical_parameter_server/notebooks
# run all code blocks of hps_pretrained_model_training_demo.ipynb with dnn.json in notebook

Expected behavior The pre-trained model can be loaded with HPS & trained.

Screenshots

root@b96bb42adece:/hugectr/hierarchical_parameter_server/notebooks# python hps_pretrained_model_training_demo.py
[INFO] hierarchical_parameter_server is imported
2022-10-01 14:15:47.658583: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
2022-10-01 14:15:49.377492: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-01 14:15:51.011734: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-10-01 14:15:51.011800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 77658 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:1f:00.0, compute capability: 8.0
2022-10-01 14:15:51.013072: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-10-01 14:15:51.013100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 77658 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:25:00.0, compute capability: 8.0
2022-10-01 14:15:51.014485: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-10-01 14:15:51.014514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 77658 MB memory:  -> device: 2, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:50:00.0, compute capability: 8.0
2022-10-01 14:15:51.015644: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2022-10-01 14:15:51.015668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 77658 MB memory:  -> device: 3, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:55:00.0, compute capability: 8.0
WARNING:tensorflow:The following Variables were used in a Lambda layer's call (tf.compat.v1.nn.embedding_lookup_sparse), but are not present in its tracked objects:   <tf.Variable 'Variable:0' shape=(100000, 16) dtype=float32>. This is a strong indication that the Lambda layer should be rewritten as a subclassed Layer.
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, 5)]          0           []                               

 tf.compat.v1.nn.embedding_look  (None, 16)          0           ['input_1[0][0]']                
 up_sparse (TFOpLambda)                                                                           

 tf.reshape (TFOpLambda)        (None, 160)          0           ['tf.compat.v1.nn.embedding_looku
                                                                 p_sparse[0][0]']                 

 input_2 (InputLayer)           [(None, 10)]         0           []                               

 tf.concat (TFOpLambda)         (None, 170)          0           ['tf.reshape[0][0]',             
                                                                  'input_2[0][0]']                

 fc1 (Dense)                    (None, 1024)         175104      ['tf.concat[0][0]']              

 fc2 (Dense)                    (None, 256)          262400      ['fc1[0][0]']                    

 fc3 (Dense)                    (None, 1)            257         ['fc2[0][0]']                    

==================================================================================================
Total params: 437,761
Trainable params: 437,761
Non-trainable params: 0
__________________________________________________________________________________________________
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
2022-10-01 14:15:54.925159: I tensorflow/stream_executor/cuda/cuda_blas.cc:1804] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:1082: UserWarning: "`binary_crossentropy` received `from_logits=True`, but the `output` argument was produced by a sigmoid or softmax activation and thus does not represent logits. Was this intended?"
  return dispatch_target(*args, **kwargs)
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
-------------------- Step 0, loss: PerReplica:{
  0: tf.Tensor(0.17562991, shape=(), dtype=float32),
  1: tf.Tensor(0.17909361, shape=(), dtype=float32),
  2: tf.Tensor(0.17878108, shape=(), dtype=float32),
  3: tf.Tensor(0.17324439, shape=(), dtype=float32)
} --------------------
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
-------------------- Step 1, loss: PerReplica:{
  0: tf.Tensor(653.8149, shape=(), dtype=float32),
  1: tf.Tensor(693.7608, shape=(), dtype=float32),
  2: tf.Tensor(613.2731, shape=(), dtype=float32),
  3: tf.Tensor(628.3385, shape=(), dtype=float32)
} --------------------
WARNING:tensorflow:Using MirroredStrategy eagerly has significant overhead currently. We will be working on improving this in the future, but for now please wrap `call_for_each_replica` or `experimental_run` or `run` inside a tf.function to get the best performance.
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
-------------------- Step 2, loss: PerReplica:{
  0: tf.Tensor(37.584198, shape=(), dtype=float32),
  1: tf.Tensor(36.131, shape=(), dtype=float32),
  2: tf.Tensor(38.500664, shape=(), dtype=float32),
  3: tf.Tensor(37.32876, shape=(), dtype=float32)
} --------------------
-------------------- Step 3, loss: PerReplica:{
  0: tf.Tensor(5.023567, shape=(), dtype=float32),
  1: tf.Tensor(3.7619786, shape=(), dtype=float32),
  2: tf.Tensor(4.988394, shape=(), dtype=float32),
  3: tf.Tensor(4.648823, shape=(), dtype=float32)
} --------------------
WARNING:tensorflow:5 out of the last 5 calls to <function _apply_all_reduce.<locals>._all_reduce at 0x7fc34c647940> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
-------------------- Step 4, loss: PerReplica:{
  0: tf.Tensor(1.080203, shape=(), dtype=float32),
  1: tf.Tensor(1.2417698, shape=(), dtype=float32),
  2: tf.Tensor(1.2622243, shape=(), dtype=float32),
  3: tf.Tensor(1.1184206, shape=(), dtype=float32)
} --------------------
WARNING:tensorflow:6 out of the last 6 calls to <function _apply_all_reduce.<locals>._all_reduce at 0x7fc34c647ee0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
-------------------- Step 5, loss: PerReplica:{
  0: tf.Tensor(0.654034, shape=(), dtype=float32),
  1: tf.Tensor(0.7189002, shape=(), dtype=float32),
  2: tf.Tensor(0.66333723, shape=(), dtype=float32),
  3: tf.Tensor(0.6037976, shape=(), dtype=float32)
} --------------------
-------------------- Step 6, loss: PerReplica:{
  0: tf.Tensor(0.79754734, shape=(), dtype=float32),
  1: tf.Tensor(0.9231312, shape=(), dtype=float32),
  2: tf.Tensor(0.90430397, shape=(), dtype=float32),
  3: tf.Tensor(0.91203874, shape=(), dtype=float32)
} --------------------
-------------------- Step 7, loss: PerReplica:{
  0: tf.Tensor(0.22423872, shape=(), dtype=float32),
  1: tf.Tensor(0.211602, shape=(), dtype=float32),
  2: tf.Tensor(0.2190841, shape=(), dtype=float32),
  3: tf.Tensor(0.19895837, shape=(), dtype=float32)
} --------------------
-------------------- Step 8, loss: PerReplica:{
  0: tf.Tensor(1.7644451, shape=(), dtype=float32),
  1: tf.Tensor(1.7413795, shape=(), dtype=float32),
  2: tf.Tensor(1.6232728, shape=(), dtype=float32),
  3: tf.Tensor(1.5175638, shape=(), dtype=float32)
} --------------------
-------------------- Step 9, loss: PerReplica:{
  0: tf.Tensor(0.35069197, shape=(), dtype=float32),
  1: tf.Tensor(0.32513526, shape=(), dtype=float32),
  2: tf.Tensor(0.30032104, shape=(), dtype=float32),
  3: tf.Tensor(0.3842827, shape=(), dtype=float32)
} --------------------
You are using the plugin with MirroredStrategy.
=====================================================HPS Parse====================================================
[HCTR][14:16:00.618][INFO][RK0][main]: dense_file is not specified using default: 
[HCTR][14:16:00.618][INFO][RK0][main]: num_of_refresher_buffer_in_pool is not specified using default: 1
[HCTR][14:16:00.618][INFO][RK0][main]: maxnum_des_feature_per_sample is not specified using default: 26
[HCTR][14:16:00.618][INFO][RK0][main]: refresh_delay is not specified using default: 0
[HCTR][14:16:00.618][INFO][RK0][main]: refresh_interval is not specified using default: 0
====================================================HPS Create====================================================
[HCTR][14:16:00.618][INFO][RK0][main]: Creating HashMap CPU database backend...
[HCTR][14:16:00.618][DEBUG][RK0][main]: Created blank database backend in local memory!
[HCTR][14:16:00.619][INFO][RK0][main]: Volatile DB: initial cache rate = 1
[HCTR][14:16:00.619][INFO][RK0][main]: Volatile DB: cache missed embeddings = 0
[HCTR][14:16:00.619][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][14:16:00.760][INFO][RK0][main]: Table: hps_et.dnn.sparse_embedding0; cached 100000 / 100000 embeddings in volatile database (HashMapBackend); load: 100000 / 18446744073709551615 (0.00%).
[HCTR][14:16:00.760][DEBUG][RK0][main]: Real-time subscribers created!
[HCTR][14:16:00.760][INFO][RK0][main]: Creating embedding cache in device 0.
[HCTR][14:16:00.768][INFO][RK0][main]: Model name: dnn
[HCTR][14:16:00.768][INFO][RK0][main]: Number of embedding tables: 1
[HCTR][14:16:00.768][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
[HCTR][14:16:00.768][INFO][RK0][main]: Use I64 input key: True
[HCTR][14:16:00.768][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][14:16:00.768][INFO][RK0][main]: The size of thread pool: 112
[HCTR][14:16:00.768][INFO][RK0][main]: The size of worker memory pool: 3
[HCTR][14:16:00.768][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][14:16:00.768][INFO][RK0][main]: The refresh percentage : 0.200000
[HCTR][14:16:00.780][INFO][RK0][main]: Creating embedding cache in device 1.
[HCTR][14:16:00.787][INFO][RK0][main]: Model name: dnn
[HCTR][14:16:00.787][INFO][RK0][main]: Number of embedding tables: 1
[HCTR][14:16:00.787][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
[HCTR][14:16:00.787][INFO][RK0][main]: Use I64 input key: True
[HCTR][14:16:00.787][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][14:16:00.787][INFO][RK0][main]: The size of thread pool: 112
[HCTR][14:16:00.787][INFO][RK0][main]: The size of worker memory pool: 3
[HCTR][14:16:00.787][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][14:16:00.787][INFO][RK0][main]: The refresh percentage : 0.200000
[HCTR][14:16:00.790][INFO][RK0][main]: Creating embedding cache in device 2.
[HCTR][14:16:00.796][INFO][RK0][main]: Model name: dnn
[HCTR][14:16:00.796][INFO][RK0][main]: Number of embedding tables: 1
[HCTR][14:16:00.796][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
[HCTR][14:16:00.796][INFO][RK0][main]: Use I64 input key: True
[HCTR][14:16:00.796][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][14:16:00.796][INFO][RK0][main]: The size of thread pool: 112
[HCTR][14:16:00.796][INFO][RK0][main]: The size of worker memory pool: 3
[HCTR][14:16:00.796][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][14:16:00.796][INFO][RK0][main]: The refresh percentage : 0.200000
[HCTR][14:16:00.799][INFO][RK0][main]: Creating embedding cache in device 3.
[HCTR][14:16:00.805][INFO][RK0][main]: Model name: dnn
[HCTR][14:16:00.805][INFO][RK0][main]: Number of embedding tables: 1
[HCTR][14:16:00.805][INFO][RK0][main]: Use GPU embedding cache: True, cache size percentage: 1.000000
[HCTR][14:16:00.805][INFO][RK0][main]: Use I64 input key: True
[HCTR][14:16:00.805][INFO][RK0][main]: Configured cache hit rate threshold: 1.000000
[HCTR][14:16:00.805][INFO][RK0][main]: The size of thread pool: 112
[HCTR][14:16:00.805][INFO][RK0][main]: The size of worker memory pool: 3
[HCTR][14:16:00.805][INFO][RK0][main]: The size of refresh memory pool: 1
[HCTR][14:16:00.805][INFO][RK0][main]: The refresh percentage : 0.200000
[HCTR][14:16:00.866][DEBUG][RK0][main]: Created raw model loader in local memory!
[HCTR][14:16:00.869][INFO][RK0][main]: EC initialization for model: "dnn", num_tables: 1
[HCTR][14:16:00.870][INFO][RK0][main]: EC initialization on device: 0
[HCTR][14:16:00.871][INFO][RK0][main]: EC initialization on device: 1
[HCTR][14:16:00.872][INFO][RK0][main]: EC initialization on device: 2
[HCTR][14:16:00.873][INFO][RK0][main]: EC initialization on device: 3
[HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 0
[HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 1
[HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 2
[HCTR][14:16:00.874][INFO][RK0][main]: Creating lookup session for dnn on device: 3
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_3 (InputLayer)           [(None, 5)]          0           []                               

 sparse_lookup_layer (SparseLoo  (None, 16)          0           ['input_3[0][0]']                
 kupLayer)                                                                                        

 tf.reshape_1 (TFOpLambda)      (None, 160)          0           ['sparse_lookup_layer[0][0]']    

 input_4 (InputLayer)           [(None, 10)]         0           []                               

 tf.concat_1 (TFOpLambda)       (None, 170)          0           ['tf.reshape_1[0][0]',           
                                                                  'input_4[0][0]']                

 new_fc (Dense)                 (None, 1)            171         ['tf.concat_1[0][0]']            

==================================================================================================
Total params: 171
Trainable params: 171
Non-trainable params: 0
__________________________________________________________________________________________________
Traceback (most recent call last):
  File "hps_pretrained_model_training_demo.py", line 307, in <module>
    model = train_with_pretrained_embeddings(args)
  File "hps_pretrained_model_training_demo.py", line 301, in train_with_pretrained_embeddings
    _, loss = strategy.run(_train_step, args=(inputs, labels))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1312, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2888, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 676, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 101, in call_for_each_replica
    return _call_for_each_replica(strategy, fn, args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 283, in _call_for_each_replica
    coord.join(threads)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/coordinator.py", line 385, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/training/coordinator.py", line 293, in stop_on_exception
    yield
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 386, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 595, in wrapper
    return func(*args, **kwargs)
  File "hps_pretrained_model_training_demo.py", line 284, in _train_step
    logit, _ = model(inputs)
  File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "hps_pretrained_model_training_demo.py", line 252, in call
    embeddings = tf.reshape(self.sparse_lookup_layer(sp_ids=input_cat, sp_weights = None, combiner=self.combiner),
  File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/sparse_lookup_layer.py", line 200, in call
    embeddings = lookup_ops.lookup(
  File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/lookup_ops.py", line 98, in lookup
    status = Init(ps_config_file=ps_config_file, global_batch_size=global_batch_size)
  File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/initialize.py", line 225, in Init
    _init_results = _init_wrapper(_run_fn, _init_fn, **kwargs)
  File "/tmp/__autograph_generated_file6j_afal8.py", line 12, in tf___init_wrapper
    retval_ = ag__.converted_call(ag__.ld(run_fn), (ag__.ld(init_fn),), dict(kwargs=ag__.ld(kwargs)), fscope)
RuntimeError: Exception encountered when calling layer "sparse_lookup_layer" (type SparseLookupLayer).

in user code:

    File "/usr/local/lib/python3.8/dist-packages/merlin_hps-1.0.0-py3.8-linux-x86_64.egg/hierarchical_parameter_server/core/initialize.py", line 211, in _init_wrapper  *
        return run_fn(init_fn, kwargs=kwargs)

    RuntimeError: Method requires being in cross-replica context, use get_replica_context().merge_call()

Call arguments received by layer "sparse_lookup_layer" (type SparseLookupLayer):
  • sp_ids=<tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7fc34c6824c0>
  • sp_weights=None
  • name=None
  • combiner='mean'
  • max_norm=None

Environment (please complete the following information):

OS: CentOS-7
DGX-A100
CUDA version: CUDA 11.7 on host os.
Docker image: nvcr.io/nvidia/merlin/merlin-tensorflow:22.09

Additional context This is the only multi-gpu example of hps tensorflow plugin. Is there any detailed guidence of deploying hps tensorflow plugin under multi-gpu environment?

NVIDIA-Merlin / HugeCTR

[BUG] HPS tensorflow plugin, multi-gpu example crashes #362