NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for training Click-Through Rate (CTR) estimation models
Apache License 2.0

[Question] Calling 'apply_gradients' on sok.experiment.Variable reports Variable not created in the strategy scope #387

Closed. Nov11 closed this issue 1 year ago.

Nov11 commented 1 year ago

It seems that when the model is created inside strategy.scope(), the sok.Variable is not of type 'MirroredVariable', so TensorFlow complains that 'self.param' was not created under the strategy scope.
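
For comparison, a plain tf.Variable created in the same scope does get wrapped; a quick check (assuming two visible GPUs) is shown below:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    v = tf.Variable(1.0)

print(type(v).__name__)  # MirroredVariable, unlike a sok.Variable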

I looked around in the experiment/example folder and found no example that uses sok.experiment together with a tf.distribute strategy.

My question is: how do I use sok.experiment.Variable with a TensorFlow optimizer?

Code and error message are shown below.

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd
from sparse_operation_kit import experiment as sok
from tensorflow.keras import optimizers

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

class SimpleModel(tf.keras.Model):
    def __init__(self) -> None:
        super().__init__()
        self.row = 32
        self.col = 4

        init_value = np.arange(self.row * self.col).reshape(self.row, self.col)

        self.param = sok.Variable(init_value, dtype=tf.float32)  # nopep8

        self.dense = tf.keras.layers.Dense(1)

    def call(self, feats):
        emb = sok.all2all_dense_embedding(self.param, feats)
        pred = self.dense(emb)
        return pred

if __name__ == "__main__":
    hvd.init()

    strategy = tf.distribute.MirroredStrategy()

    with strategy.scope():
        sok.init()

        model = SimpleModel()
        optimizer = optimizers.SGD(learning_rate=1.0)

        @tf.function
        def train_step(feats):
            with tf.GradientTape() as tape:
                pred = model(feats)
                loss = pred

            # split SOK embedding variables (e) from ordinary dense variables (o)
            e, o = sok.filter_variables(model.trainable_variables)
            print(f'e : {e} o : {o}')

            ge, go = tape.gradient(loss, [e, o])
            optimizer.apply_gradients(zip(go, o))
            optimizer.apply_gradients(zip(ge, e), experimental_aggregate_gradients=False) # this line goes wrong

        idx = np.arange(12)

        def fn(ctx):
            return tf.data.Dataset.from_tensor_slices([idx[:6], idx[6:]])
        ds = strategy.distribute_datasets_from_function(fn)

        ii = iter(ds)
        for _ in range(1):
            item = next(ii)
            strategy.run(train_step, args=(item, ))

        print(model.param)
2023-04-11 12:51:03.860307: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-11 12:51:04.012421: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[INFO]: sparse_operation_kit is imported
WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.
[SOK INFO] Import /usr/local/lib/python3.8/dist-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so
2023-04-11 12:51:08.583267: I tensorflow/core/platform/cpu_feature_guard.cc:194] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-11 12:51:16.474860: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-04-11 12:51:16.474935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30981 MB memory:  -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1a:00.0, compute capability: 7.0
2023-04-11 12:51:16.478640: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-04-11 12:51:16.478700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 30981 MB memory:  -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:1b:00.0, compute capability: 7.0
[SOK INFO] Initialize finished, communication tool: horovod
e : [<tf.Variable 'Variable:0' shape=(32, 4) dtype=float32>] o : [MirroredVariable:{
  0: <tf.Variable 'simple_model/dense/kernel:0' shape=(4, 1) dtype=float32>,
  1: <tf.Variable 'simple_model/dense/kernel/replica_1:0' shape=(4, 1) dtype=float32>
}, MirroredVariable:{
  0: <tf.Variable 'simple_model/dense/bias:0' shape=(1,) dtype=float32>,
  1: <tf.Variable 'simple_model/dense/bias/replica_1:0' shape=(1,) dtype=float32>
}]
e : [<tf.Variable 'Variable:0' shape=(32, 4) dtype=float32>] o : [MirroredVariable:{
  0: <tf.Variable 'simple_model/dense/kernel:0' shape=(4, 1) dtype=float32>,
  1: <tf.Variable 'simple_model/dense/kernel/replica_1:0' shape=(4, 1) dtype=float32>
}, MirroredVariable:{
  0: <tf.Variable 'simple_model/dense/bias:0' shape=(1,) dtype=float32>,
  1: <tf.Variable 'simple_model/dense/bias/replica_1:0' shape=(1,) dtype=float32>
}]
Traceback (most recent call last):
  File "mirror.py", line 63, in <module>
    strategy.run(train_step, args=(item, ))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1315, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2891, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 676, in _call_for_each_replica
    return mirrored_run.call_for_each_replica(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/distribute/mirrored_run.py", line 83, in call_for_each_replica
    return wrapped(args, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/tmp/__autograph_generated_file5vszfocd.py", line 137, in tf__call_for_each_replica
    ag__.if_stmt(ag__.converted_call(ag__.ld(isinstance), (ag__.ld(fn), ag__.ld(def_function).Function), None, fscope), if_body_6, else_body_6, get_state_6, set_state_6, ('_cfer_fn_cache[strategy]', '_cfer_fn_cache[strategy][fn]', 'do_return', 'retval_', 'fn'), 4)
  File "/tmp/__autograph_generated_file5vszfocd.py", line 132, in else_body_6
    retval_ = ag__.converted_call(ag__.ld(_call_for_each_replica), (ag__.ld(strategy), ag__.ld(fn), ag__.ld(args), ag__.ld(kwargs)), None, fscope)
  File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 791, in _distributed_apply
    with distribution.extended.colocate_vars_with(var):
ValueError: in user code:

    File "/usr/lib/python3/dist-packages/six.py", line 703, in reraise
        raise value
    File "/usr/local/lib/python3.8/dist-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 791, in _distributed_apply  **
        with distribution.extended.colocate_vars_with(var):

    ValueError: `colocate_vars_with` must only be passed a variable created in this tf.distribute.Strategy.scope(), not: <tf.Variable 'Variable:0' shape=(32, 4) dtype=float32>
kanghui0204 commented 1 year ago

Please use Horovod instead of MirroredStrategy. In SOK experiment we do not support MirroredStrategy; multi-GPU runs must use Horovod. For a Horovod example, see the example Python script and the accompanying run shell script in the repository.
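
A minimal, untested sketch of what a Horovod-driven version of the snippet above could look like, launched with something like `horovodrun -np 2 python train.py` (SimpleModel is the class defined earlier in this issue; the example scripts in the repository remain the authoritative reference):

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd
from sparse_operation_kit import experiment as sok

hvd.init()
# Pin one GPU per Horovod process instead of mirroring across GPUs.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")
sok.init()  # picks up Horovod as the communication tool

model = SimpleModel()  # the model class defined in the snippet above
optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)

@tf.function
def train_step(feats):
    with tf.GradientTape() as tape:
        loss = model(feats)
    # Split SOK embedding variables (e) from ordinary dense variables (o).
    e, o = sok.filter_variables(model.trainable_variables)
    ge, go = tape.gradient(loss, [e, o])
    # Average dense gradients across workers with Horovod.
    go = [hvd.allreduce(g) for g in go]
    optimizer.apply_gradients(zip(go, o))
    # SOK gradients are already exchanged inside the embedding ops,
    # so bypass the optimizer's own cross-replica aggregation.
    optimizer.apply_gradients(zip(ge, e), experimental_aggregate_gradients=False)

# Each rank feeds its own shard of the indices.
idx = np.arange(12)
feats = tf.convert_to_tensor(idx[:6] if hvd.rank() == 0 else idx[6:])
train_step(feats)
print(model.param)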

Nov11 commented 1 year ago

Oh, thank you.