NVIDIA-Merlin / HugeCTR

HugeCTR is a high-efficiency GPU framework designed for Click-Through-Rate (CTR) estimation training
Apache License 2.0

[BUG] sok amp mode error #462

Open Orca-bit opened 1 month ago

Orca-bit commented 1 month ago

Describe the bug

[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/main.py", line 129, in <module>
[1,0]<stderr>:    trainer = Trainer(
[1,0]<stderr>:  File "/ws/HugeCTR/sparse_operation_kit/SOK_DLRM_Benchmark/trainer.py", line 161, in __init__
[1,0]<stderr>:    self._embedding_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
[1,0]<stderr>:  File "/usr/local/lib/python3.10/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 343, in __call__
[1,0]<stderr>:    raise TypeError(msg)
[1,0]<stderr>:TypeError: "inner_optimizer" must be an instance of `tf.keras.optimizers.Optimizer` or `tf.keras.optimizers.experimental.Optimizer`, but got: <sparse_operation_kit.optimizer.OptimizerWrapperV2 object at 0x7f1b15b44910>.


kanghui0204 commented 1 month ago

The optimizer in SOK is not a TensorFlow optimizer, so you cannot wrap it with tf.keras.mixed_precision.LossScaleOptimizer. Instead, get the current loss-scale value from the dense part's LossScaleOptimizer, divide the embedding gradients by that scale, and then pass the unscaled gradients to the SOK optimizer.
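
A minimal sketch of that pattern, using toy variables: here `embedding_var` and `embedding_optimizer` are plain-TensorFlow stand-ins for the SOK embedding table and the (unwrapped) SOK optimizer, so the exact SOK objects and the DLRM model from the benchmark are not reproduced; only the loss-scaling flow is illustrated.

```python
import tensorflow as tf

# Toy stand-ins: in the benchmark, embedding_var would be the SOK embedding
# table and embedding_optimizer the SOK optimizer wrapper (left unwrapped).
dense_var = tf.Variable(tf.random.normal([8, 1]))
embedding_var = tf.Variable(tf.random.normal([16, 8]))

# Dense part keeps the usual AMP setup with LossScaleOptimizer.
dense_optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.01)
)
# Stand-in for the SOK optimizer; it receives already-unscaled gradients.
embedding_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

@tf.function
def train_step(ids, labels):
    with tf.GradientTape() as tape:
        one_hot = tf.one_hot(ids, depth=16, dtype=tf.float32)   # [batch, 16]
        emb = one_hot @ embedding_var                            # [batch, 8]
        logits = tf.squeeze(emb @ dense_var, axis=-1)            # [batch]
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
        )
        # Scale the loss by the dense optimizer's current loss scale.
        scaled_loss = dense_optimizer.get_scaled_loss(loss)

    dense_grad, emb_grad = tape.gradient(scaled_loss, [dense_var, embedding_var])

    # Dense variables: unscale via the LossScaleOptimizer, then apply
    # (apply_gradients also skips non-finite steps and updates the loss scale).
    (dense_grad,) = dense_optimizer.get_unscaled_gradients([dense_grad])
    dense_optimizer.apply_gradients([(dense_grad, dense_var)])

    # Embedding variables: read the scale from the dense optimizer, divide the
    # gradients by it, and hand the unscaled gradients to the SOK optimizer.
    scale = tf.cast(dense_optimizer.loss_scale, emb_grad.dtype)
    embedding_optimizer.apply_gradients([(emb_grad / scale, embedding_var)])
    return loss

ids = tf.constant([1, 3, 5])
labels = tf.constant([1.0, 0.0, 1.0])
print(train_step(ids, labels).numpy())
```

In a full AMP training loop you would also want to skip the embedding update whenever the dense LossScaleOptimizer skips its step because of non-finite gradients, so the two parts of the model stay in sync.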