NVIDIA / tensorflow

An Open Source Machine Learning Framework for Everyone
https://developer.nvidia.com/deep-learning-frameworks
Apache License 2.0

TF_ENABLE_AUTO_MIXED_PRECISION has no effect #39

Closed donglinz closed 2 years ago

donglinz commented 2 years ago

I am using TensorFlow 2.6 inside the NGC Docker image nvcr.io/nvidia/tensorflow:21.10-tf2-py3, running inference with the pre-trained BERT model from TensorFlow Hub: https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2

After setting TF_ENABLE_AUTO_MIXED_PRECISION=1, nothing seems to happen except for the warning log below:

2021-11-05 08:14:36.094644: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-05 08:14:36.111426: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.111448: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.111471: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.111484: W tensorflow/core/grappler/optimizers/meta_optimizer.cc:146] TF_ENABLE_AUTO_MIXED_PRECISION has no effect.
2021-11-05 08:14:36.250207: W tensorflow/core/util/dump_graph.cc:134] Failed to dump before_mark_for_compilation because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.251315: W tensorflow/core/util/dump_graph.cc:134] Failed to dump mark_for_compilation because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.253102: W tensorflow/core/util/dump_graph.cc:134] Failed to dump mark_for_compilation_annotated because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.254376: W tensorflow/core/util/dump_graph.cc:134] Failed to dump before_increase_dynamism_for_auto_jit_pass because dump location is not  specified through either TF_DUMP_GRAPH_PREFIX environment variable or function argument.
2021-11-05 08:14:36.292182: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x7f000c0092a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:

I also checked the dumped TensorFlow HLO; no operations appear to have been converted to FP16. This feature works well in TensorFlow 1.x. Is it supported in NVIDIA TensorFlow 2 as well?
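For context, the TF1 behavior I am comparing against is roughly the following (a minimal sketch; the graph-building part is elided and assumes a standard TF1 session setup):

```python
import os

# The variable must be set before the session is created so the
# Grappler meta-optimizer picks it up (NVIDIA TF1 containers).
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

import tensorflow as tf  # TF 1.x

with tf.Session() as sess:
    # Build and run the graph here; eligible FP32 ops are
    # automatically rewritten to FP16 by the AMP Grappler pass.
    ...
```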

donglinz commented 2 years ago

Update: this model is not supported by NVIDIA TF1 either, due to the lack of an FP16 Einsum OpKernel:

tensorflow.python.framework.errors_impl.NotFoundError: No registered 'Einsum' OpKernel for 'GPU' devices compatible with node node StatefulPartitionedCall/model/bert_encoder/transformer/layer_0/self_attention/key/einsum/Einsum (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) 
         (OpKernel was found, but attributes didn't match) Requested Attributes: N=2, T=DT_HALF, equation="abc,cde->abde", _device="/job:localhost/replica:0/task:0/device:GPU:0"
        .  Registered:  device='GPU'; T in [DT_COMPLEX128]
  device='GPU'; T in [DT_COMPLEX64]
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_COMPLEX128]
  device='CPU'; T in [DT_COMPLEX64]
  device='CPU'; T in [DT_DOUBLE]
  device='CPU'; T in [DT_FLOAT]

         [[StatefulPartitionedCall/model/bert_encoder/transformer/layer_0/self_attention/key/einsum/Einsum]]

Related issue: https://github.com/NVIDIA/tensorflow/issues/40

donglinz commented 2 years ago

Update: in TF2 the automatic mixed precision Grappler pass can be enabled with config.graph_options.rewrite_options.auto_mixed_precision = 1.
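A minimal sketch of that workaround (this assumes a graph-mode path that goes through Grappler; eager-only code is not affected):

```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

# tf.compat.v1 session path: enable the auto-mixed-precision
# Grappler pass via the session config (RewriterConfig.ON == 1).
config = tf.compat.v1.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = (
    rewriter_config_pb2.RewriterConfig.ON)
sess = tf.compat.v1.Session(config=config)

# Equivalent TF2-native switch for tf.function-traced graphs:
tf.config.optimizer.set_experimental_options(
    {'auto_mixed_precision': True})
```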

Closing this issue.