NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

New version: FasterTransformer BERT EFF-FT vs. Encoder TF has a large diff #484

Open wpf5511 opened 1 year ago

wpf5511 commented 1 year ago

Branch/Tag/Commit

fff55e73445199a0b187bcbc262645f6a750d11c

Docker Image Version

nvidia-tensorflow[horovod]

GPU name

A100

CUDA Driver

11.1

Reproduced Steps

I used the example like this:
python ../examples/tensorflow/bert/bert_example.py \
        --batch_size 32 \
        --max_seq_len 32 \
        --head_number 12 \
        --size_per_head 64 \
        --num_layer 12 \
        --data_type fp16 \
        --test_time 1

I tried many times and the Encoder TF v.s. EFF-FT cross check is always False. Then I used the built lib libtf_bert.so with my actual data, tiling the same instance across the batch_size. With EFF-FT disabled, the final result score matches TF; but when I enable remove padding, the result differs a lot (like 0.78 vs 0.55), and even within the same batch the scores of the identical instances differ. It seems to be a bug.

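The kind of cross check I am doing can be sketched in NumPy (a minimal sketch; `cross_check`, `tf_scores`, and `ft_scores` are hypothetical names standing in for the two backends' classification outputs, and the 0.01 threshold is just an example tolerance):

```python
import numpy as np

def cross_check(tf_scores, ft_scores, threshold=0.01):
    """Compare per-instance scores from two backends.

    Returns (passed, max_diff). Because every row of the tiled batch holds
    the same instance, the scores should also agree row-to-row.
    """
    diff = np.abs(np.asarray(tf_scores, dtype=np.float32)
                  - np.asarray(ft_scores, dtype=np.float32))
    return bool(np.max(diff) < threshold), float(np.max(diff))

# Identical tiled batches pass; a padding-dependent result fails.
tf_scores = np.full(4, 0.78)
ok, max_diff = cross_check(tf_scores, np.full(4, 0.78))
print(ok, max_diff)  # True 0.0
ok, max_diff = cross_check(tf_scores, np.array([0.78, 0.55, 0.78, 0.55]))
print(ok, max_diff)  # False, max diff ≈ 0.23
```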
byshiue commented 1 year ago

Can you provide the reproduction steps (do you use bert_example.py directly?) and the log printed by the program?

wpf5511 commented 1 year ago

> Can you provide the reproduction steps (do you use bert_example.py directly?) and the log printed by the program?

Yes, I used bert_example.py first. Because the script only checks the diff of the output layer, I also focused on the classification score diff, so I ran my actual data and compared the final score against TF. The bert_example.py reproduction log follows:

python ../examples/tensorflow/bert/bert_example.py --batch_size 128 --max_seq_len 256 --head_number 12 --size_per_head 64 --num_layer 4 --data_type fp16 --test_time 1

logs:
2023-03-09 14:15:43.944790: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.

=============== Argument ===============
batch_size: 128
num_layer: 4
max_seq_len: 256
head_number: 12
size_per_head: 64
inter_size: 0
data_type: fp16
test_time: 1
int8_mode: 0
avg_seq_len: -1
thread_num: 1

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:50: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From /data/user/kuangxiu/FasterTransformer/examples/tensorflow/bert/../../../examples/tensorflow/bert/utils/bert.py:195: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /data/user/kuangxiu/FasterTransformer/examples/tensorflow/bert/../../../examples/tensorflow/bert/utils/bert.py:195: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:103: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:103: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

2023-03-09 14:15:58.934448: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:123: The name tf.is_nan is deprecated. Please use tf.math.is_nan instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:129: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:131: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:133: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2023-03-09 14:15:58.965615: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 2595120000 Hz 2023-03-09 14:15:58.966448: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56480ac277a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2023-03-09 14:15:58.966484: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2023-03-09 14:15:58.967744: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1 2023-03-09 14:15:58.990741: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 14:15:58.992112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties: name: A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41 pciBusID: 0000:00:09.0 2023-03-09 14:15:58.992142: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0 2023-03-09 14:15:59.014573: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11 2023-03-09 14:15:59.018386: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10 2023-03-09 14:15:59.018672: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10 2023-03-09 14:15:59.019389: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11 2023-03-09 14:15:59.020268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11 2023-03-09 14:15:59.020409: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic 
library libcudnn.so.8 2023-03-09 14:15:59.020576: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 14:15:59.022097: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 14:15:59.023392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0 2023-03-09 14:15:59.618112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix: 2023-03-09 14:15:59.618169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0 2023-03-09 14:15:59.618177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N 2023-03-09 14:15:59.618466: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 14:15:59.619848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 14:15:59.621186: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 14:15:59.622476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38166 MB memory) -> physical GPU (device: 0, name: A100-SXM4-40GB, pci bus id: 0000:00:09.0, compute capability: 8.0) 2023-03-09 14:15:59.624518: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5648242fa760 initialized for platform CUDA (this does not 
guarantee that XLA will be used). Devices: 2023-03-09 14:15:59.624539: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): A100-SXM4-40GB, Compute Capability 8.0 WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:134: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

0 layer_0/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 1 layer_0/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 2 layer_0/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 3 layer_0/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 4 layer_0/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 5 layer_0/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 6 layer_0/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 7 layer_0/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 8 layer_0/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 9 layer_0/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 10 layer_0/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 11 layer_0/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 12 layer_0/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 13 layer_0/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 14 layer_0/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 15 layer_0/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 16 layer_1/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 17 layer_1/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 18 layer_1/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 19 layer_1/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 20 layer_1/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 21 layer_1/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 22 layer_1/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 23 layer_1/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 24 layer_1/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 25 layer_1/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 26 layer_1/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 27 layer_1/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 28 
layer_1/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 29 layer_1/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 30 layer_1/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 31 layer_1/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 32 layer_2/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 33 layer_2/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 34 layer_2/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 35 layer_2/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 36 layer_2/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 37 layer_2/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 38 layer_2/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 39 layer_2/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 40 layer_2/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 41 layer_2/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 42 layer_2/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 43 layer_2/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 44 layer_2/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 45 layer_2/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 46 layer_2/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 47 layer_2/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 48 layer_3/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 49 layer_3/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 50 layer_3/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 51 layer_3/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 52 layer_3/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 53 layer_3/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 54 layer_3/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 55 layer_3/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 56 
layer_3/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 57 layer_3/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 58 layer_3/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 59 layer_3/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 60 layer_3/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 61 layer_3/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 62 layer_3/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 63 layer_3/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'>
#################################
2023-03-09 14:16:02.978049: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[INFO] Encoder TF v.s. FT with tensor input Cross check False
[INFO] Max diff nan
[INFO] min diff nan
[INFO] Encoder TF v.s. EFF-FT with tensor input Cross check False
[INFO] Max diff 3.1484375
[INFO] min diff 0.0
2023-03-09 14:16:20.238588: I tensorflow/compiler/jit/xla_compilation_cache.cc:241] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer TF-while-time 22.91 ms ( 50 iterations)
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer FT-OP-while-time 4.19 ms ( 50 iterations)
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer EFF-OP-while-time 2.49 ms ( 50 iterations)
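A side note on the "Max diff nan" lines above: NaN there implies the FT output tensor itself contains NaNs, because any arithmetic involving NaN propagates it through the max/min reductions (a minimal NumPy illustration; the arrays are made up):

```python
import math
import numpy as np

tf_out = np.array([0.50, 0.25, 0.75], dtype=np.float16)
ft_out = np.array([0.50, np.nan, 0.75], dtype=np.float16)  # one bad value

diff = np.abs(tf_out - ft_out)
# A single NaN poisons both reductions, exactly as in the log.
print(np.max(diff), np.min(diff))  # nan nan
```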

byshiue commented 1 year ago

I cannot reproduce this issue. Can you provide the scripts you used to build FT? I guess you didn't build it correctly, because the latencies of FT do not make sense.

The results of my side:

[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[INFO] Encoder TF v.s. FT with tensor input Cross check True
[INFO] Max diff 0.01953125
[INFO] min diff 0.0
[INFO] Encoder TF v.s. EFF-FT with tensor input Cross check True
[INFO] Max diff 0.01953125
[INFO] min diff 0.0
2023-03-09 06:32:04.042734: I tensorflow/compiler/jit/xla_compilation_cache.cc:241] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer TF-while-time      48.83 ms ( 50 iterations)
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer FT-OP-while-time   27.94 ms ( 50 iterations)
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer EFF-OP-while-time  14.58 ms ( 50 iterations)
wpf5511 commented 1 year ago

I followed the bert_guide doc, but I did not use the docker image; I used my own TF env. My TF version is nvidia-tensorflow 1.15.5+nv22.12.

The build command is:

```
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON \
      -DTF_PATH=/usr/local/lib/python3.8/dist-packages/tensorflow_core/ ..
make -j12
```

byshiue commented 1 year ago

What docker image do you use? And can you share the output of nvidia-smi?

Besides, can you try running the program with `FT_DEBUG_LEVEL=DEBUG <program>`?

wpf5511 commented 1 year ago
[image: screenshot of nvidia-smi output]

I didn't use a docker image; I installed TF following this: https://blog.csdn.net/qq_33980935/article/details/124091323. I ran with FT_DEBUG_LEVEL=DEBUG but the log seems the same:

[app@qsh4-search-searchbert-1 build]$ FT_DEBUG_LEVEL=DEBUG python ../examples/tensorflow/bert/bert_example.py --batch_size 128 --max_seq_len 256 --head_number 12 --size_per_head 64 --num_layer 4 --data_type fp16 --test_time 1
2023-03-09 17:29:14.824740: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.

=============== Argument ===============
batch_size: 128
num_layer: 4
max_seq_len: 256
head_number: 12
size_per_head: 64
inter_size: 0
data_type: fp16
test_time: 1
int8_mode: 0
avg_seq_len: -1
thread_num: 1

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:50: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:From /data/user/kuangxiu/FasterTransformer/examples/tensorflow/bert/../../../examples/tensorflow/bert/utils/bert.py:195: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /data/user/kuangxiu/FasterTransformer/examples/tensorflow/bert/../../../examples/tensorflow/bert/utils/bert.py:195: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.

WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:103: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:103: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

2023-03-09 17:29:29.743593: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:123: The name tf.is_nan is deprecated. Please use tf.math.is_nan instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:129: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:131: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.

WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:133: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2023-03-09 17:29:29.774405: I tensorflow/core/platform/profile_utils/cpu_utils.cc:109] CPU Frequency: 2595120000 Hz 2023-03-09 17:29:29.775160: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5561e8d26e70 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2023-03-09 17:29:29.775199: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2023-03-09 17:29:29.776282: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1 2023-03-09 17:29:29.817894: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 17:29:29.819268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1669] Found device 0 with properties: name: A100-SXM4-40GB major: 8 minor: 0 memoryClockRate(GHz): 1.41 pciBusID: 0000:00:09.0 2023-03-09 17:29:29.819297: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0 2023-03-09 17:29:29.841426: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11 2023-03-09 17:29:29.845171: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10 2023-03-09 17:29:29.845452: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10 2023-03-09 17:29:29.846209: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11 2023-03-09 17:29:29.847296: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11 2023-03-09 17:29:29.847443: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic 
library libcudnn.so.8 2023-03-09 17:29:29.847603: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 17:29:29.982358: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 17:29:29.983637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Adding visible gpu devices: 0 2023-03-09 17:29:30.456939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1209] Device interconnect StreamExecutor with strength 1 edge matrix: 2023-03-09 17:29:30.456996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1215] 0 2023-03-09 17:29:30.457004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1228] 0: N 2023-03-09 17:29:30.457283: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 17:29:30.458651: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 17:29:30.459974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1082] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2023-03-09 17:29:30.461281: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1354] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 38166 MB memory) -> physical GPU (device: 0, name: A100-SXM4-40GB, pci bus id: 0000:00:09.0, compute capability: 8.0) 2023-03-09 17:29:30.463341: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5562023f8500 initialized for platform CUDA (this does not 
guarantee that XLA will be used). Devices: 2023-03-09 17:29:30.463362: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): A100-SXM4-40GB, Compute Capability 8.0 WARNING:tensorflow:From ../examples/tensorflow/bert/bert_example.py:134: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

0 layer_0/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 1 layer_0/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 2 layer_0/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 3 layer_0/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 4 layer_0/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 5 layer_0/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 6 layer_0/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 7 layer_0/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 8 layer_0/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 9 layer_0/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 10 layer_0/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 11 layer_0/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 12 layer_0/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 13 layer_0/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 14 layer_0/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 15 layer_0/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 16 layer_1/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 17 layer_1/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 18 layer_1/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 19 layer_1/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 20 layer_1/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 21 layer_1/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 22 layer_1/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 23 layer_1/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 24 layer_1/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 25 layer_1/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 26 layer_1/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 27 layer_1/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 28 
layer_1/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 29 layer_1/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 30 layer_1/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 31 layer_1/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 32 layer_2/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 33 layer_2/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 34 layer_2/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 35 layer_2/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 36 layer_2/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 37 layer_2/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 38 layer_2/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 39 layer_2/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 40 layer_2/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 41 layer_2/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 42 layer_2/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 43 layer_2/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 44 layer_2/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 45 layer_2/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 46 layer_2/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 47 layer_2/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 48 layer_3/attention/self/query/kernel:0 (768, 768) <dtype: 'float16_ref'> 49 layer_3/attention/self/query/bias:0 (768,) <dtype: 'float16_ref'> 50 layer_3/attention/self/key/kernel:0 (768, 768) <dtype: 'float16_ref'> 51 layer_3/attention/self/key/bias:0 (768,) <dtype: 'float16_ref'> 52 layer_3/attention/self/value/kernel:0 (768, 768) <dtype: 'float16_ref'> 53 layer_3/attention/self/value/bias:0 (768,) <dtype: 'float16_ref'> 54 layer_3/attention/output/dense/kernel:0 (768, 768) <dtype: 'float16_ref'> 55 layer_3/attention/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 56 
layer_3/attention/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 57 layer_3/attention/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'> 58 layer_3/intermediate/dense/kernel:0 (768, 3072) <dtype: 'float16_ref'> 59 layer_3/intermediate/dense/bias:0 (3072,) <dtype: 'float16_ref'> 60 layer_3/output/dense/kernel:0 (3072, 768) <dtype: 'float16_ref'> 61 layer_3/output/dense/bias:0 (768,) <dtype: 'float16_ref'> 62 layer_3/output/LayerNorm/beta:0 (768,) <dtype: 'float16_ref'> 63 layer_3/output/LayerNorm/gamma:0 (768,) <dtype: 'float16_ref'>
#################################
2023-03-09 17:29:33.578508: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[INFO] Encoder TF v.s. FT with tensor input Cross check False
[INFO] Max diff nan
[INFO] min diff nan
[INFO] Encoder TF v.s. EFF-FT with tensor input Cross check False
[INFO] Max diff 3.7734375
[INFO] min diff 0.0
2023-03-09 17:29:50.786998: I tensorflow/compiler/jit/xla_compilation_cache.cc:241] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer TF-while-time 22.91 ms ( 50 iterations)
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer FT-OP-while-time 4.64 ms ( 50 iterations)
[INFO] batch_size 128 max_seq_len 256 precision FP16 4 layer EFF-OP-while-time 2.99 ms ( 50 iterations)

byshiue commented 1 year ago

Can you try to reproduce this on the recommended docker image? Besides, the bert example is only verified on TF 1; can you try TF 1?