NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0
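In practice, using the library means swapping in Transformer Engine modules (e.g. `te.Linear`) and running them under `te.fp8_autocast`. A minimal sketch of that pattern, with arbitrary sizes (FP8 execution additionally requires Hopper/Ada hardware and tensor dimensions that are multiples of 16):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Sizes are arbitrary; FP8 GEMMs want dimensions divisible by 16.
layer = te.Linear(1024, 1024, bias=True, params_dtype=torch.float16)
x = torch.randn(2048, 1024, device="cuda", dtype=torch.float16)

# Delayed scaling is the standard FP8 recipe; default settings assumed here.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe.DelayedScaling()):
    y = layer(x)  # forward GEMM executes in FP8
```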

FP8 vs FP16 performance (seq2seq transformer with te.Linear replacing nn.Linear layers) #230

Open · vince62s opened this issue 1 year ago

vince62s commented 1 year ago

Here is what I am getting (see below):

FP8 is slower than FP16.

For FP16, multiples of 16 make things slower than multiples of 8.

Am I missing something?

Batch_size_multiple 16 // Seqlen multiple 16

FP8 (adam)
[2023-05-17 22:20:28,534 INFO] Step 100/300000; acc: 16.1; ppl: 6038.0; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 14043/16656 tok/s; 61 sec;
[2023-05-17 22:21:06,060 INFO] Step 200/300000; acc: 20.6; ppl: 1059.6; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 23063/27297 tok/s; 99 sec;
[2023-05-17 22:21:43,862 INFO] Step 300/300000; acc: 25.3; ppl: 466.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 23082/27262 tok/s; 136 sec;
[2023-05-17 22:22:21,180 INFO] Step 400/300000; acc: 27.6; ppl: 315.5; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 22912/27074 tok/s; 174 sec;
[2023-05-17 22:22:58,740 INFO] Step 500/300000; acc: 30.4; ppl: 236.7; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 22880/27001 tok/s; 211 sec;

FP16 (adam)
[2023-05-17 22:24:39,883 INFO] Step 100/300000; acc: 16.2; ppl: 6127.8; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 18771/22265 tok/s; 46 sec;
[2023-05-17 22:25:04,966 INFO] Step 200/300000; acc: 20.6; ppl: 1061.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 34504/40838 tok/s; 71 sec;
[2023-05-17 22:25:30,067 INFO] Step 300/300000; acc: 25.3; ppl: 467.8; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 34760/41057 tok/s; 96 sec;
[2023-05-17 22:25:55,069 INFO] Step 400/300000; acc: 27.4; ppl: 320.1; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 34199/40411 tok/s; 121 sec;
[2023-05-17 22:26:19,589 INFO] Step 500/300000; acc: 30.1; ppl: 241.5; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 35048/41359 tok/s; 145 sec;

FP16 (fusedadam)
[2023-05-17 22:28:29,266 INFO] Step 100/300000; acc: 16.1; ppl: 6160.6; xent: 8.7; lr: 0.00002; sents: 31328; bsz: 2145/2545/78; 20312/24092 tok/s; 42 sec;
[2023-05-17 22:28:49,956 INFO] Step 200/300000; acc: 20.6; ppl: 1063.8; xent: 7.0; lr: 0.00005; sents: 26736; bsz: 2164/2561/67; 41830/49509 tok/s; 63 sec;
[2023-05-17 22:29:11,128 INFO] Step 300/300000; acc: 25.3; ppl: 468.3; xent: 6.1; lr: 0.00007; sents: 27760; bsz: 2181/2576/69; 41213/48678 tok/s; 84 sec;
[2023-05-17 22:29:32,063 INFO] Step 400/300000; acc: 27.4; ppl: 320.2; xent: 5.8; lr: 0.00010; sents: 24400; bsz: 2138/2526/61; 40842/48260 tok/s; 105 sec;
[2023-05-17 22:29:52,720 INFO] Step 500/300000; acc: 30.2; ppl: 241.3; xent: 5.5; lr: 0.00012; sents: 26688; bsz: 2148/2535/67; 41603/49095 tok/s; 126 sec;

Batch_size_multiple 8 // Seqlen multiple 8

FP16 (Fusedadam)
[2023-05-17 22:32:08,412 INFO] Step 100/300000; acc: 16.0; ppl: 6256.0; xent: 8.7; lr: 0.00002; sents: 34120; bsz: 2337/2766/85; 22346/26446 tok/s; 42 sec;
[2023-05-17 22:32:29,029 INFO] Step 200/300000; acc: 20.9; ppl: 1047.4; xent: 7.0; lr: 0.00005; sents: 31128; bsz: 2349/2772/78; 45571/53777 tok/s; 62 sec;
[2023-05-17 22:32:49,643 INFO] Step 300/300000; acc: 24.6; ppl: 482.1; xent: 6.2; lr: 0.00007; sents: 26808; bsz: 2346/2776/67; 45523/53867 tok/s; 83 sec;
[2023-05-17 22:33:10,198 INFO] Step 400/300000; acc: 27.0; ppl: 326.7; xent: 5.8; lr: 0.00010; sents: 28448; bsz: 2341/2771/71; 45563/53917 tok/s; 104 sec;
[2023-05-17 22:33:30,629 INFO] Step 500/300000; acc: 30.0; ppl: 242.5; xent: 5.5; lr: 0.00012; sents: 27072; bsz: 2338/2764/68; 45773/54123 tok/s; 124 sec;
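To compare the two precisions outside a full training loop, a standalone micro-benchmark of a single te.Linear in FP16 vs FP8 can already show whether the FP8 GEMMs are faster on a given GPU and shape. This is only an illustrative sketch (sizes, iteration counts, and the default delayed-scaling recipe are my assumptions, not the OpenNMT-py configuration above):

```python
import time
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def bench(use_fp8: bool, hidden: int = 4096, tokens: int = 8192, iters: int = 100) -> float:
    """Average time per forward+backward of one te.Linear, in seconds."""
    layer = te.Linear(hidden, hidden, bias=True, params_dtype=torch.float16)
    x = torch.randn(tokens, hidden, device="cuda", dtype=torch.float16, requires_grad=True)
    fp8_recipe = recipe.DelayedScaling()  # default delayed-scaling settings
    warmup = 10
    for i in range(iters + warmup):
        if i == warmup:
            torch.cuda.synchronize()
            start = time.perf_counter()
        with te.fp8_autocast(enabled=use_fp8, fp8_recipe=fp8_recipe):
            out = layer(x)
        out.sum().backward()  # backward is taken outside the autocast context
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"FP16: {bench(False) * 1e3:.2f} ms/iter")
print(f"FP8:  {bench(True) * 1e3:.2f} ms/iter")
```

At small hidden sizes the FP8 casting and amax bookkeeping overhead can dominate the GEMM itself, which is one common reason FP8 ends up slower than FP16 for smaller models.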

overvalidated commented 1 year ago

Same problem here. The only performance gain I got came from being able to use a bigger batch size, but implementation problems in Accelerate (the model conversion takes much more memory) prevent me from using it.
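For context, enabling FP8 through Accelerate is just a flag on the Accelerator; the conversion of the model's linear layers to Transformer Engine modules happens inside prepare(), which is presumably where the extra memory shows up. A rough sketch under that assumption (the toy model is a placeholder):

```python
import torch
from accelerate import Accelerator

# "fp8" mixed precision uses the Transformer Engine backend (Hopper/Ada only).
accelerator = Accelerator(mixed_precision="fp8")

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
)
optimizer = torch.optim.AdamW(model.parameters())

# prepare() converts eligible nn.Linear layers to te.Linear under the hood;
# that conversion step is where the reported memory overhead appears.
model, optimizer = accelerator.prepare(model, optimizer)
```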

AaronZLT commented 1 year ago

Hi vince62s, could you share your benchmark script so we can replicate the issue? :)

vince62s commented 1 year ago

Well, I don't know if you really want to check the code, but here is my branch with the FP8 changes: https://github.com/vince62s/OpenNMT-py/tree/fp8

The main thing happens here: https://github.com/vince62s/OpenNMT-py/blob/fp8/onmt/model_builder.py#L426-L427

If you want an example of a training script, see: https://github.com/vince62s/OpenNMT-py/blob/fp8/docs/source/examples/wmt17/Translation.md
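For readers who do not want to dig through the branch, the core of such an integration (per the issue title, te.Linear replacing nn.Linear layers, which is apparently what the linked model_builder.py lines do) is a recursive module swap. A generic sketch of that idea, not the actual OpenNMT-py code (the helper name is made up, and it assumes te.Linear exposes .weight/.bias parameters like nn.Linear):

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te

def replace_linear_with_te(module: nn.Module) -> None:
    """Recursively swap nn.Linear children for te.Linear, copying parameters.
    Hypothetical helper for illustration only."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            # FP8 GEMMs typically need in/out features divisible by 16; skip others.
            if child.in_features % 16 or child.out_features % 16:
                continue
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                params_dtype=child.weight.dtype,
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            replace_linear_with_te(child)
```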