NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

When I use int8 Swin-Transformer, accuracy of resumed network on the 50000 test images is 0.1% #357

Open shhn1 opened 1 year ago

shhn1 commented 1 year ago

Branch/Tag/Commit

main

Docker Image Version

nvcr.io/nvidia/pytorch:21.11-py3

GPU name

V100

CUDA Driver

cuda 11.5

Reproduced Steps

1. cd $WORKSPACE/examples/pytorch/swin/Swin-Transformer-Quantization
2. wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_tiny_patch4_window7_224.pth
3. cd $WORKSPACE/examples/pytorch/swin/Swin-Transformer-Quantization
4. python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --calib --cfg SwinTransformer/configs/swin/swin_tiny_patch4_window7_224.yaml --resume swin_tiny_patch4_window7_224.pth --data-path imagenet --num-calib-batch 10 --calib-batchsz 8 --int8-mode 1 --calib-output-path calib-checkpoint
# dataset: ILSVRC2012_img_val
5. bash -x run_test_int8_accuracy.sh 128

After the last command, I got these results:
[2022-10-31 09:44:33 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 254): INFO Test: [0/391]   Time 8.862 (8.862)  Loss 9.5285 (9.5285)    Acc@1 0.000 (0.000) Acc@5 1.562 (1.562) Mem 1444MB
……
[2022-10-31 09:46:49 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 254): INFO Test: [390/391]  Time 0.041 (0.369)  Loss 7.9667 (9.5591)    Acc@1 1.250 (0.100) Acc@5 2.500 (0.464) Mem 1446MB
[2022-10-31 09:46:49 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 261): INFO  * Acc@1 0.100 Acc@5 0.464
[2022-10-31 09:46:49 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 181): INFO Accuracy of resumed network on the 50000 test images: 0.1%


Njuapp commented 1 year ago

Have you printed the results of a sample batch? If you can print the amax values here, it would be easier to debug.
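One way to surface the amax values the maintainer asks about is to filter them out of the checkpoint's state dict; pytorch-quantization stores them under keys ending in `_amax`. This is a minimal sketch of that filtering (not code from this thread) using a toy stand-in dict; with a real checkpoint you would load it first, e.g. `sd = torch.load("swin_tiny_patch4_window7_224_calib.pth")["model"]`.

```python
# Hedged sketch: collect the quantizer amax entries a calibrated checkpoint
# should contain. The key names below are illustrative, not taken from a
# real Swin checkpoint.
def find_amax_entries(state_dict):
    """Return only the '_amax' entries recorded by calibration."""
    return {k: v for k, v in state_dict.items() if k.endswith("_amax")}

# Toy stand-in for a state_dict, to illustrate the filtering only.
toy_sd = {
    "layers.0.blocks.0.attn.qkv._input_quantizer._amax": 2.53,
    "layers.0.blocks.0.attn.qkv.weight": "tensor(...)",
}
print(find_amax_entries(toy_sd))  # an un-calibrated checkpoint would yield {}
```

If this returns an empty dict for the checkpoint being resumed, calibration never ran on it, which matches the failure mode discussed below.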

shhn1 commented 1 year ago

> Have you printed the results of a sample batch? If you can print the amax values here, it would be easier to debug.

I used the code below to print the diff between swin_transformer.forward and model.forward:

```python
output_th = model(images)
swin_transformer_output = swin_transformer.forward(images_half)
output = model.head(swin_transformer_output)
diff = output - output_th
print(diff.mean(), diff.max(), diff.min())
```

I got the result (one `mean / max / min` triple printed per batch):

```
[2022-10-31 10:08:53 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 252): INFO Test: [0/391]  Time 5.261 (5.261)  Loss 8.4919 (8.4919)  Acc@1 0.000 (0.000)  Acc@5 0.000 (0.000)  Mem 3664MB
tensor(0.0292, device='cuda:0', dtype=torch.float16) tensor(8.7344, device='cuda:0', dtype=torch.float16) tensor(-13.1328, device='cuda:0', dtype=torch.float16)
tensor(0.0300, device='cuda:0', dtype=torch.float16) tensor(7.8203, device='cuda:0', dtype=torch.float16) tensor(-14.0469, device='cuda:0', dtype=torch.float16)
tensor(0.0310, device='cuda:0', dtype=torch.float16) tensor(9.0938, device='cuda:0', dtype=torch.float16) tensor(-14.8438, device='cuda:0', dtype=torch.float16)
tensor(0.0255, device='cuda:0', dtype=torch.float16) tensor(7.3438, device='cuda:0', dtype=torch.float16) tensor(-13.4844, device='cuda:0', dtype=torch.float16)
tensor(0.0290, device='cuda:0', dtype=torch.float16) tensor(7.8672, device='cuda:0', dtype=torch.float16) tensor(-12.5625, device='cuda:0', dtype=torch.float16)
tensor(0.0300, device='cuda:0', dtype=torch.float16) tensor(7.7891, device='cuda:0', dtype=torch.float16) tensor(-13.5312, device='cuda:0', dtype=torch.float16)
tensor(0.0273, device='cuda:0', dtype=torch.float16) tensor(8.4297, device='cuda:0', dtype=torch.float16) tensor(-15.4844, device='cuda:0', dtype=torch.float16)
tensor(0.0293, device='cuda:0', dtype=torch.float16) tensor(8.5703, device='cuda:0', dtype=torch.float16) tensor(-12.8906, device='cuda:0', dtype=torch.float16)
tensor(0.0266, device='cuda:0', dtype=torch.float16) tensor(8.7031, device='cuda:0', dtype=torch.float16) tensor(-14.2031, device='cuda:0', dtype=torch.float16)
tensor(0.0313, device='cuda:0', dtype=torch.float16) tensor(8.4922, device='cuda:0', dtype=torch.float16) tensor(-12.8594, device='cuda:0', dtype=torch.float16)
[2022-10-31 10:08:58 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 252): INFO Test: [10/391]  Time 0.501 (0.962)  Loss 8.2914 (8.3750)  Acc@1 0.781 (0.142)  Acc@5 1.562 (0.568)  Mem 3667MB
tensor(0.0269, device='cuda:0', dtype=torch.float16) tensor(7.9531, device='cuda:0', dtype=torch.float16) tensor(-12.8203, device='cuda:0', dtype=torch.float16)
tensor(0.0318, device='cuda:0', dtype=torch.float16) tensor(7.9805, device='cuda:0', dtype=torch.float16) tensor(-13.9531, device='cuda:0', dtype=torch.float16)
tensor(0.0268, device='cuda:0', dtype=torch.float16) tensor(8.4531, device='cuda:0', dtype=torch.float16) tensor(-13.4375, device='cuda:0', dtype=torch.float16)
tensor(0.0264, device='cuda:0', dtype=torch.float16) tensor(8.2188, device='cuda:0', dtype=torch.float16) tensor(-12.3594, device='cuda:0', dtype=torch.float16)
tensor(0.0306, device='cuda:0', dtype=torch.float16) tensor(8.1719, device='cuda:0', dtype=torch.float16) tensor(-15.0156, device='cuda:0', dtype=torch.float16)
tensor(0.0266, device='cuda:0', dtype=torch.float16) tensor(7.8047, device='cuda:0', dtype=torch.float16) tensor(-13.6328, device='cuda:0', dtype=torch.float16)
tensor(0.0259, device='cuda:0', dtype=torch.float16) tensor(8.2500, device='cuda:0', dtype=torch.float16) tensor(-14.1719, device='cuda:0', dtype=torch.float16)
tensor(0.0296, device='cuda:0', dtype=torch.float16) tensor(7.6562, device='cuda:0', dtype=torch.float16) tensor(-13.8281, device='cuda:0', dtype=torch.float16)
tensor(0.0284, device='cuda:0', dtype=torch.float16) tensor(8.5234, device='cuda:0', dtype=torch.float16) tensor(-13.1562, device='cuda:0', dtype=torch.float16)
tensor(0.0309, device='cuda:0', dtype=torch.float16) tensor(8.4375, device='cuda:0', dtype=torch.float16) tensor(-13.7266, device='cuda:0', dtype=torch.float16)
```

Njuapp commented 1 year ago

I suggest you first verify correctness with random data, i.e., `sh run_test_int8.sh 32`.
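The idea behind the random-data check is to run both implementations on the same random batch and compare elementwise differences against the script's threshold (the later traceback shows `assert diff.mean() < 0.04`). This is a self-contained sketch of that comparison; `ref` and `op` are stdlib stand-ins for the PyTorch reference output and the custom-op output, which in the real test come from the model.

```python
import random

# Hedged sketch of the random-data sanity check, with stand-in outputs
# instead of real model tensors.
random.seed(0)
ref = [random.gauss(0.0, 1.0) for _ in range(1000)]      # PyTorch reference (stand-in)
op = [x + random.gauss(0.0, 0.01) for x in ref]          # custom-op output (stand-in)

diffs = [abs(a - b) for a, b in zip(op, ref)]
mean_diff, max_diff = sum(diffs) / len(diffs), max(diffs)
print(mean_diff, max_diff)

# The project's test asserts the mean diff is below 0.04; a healthy int8 op
# stays well under that, while the failing run below reports ~1.7.
assert mean_diff < 0.04, "[ERROR] SWIN INT8 Op TEST FAIL !"
```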

Njuapp commented 1 year ago

I noticed that you are using the un-calibrated checkpoint (`--resume swin_tiny_patch4_window7_224.pth`), which contains no amax values. The correct way is to calibrate the checkpoint as calib.sh does, and then run the inference/accuracy test resuming weights from the calibrated one.
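This mistake can be caught early with a small guard before int8 inference: refuse to proceed if the resumed state dict carries no amax entries. A minimal sketch, assuming the pytorch-quantization convention that calibration scales live under keys ending in `_amax` (the dicts below are illustrative stand-ins, not real checkpoints):

```python
# Hedged sketch: fail fast when an un-calibrated checkpoint is resumed
# for int8 inference.
def assert_calibrated(state_dict):
    if not any(k.endswith("_amax") for k in state_dict):
        raise ValueError("no amax values found; run calibration (calib.sh) "
                         "and resume the *_calib.pth checkpoint instead")

raw = {"head.weight": "tensor(...)"}                       # un-calibrated: no amax keys
calib = dict(raw, **{"head._input_quantizer._amax": 4.1})  # calibrated stand-in

assert_calibrated(calib)   # passes silently
try:
    assert_calibrated(raw)
except ValueError as e:
    print("caught:", e)
```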

shhn1 commented 1 year ago

> I noticed that you are using the un-calibrated checkpoint (`--resume swin_tiny_patch4_window7_224.pth`), which contains no amax values. The correct way is to calibrate the checkpoint as calib.sh does, and then run the inference/accuracy test resuming weights from the calibrated one.

Thank you for your reply! I followed the calib.sh procedure and got the checkpoint swin_tiny_patch4_window7_224_calib.pth. I then took your suggestion to verify correctness with random data first, and got this result:

```
=> merge config from Swin-Transformer-Quantization/SwinTransformer/configs/swin/mmdet_swin_patch4_window7.yaml
[warning] Apex amp has been deprecated, please use pytorch amp instead!
[2022-11-02 05:07:51 swin_tiny_patch4_window7_224_calib](infer_swintransformer_int8_op.py 455): INFO Full config saved to output/swin_tiny_patch4_window7_224_calib/default/config.json
[2022-11-02 05:07:51 swin_tiny_patch4_window7_224_calib](infer_swintransformer_int8_op.py 118): INFO Creating model:swin/swin_tiny_patch4_window7_224_calib
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
INT8 op time : 6.718916893005371 ms
INT8_torch_output vs INT8_op_output , avg diff : [1.794074  1.7592216 1.7580295 1.7707056 1.7969115 1.7282423 1.7771956 1.7795997 1.7475603 1.726699  1.7648703 1.795705  1.7609065 1.7568272 1.7577996 1.7913203 1.7436801 1.7849294 1.7646389 1.7705957 1.7351171 1.7205573 1.7654655 1.8362962 1.7718558 1.7757478 1.7524724 1.8025124 1.7510034 1.7619454 1.774456  1.7900693]
max diff : [7.0341797 7.4873047 7.951416  8.279541  8.451416  7.5842285 7.775635  7.725586  7.9570923 7.2705994 7.6936035 6.859375  7.150635  7.8271484 7.6953735 7.9748535 7.01178   8.201416  7.967041  7.38678   7.531311  7.1904297 7.9748535 7.260742  7.329193  7.0771484 8.1154785 7.670166  7.8487244 7.870117  8.331055  8.545898 ]
Traceback (most recent call last):
  File "infer_swintransformer_int8_op.py", line 460, in <module>
    main(args, config)
  File "infer_swintransformer_int8_op.py", line 170, in main
    validate_with_random_data(config, args, model_without_ddp)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "infer_swintransformer_int8_op.py", line 387, in validate_with_random_data
    assert diff.mean() < 0.04, "[ERROR] SWIN INT8 Op TEST FAIL !"
AssertionError: [ERROR] SWIN INT8 Op TEST FAIL !
```

It seems that the difference between the results of INT8_torch_output and INT8_op_output is still large.

Njuapp commented 1 year ago

The V100 is not supported: the int8 fused MHA kernel only supports sm75/80/86 devices. You can try it on T4/A10/A30/A40/A100/RTX 3090, etc.

Njuapp commented 1 year ago

You can check the compute capability here; we require at least sm >= 7.5 to run this.
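This requirement can be checked programmatically. A minimal sketch of the capability test as a pure function (on a live system you could feed it the `(major, minor)` pair from `torch.cuda.get_device_capability()`):

```python
# Hedged sketch: the int8 fused MHA path needs compute capability >= sm75,
# per the maintainer's comment above.
def supports_int8_fused_mha(major, minor):
    """True when the device's SM version is at least 7.5."""
    return (major, minor) >= (7, 5)

print(supports_int8_fused_mha(7, 0))  # V100 (sm70) -> False
print(supports_int8_fused_mha(7, 5))  # T4   (sm75) -> True
print(supports_int8_fused_mha(8, 0))  # A100 (sm80) -> True
```

Tuple comparison handles the version ordering correctly here, e.g. sm86 compares greater than sm75.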

shhn1 commented 1 year ago

I will have a try on A10. Thank you!