shhn1 opened this issue 1 year ago
Have you printed the results of a sample batch? If you can print the amax values here it would be easier to debug.
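In case it helps, here is a minimal sketch of how one could dump those amax values from a checkpoint's state dict. The key names below are illustrative assumptions, not the repo's actual keys; pytorch-quantization stores each `TensorQuantizer`'s calibration range in a buffer whose name ends in `_amax`:

```python
import torch

def collect_amax(state_dict):
    """Return and print all amax entries found in a state dict."""
    found = {n: t for n, t in state_dict.items() if n.endswith("_amax")}
    for name, tensor in found.items():
        print(name, tensor)
    return found

# Toy demo; a real checkpoint would come from torch.load(path)["model"].
sd = {
    "layers.0.blocks.0.attn.qkv._input_quantizer._amax": torch.tensor(2.5),
    "layers.0.blocks.0.attn.qkv.weight": torch.zeros(3, 3),
}
collect_amax(sd)
```

An un-calibrated checkpoint would print nothing here, which is itself a useful diagnostic.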
I use the code below to print the diff between `swin_transformer.forward` and `model.forward`:

```python
output_th = model(images)
swin_transformer_output = swin_transformer.forward(images_half)
output = model.head(swin_transformer_output)
diff = output - output_th
print(diff.mean(), diff.max(), diff.min())
```
I got the result (each line below is one batch's `diff.mean()`, `diff.max()`, `diff.min()`):

```
[2022-10-31 10:08:53 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 252): INFO Test: [0/391] Time 5.261 (5.261) Loss 8.4919 (8.4919) Acc@1 0.000 (0.000) Acc@5 0.000 (0.000) Mem 3664MB
tensor(0.0292, device='cuda:0', dtype=torch.float16) tensor(8.7344, device='cuda:0', dtype=torch.float16) tensor(-13.1328, device='cuda:0', dtype=torch.float16)
tensor(0.0300, device='cuda:0', dtype=torch.float16) tensor(7.8203, device='cuda:0', dtype=torch.float16) tensor(-14.0469, device='cuda:0', dtype=torch.float16)
tensor(0.0310, device='cuda:0', dtype=torch.float16) tensor(9.0938, device='cuda:0', dtype=torch.float16) tensor(-14.8438, device='cuda:0', dtype=torch.float16)
tensor(0.0255, device='cuda:0', dtype=torch.float16) tensor(7.3438, device='cuda:0', dtype=torch.float16) tensor(-13.4844, device='cuda:0', dtype=torch.float16)
tensor(0.0290, device='cuda:0', dtype=torch.float16) tensor(7.8672, device='cuda:0', dtype=torch.float16) tensor(-12.5625, device='cuda:0', dtype=torch.float16)
tensor(0.0300, device='cuda:0', dtype=torch.float16) tensor(7.7891, device='cuda:0', dtype=torch.float16) tensor(-13.5312, device='cuda:0', dtype=torch.float16)
tensor(0.0273, device='cuda:0', dtype=torch.float16) tensor(8.4297, device='cuda:0', dtype=torch.float16) tensor(-15.4844, device='cuda:0', dtype=torch.float16)
tensor(0.0293, device='cuda:0', dtype=torch.float16) tensor(8.5703, device='cuda:0', dtype=torch.float16) tensor(-12.8906, device='cuda:0', dtype=torch.float16)
tensor(0.0266, device='cuda:0', dtype=torch.float16) tensor(8.7031, device='cuda:0', dtype=torch.float16) tensor(-14.2031, device='cuda:0', dtype=torch.float16)
tensor(0.0313, device='cuda:0', dtype=torch.float16) tensor(8.4922, device='cuda:0', dtype=torch.float16) tensor(-12.8594, device='cuda:0', dtype=torch.float16)
[2022-10-31 10:08:58 swin_tiny_patch4_window7_224](infer_swintransformer_int8_op.py 252): INFO Test: [10/391] Time 0.501 (0.962) Loss 8.2914 (8.3750) Acc@1 0.781 (0.142) Acc@5 1.562 (0.568) Mem 3667MB
tensor(0.0269, device='cuda:0', dtype=torch.float16) tensor(7.9531, device='cuda:0', dtype=torch.float16) tensor(-12.8203, device='cuda:0', dtype=torch.float16)
tensor(0.0318, device='cuda:0', dtype=torch.float16) tensor(7.9805, device='cuda:0', dtype=torch.float16) tensor(-13.9531, device='cuda:0', dtype=torch.float16)
tensor(0.0268, device='cuda:0', dtype=torch.float16) tensor(8.4531, device='cuda:0', dtype=torch.float16) tensor(-13.4375, device='cuda:0', dtype=torch.float16)
tensor(0.0264, device='cuda:0', dtype=torch.float16) tensor(8.2188, device='cuda:0', dtype=torch.float16) tensor(-12.3594, device='cuda:0', dtype=torch.float16)
tensor(0.0306, device='cuda:0', dtype=torch.float16) tensor(8.1719, device='cuda:0', dtype=torch.float16) tensor(-15.0156, device='cuda:0', dtype=torch.float16)
tensor(0.0266, device='cuda:0', dtype=torch.float16) tensor(7.8047, device='cuda:0', dtype=torch.float16) tensor(-13.6328, device='cuda:0', dtype=torch.float16)
tensor(0.0259, device='cuda:0', dtype=torch.float16) tensor(8.2500, device='cuda:0', dtype=torch.float16) tensor(-14.1719, device='cuda:0', dtype=torch.float16)
tensor(0.0296, device='cuda:0', dtype=torch.float16) tensor(7.6562, device='cuda:0', dtype=torch.float16) tensor(-13.8281, device='cuda:0', dtype=torch.float16)
tensor(0.0284, device='cuda:0', dtype=torch.float16) tensor(8.5234, device='cuda:0', dtype=torch.float16) tensor(-13.1562, device='cuda:0', dtype=torch.float16)
tensor(0.0309, device='cuda:0', dtype=torch.float16) tensor(8.4375, device='cuda:0', dtype=torch.float16) tensor(-13.7266, device='cuda:0', dtype=torch.float16)
```
I suggest you first verify the correctness with random data, i.e., `sh run_test_int8.sh 32`.
I noticed that you are using the un-calibrated checkpoint (`--resume swin_tiny_patch4_window7_224.pth`), which contains no amax values. The correct way is to calibrate the checkpoint as `calib.sh` does, and then run the inference/accuracy test by resuming weights from the calibrated checkpoint.
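A quick guard before running the accuracy test could catch this mistake early. The sketch below assumes the checkpoint stores its weights under a `"model"` key (falling back to a bare state dict); `load_checked` is a hypothetical helper, not part of the repo:

```python
import torch

def load_checked(path):
    """Load a checkpoint and refuse it if it carries no calibration data."""
    ckpt = torch.load(path, map_location="cpu")
    state = ckpt.get("model", ckpt)  # fall back to a bare state dict
    # A calibrated checkpoint carries _amax buffers for its quantizers.
    if not any(k.endswith("_amax") for k in state):
        raise ValueError(f"{path} has no amax values; run calib.sh first")
    return state
```

Resuming from a checkpoint that passes this check rules out the "no amax values" failure mode before any GPU time is spent.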
Thank you for your reply! I followed `calib.sh` and got the checkpoint swin_tiny_patch4_window7_224_calib.pth. I adopted your suggestion to verify the correctness with random data first, and got this result:
```
=> merge config from Swin-Transformer-Quantization/SwinTransformer/configs/swin/mmdet_swin_patch4_window7.yaml
[warning] Apex amp has been deprecated, please use pytorch amp instead!
[2022-11-02 05:07:51 swin_tiny_patch4_window7_224_calib](infer_swintransformer_int8_op.py 455): INFO Full config saved to output/swin_tiny_patch4_window7_224_calib/default/config.json
[2022-11-02 05:07:51 swin_tiny_patch4_window7_224_calib](infer_swintransformer_int8_op.py 118): INFO Creating model:swin/swin_tiny_patch4_window7_224_calib
/opt/conda/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:2156.)
INT8 op time : 6.718916893005371 ms
INT8_torch_output vs INT8_op_output , avg diff : [1.794074  1.7592216 1.7580295 1.7707056 1.7969115 1.7282423 1.7771956
 1.7795997 1.7475603 1.726699  1.7648703 1.795705  1.7609065 1.7568272
 1.7577996 1.7913203 1.7436801 1.7849294 1.7646389 1.7705957 1.7351171
 1.7205573 1.7654655 1.8362962 1.7718558 1.7757478 1.7524724 1.8025124
 1.7510034 1.7619454 1.774456  1.7900693] max diff : [7.0341797 7.4873047 7.951416  8.279541  8.451416  7.5842285 7.775635
 7.725586  7.9570923 7.2705994 7.6936035 6.859375  7.150635  7.8271484
 7.6953735 7.9748535 7.01178   8.201416  7.967041  7.38678   7.531311
 7.1904297 7.9748535 7.260742  7.329193  7.0771484 8.1154785 7.670166
 7.8487244 7.870117  8.331055  8.545898 ]
Traceback (most recent call last):
  File "infer_swintransformer_int8_op.py", line 460, in
```
It seems that the difference between the results of INT8_torch_output and INT8_op_output is still large.
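For what it's worth, a small plain-PyTorch helper (not from the repo; the function name is mine) that summarizes such output diffs in one place:

```python
import torch

def diff_stats(ref, out):
    """Return mean-abs, max, and min of the elementwise difference."""
    diff = (out - ref).float()
    stats = {
        "mean_abs": diff.abs().mean().item(),
        "max": diff.max().item(),
        "min": diff.min().item(),
    }
    print(stats)
    return stats
```

With a correct INT8 kernel, one would expect `mean_abs` well below the ~1.7 reported above; values of that magnitude point to a systematic mismatch rather than quantization noise.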
The device V100 is not supported, since the INT8 fused MHA kernel only supports sm75/sm80/sm86 devices; you can try it on T4/A10/A30/A40/A100/RTX 3090, etc. You can check the compute capability here; we require at least sm >= 7.5 to run this.
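The compute capability can be checked directly from PyTorch; a minimal sketch (the function name and the >= 7.5 threshold check are mine, based on the requirement stated above):

```python
import torch

def int8_fused_mha_supported():
    """INT8 fused MHA needs at least sm75; V100 is sm70, so it fails this check."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (7, 5)

print(int8_fused_mha_supported())
```

Running this on a V100 returns False, matching the error above; on T4 (sm75) or A100 (sm80) it returns True.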
I will have a try on A10. Thank you!
Branch/Tag/Commit: main
Docker Image Version: nvcr.io/nvidia/pytorch:21.11-py3
GPU name: V100
CUDA Driver: cuda 11.5
Reproduced Steps: https://github.com/NVIDIA/FasterTransformer/issues/357#tasklist-block-659b5239-14b2-41ea-b692-935c90804a71