PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (『飞桨』核心框架,深度学习&机器学习高性能单机、分布式训练和跨平台部署)
http://www.paddlepaddle.org/
Apache License 2.0
22.08k stars 5.55k forks source link

The ci error of QAT INT8 #21700

Closed juncaipeng closed 4 years ago

juncaipeng commented 4 years ago

Linux_Python27

http://ci.paddlepaddle.org/viewLog.html?buildId=243540&buildTypeId=Paddle_PrCiNight

[20:48:59]  [Step 1/1] 197/197 Test #829: test_qat_int8_vgg19_mkldnn ...................***Timeout 600.06 sec
[20:48:59]  [Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[20:48:59]  [Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,011-INFO: QAT FP32 & INT8 prediction run.
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,012-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG19_QAT/model
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,012-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,012-INFO: Batch size: 25
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,012-INFO: Batch number: 2
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,012-INFO: Accuracy drop threshold: 0.1.
[20:48:59]  [Step 1/1] 2019-12-11 20:39:04,012-INFO: --- QAT FP32 prediction start ---
[20:48:59]  [Step 1/1] 2019-12-11 20:39:49,275-INFO: batch 1, acc1: 0.9600, acc5: 0.9600, latency: 1499.4636 ms, fps: 0.67
[20:48:59]  [Step 1/1] 2019-12-11 20:40:26,624-INFO: batch 2, acc1: 0.7600, acc5: 0.8800, latency: 1444.2566 ms, fps: 0.69
[20:48:59]  [Step 1/1] 2019-12-11 20:40:27,213-INFO: Total inference run time: 75.52 s
[20:48:59]  [Step 1/1] 2019-12-11 20:40:28,437-INFO: --- QAT INT8 prediction start ---
[20:48:59]  [Step 1/1] 2019-12-11 20:46:51,508-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 14574.7378 ms, fps: 0.07
[20:48:59]  [Step 1/1] 
[20:48:59]  [Step 1/1] 
[20:48:59]  [Step 1/1] 99% tests passed, 1 tests failed out of 197
[20:48:59]  [Step 1/1] 
[20:48:59]  [Step 1/1] Total Test time (real) = 1655.46 sec
[20:48:59]  [Step 1/1] 
[20:48:59]  [Step 1/1] The following tests FAILED:
[20:48:59]  [Step 1/1]  829 - test_qat_int8_vgg19_mkldnn (Timeout)

http://ci.paddlepaddle.org/viewLog.html?buildId=243543&buildTypeId=Paddle_PrCiNight&tab=buildLog

[21:58:30]  [Step 1/1] 197/197 Test #828: test_qat_int8_vgg16_mkldnn ......................***Timeout 600.02 sec
[21:58:30]  [Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[21:58:30]  [Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,216-INFO: QAT FP32 & INT8 prediction run.
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,228-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG16_QAT/model
[21:58:30]  [Step 1/1] Errors while running CTest
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,229-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,229-INFO: Batch size: 25
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,229-INFO: Batch number: 2
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,229-INFO: Accuracy drop threshold: 0.1.
[21:58:30]  [Step 1/1] 2019-12-11 21:48:35,229-INFO: --- QAT FP32 prediction start ---
[21:58:30]  [Step 1/1] 2019-12-11 21:49:13,934-INFO: batch 1, acc1: 0.8800, acc5: 1.0000, latency: 1319.4594 ms, fps: 0.76
[21:58:30]  [Step 1/1] 2019-12-11 21:49:37,065-INFO: batch 2, acc1: 0.7600, acc5: 0.9200, latency: 895.3126 ms, fps: 1.12
[21:58:30]  [Step 1/1] 2019-12-11 21:49:37,399-INFO: Total inference run time: 56.59 s
[21:58:30]  [Step 1/1] 2019-12-11 21:49:37,584-INFO: --- QAT INT8 prediction start ---
[21:58:30]  [Step 1/1] 2019-12-11 21:53:52,716-INFO: batch 1, acc1: 0.8800, acc5: 1.0000, latency: 10008.3796 ms, fps: 0.10
[21:58:30]  [Step 1/1] 
[21:58:30]  [Step 1/1] 
[21:58:30]  [Step 1/1] 99% tests passed, 1 tests failed out of 197
[21:58:30]  [Step 1/1] 
[21:58:30]  [Step 1/1] Total Test time (real) = 1456.46 sec
[21:58:30]  [Step 1/1] 
[21:58:30]  [Step 1/1] The following tests FAILED:
[21:58:30]  [Step 1/1]  828 - test_qat_int8_vgg16_mkldnn (Timeout)

http://ci.paddlepaddle.org/viewLog.html?buildId=243542&buildTypeId=Paddle_PrCiNight

[21:11:23]  [Step 1/1] 196/197 Test #824: test_qat_int8_resnet101_mkldnn ................***Timeout 600.04 sec
[21:11:23]  [Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[21:11:23]  [Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,281-INFO: QAT FP32 & INT8 prediction run.
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,282-INFO: QAT model: /root/.cache/inference_demo/int8v2/ResNet101_QAT/model
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,282-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,282-INFO: Batch size: 25
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,282-INFO: Batch number: 2
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,282-INFO: Accuracy drop threshold: 0.1.
[21:11:23]  [Step 1/1] 2019-12-11 21:01:30,282-INFO: --- QAT FP32 prediction start ---
[21:11:23]  [Step 1/1] 2019-12-11 21:02:51,613-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 2907.4502 ms, fps: 0.34
[21:11:23]  [Step 1/1] 2019-12-11 21:03:59,060-INFO: batch 2, acc1: 0.7200, acc5: 0.9200, latency: 2649.2588 ms, fps: 0.38
[21:11:23]  [Step 1/1] 2019-12-11 21:04:01,369-INFO: Total inference run time: 142.58 s
[21:11:23]  [Step 1/1] 2019-12-11 21:04:02,185-INFO: --- QAT INT8 prediction start ---
[21:11:23]  [Step 1/1] 2019-12-11 21:09:53,220-INFO: batch 1, acc1: 0.8400, acc5: 1.0000, latency: 8828.3635 ms, fps: 0.11
[21:11:23]  [Step 1/1] 

http://ci.paddlepaddle.org/viewLog.html?buildId=243542&buildTypeId=Paddle_PrCiNight

[21:12:21]  [Step 1/1] 197/197 Test #828: test_qat_int8_vgg16_mkldnn ....................***Timeout 600.00 sec
[21:12:21]  [Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[21:12:21]  [Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,842-INFO: QAT FP32 & INT8 prediction run.
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,842-INFO: QAT model: /root/.cache/inference_demo/int8v2/VGG16_QAT/model
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,843-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,843-INFO: Batch size: 25
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,843-INFO: Batch number: 2
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,843-INFO: Accuracy drop threshold: 0.1.
[21:12:21]  [Step 1/1] 2019-12-11 21:02:30,867-INFO: --- QAT FP32 prediction start ---
[21:12:21]  [Step 1/1] 2019-12-11 21:04:00,514-INFO: batch 1, acc1: 0.8800, acc5: 1.0000, latency: 3297.3985 ms, fps: 0.30
[21:12:21]  [Step 1/1] 2019-12-11 21:04:58,442-INFO: batch 2, acc1: 0.7600, acc5: 0.9200, latency: 2206.2243 ms, fps: 0.45
[21:12:21]  [Step 1/1] 2019-12-11 21:05:00,775-INFO: Total inference run time: 142.84 s
[21:12:21]  [Step 1/1] 2019-12-11 21:05:06,707-INFO: --- QAT INT8 prediction start ---
[21:12:21]  [Step 1/1] 2019-12-11 21:10:52,863-INFO: batch 1, acc1: 0.8800, acc5: 1.0000, latency: 12618.6094 ms, fps: 0.08
[21:12:21]  [Step 1/1] 
[21:12:21]  [Step 1/1] 
[21:12:21]  [Step 1/1] 99% tests passed, 2 tests failed out of 197
[21:12:21]  [Step 1/1] 
[21:12:21]  [Step 1/1] Total Test time (real) = 2072.77 sec
[21:12:21]  [Step 1/1] 
[21:12:21]  [Step 1/1] The following tests FAILED:
[21:12:21]  [Step 1/1]  824 - test_qat_int8_resnet101_mkldnn (Timeout)
[21:12:21]  [Step 1/1]  828 - test_qat_int8_vgg16_mkldnn (Timeout)

PR_CI_Manylinux_Coverage_CPU

http://ci.paddlepaddle.org/viewLog.html?buildId=243523&buildTypeId=Paddle_PaddleManylinux_PrCiManylinuxCoverageCpu

[17:13:59]  [Step 1/1] 92/93 Test #783: test_qat_int8_resnet101_mkldnn ............***Failed   36.94 sec
[17:13:59]  [Step 1/1] WARNING: OMP_NUM_THREADS set to 4, not 1. The computation speed will not be optimized if you use data parallel. It will fail if this PaddlePaddle binary is compiled with OpenBlas since OpenBlas does not support multi-threads.
[17:13:59]  [Step 1/1] PLEASE USE OMP_NUM_THREADS WISELY.
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: QAT FP32 & INT8 prediction run.
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: QAT model: /root/.cache/inference_demo/int8v2/ResNet101_QAT/model
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: Dataset: /root/.cache/inference_demo/int8v2/data.bin
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: Batch size: 25
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: Batch number: 2
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: Accuracy drop threshold: 0.1.
[17:13:59]  [Step 1/1] 2019-12-11 17:13:24,458-INFO: --- QAT FP32 prediction start ---
[17:13:59]  [Step 1/1] 2019-12-11 17:13:42,194-INFO: batch 1, acc1: 0.9200, acc5: 0.9600, latency: 640.9658 ms, fps: 1.56
[17:13:59]  [Step 1/1] 2019-12-11 17:13:51,462-INFO: batch 2, acc1: 0.7200, acc5: 0.9200, latency: 352.2008 ms, fps: 2.84
[17:13:59]  [Step 1/1] 2019-12-11 17:13:51,716-INFO: Total inference run time: 25.62 s
[17:13:59]  [Step 1/1] 2019-12-11 17:13:51,792-INFO: --- QAT INT8 prediction start ---
[17:13:59]  [Step 1/1] Child killed
[17:14:08]  [Step 1/1] The following tests FAILED:
[17:14:08]  [Step 1/1]  783 - test_qat_int8_resnet101_mkldnn (Failed)
[17:14:09]  [Step 1/1] 90/93 Test #803: test_compressor ...............................***Failed    1.31 sec
[17:14:09]  [Step 1/1] Traceback (most recent call last):
[17:14:09]  [Step 1/1]   File "test_compressor.py", line 15, in <module>
[17:14:09]  [Step 1/1]     import paddle
[17:14:09]  [Step 1/1]   File "/paddle/build/python/paddle/__init__.py", line 30, in <module>
[17:14:09]  [Step 1/1]     import paddle.dataset
[17:14:09]  [Step 1/1]   File "/paddle/build/python/paddle/dataset/__init__.py", line 25, in <module>
[17:14:09]  [Step 1/1]     import paddle.dataset.sentiment
[17:14:09]  [Step 1/1]   File "/paddle/build/python/paddle/dataset/sentiment.py", line 29, in <module>
[17:14:09]  [Step 1/1]     import nltk
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/nltk/__init__.py", line 129, in <module>
[17:14:09]  [Step 1/1]     from nltk.collocations import *
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/nltk/collocations.py", line 40, in <module>
[17:14:09]  [Step 1/1]     from nltk.metrics import ContingencyMeasures, BigramAssocMeasures, TrigramAssocMeasures
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/nltk/metrics/__init__.py", line 16, in <module>
[17:14:09]  [Step 1/1]     from nltk.metrics.scores import (
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/nltk/metrics/scores.py", line 18, in <module>
[17:14:09]  [Step 1/1]     from scipy.stats.stats import betai
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scipy/stats/__init__.py", line 367, in <module>
[17:14:09]  [Step 1/1]     from .stats import *
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scipy/stats/stats.py", line 173, in <module>
[17:14:09]  [Step 1/1]     from . import distributions
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scipy/stats/distributions.py", line 10, in <module>
[17:14:09]  [Step 1/1]     from ._distn_infrastructure import (entropy, rv_discrete, rv_continuous,
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scipy/stats/_distn_infrastructure.py", line 16, in <module>
[17:14:09]  [Step 1/1]     from scipy.misc import doccer
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scipy/misc/__init__.py", line 100, in <module>
[17:14:09]  [Step 1/1]     from .pilutil import *
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scipy/misc/pilutil.py", line 19, in <module>
[17:14:09]  [Step 1/1]     from PIL import Image, ImageFilter
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/PIL/Image.py", line 151, in <module>
[17:14:09]  [Step 1/1]     from pathlib2 import Path
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/pathlib2/__init__.py", line 52, in <module>
[17:14:09]  [Step 1/1]     from scandir import scandir as os_scandir
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/site-packages/scandir.py", line 452, in <module>
[17:14:09]  [Step 1/1]     libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/ctypes/util.py", line 274, in find_library
[17:14:09]  [Step 1/1]     return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/ctypes/util.py", line 103, in _findLib_gcc
[17:14:09]  [Step 1/1]     stdout=subprocess.PIPE)
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/subprocess.py", line 394, in __init__
[17:14:09]  [Step 1/1]     errread, errwrite)
[17:14:09]  [Step 1/1]   File "/usr/local/python2.7.15/lib/python2.7/subprocess.py", line 938, in _execute_child
[17:14:09]  [Step 1/1]     self.pid = os.fork()
[17:14:09]  [Step 1/1] OSError: [Errno 12] Cannot allocate memory
luotao1 commented 4 years ago

Maybe timeout is the same problem in https://github.com/PaddlePaddle/Paddle/pull/20162#issuecomment-554262103

juncaipeng commented 4 years ago

The cpu information of the test machine obtained by lscpu:

flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush 
mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good 
nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch arat 
fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx 
smap avx512cd xsaveopt xsavec xgetbv1
bingyanghuang commented 4 years ago

@grygielski It seems that this machine doesn't have the right AVX512 flages needed for mkldnn 1.0 int8. could you help confirm this?

grygielski commented 4 years ago

It definitely doesn't have VNNI flags so best implementations won't be chosen. However, there are avx512 flags so it shouldn't timeout so easily. It can be a problem similar to earlier ones with not propagating these flags. Is it virtual machine? Verbose logs would help here. @juncaipeng Could you run one of these tests with MKLDNN_VERBOSE=2 env. variable? If you are using bash just type this before test command. For example: MKLDNN_VERBOSE=2 ctest -R ...